Distributed Training Engineer (AI)
Описание вакансии
TL;DR
Distributed Training Engineer (AI): Develop and optimize large-scale distributed LLM training systems for scientific research, with an emphasis on distributed training frameworks and high-throughput GPU cluster performance. Focus on debugging complex training workflows, contributing to open-source frameworks, and supporting frontier-scale experiments in a high-impact lab environment.
Location: Based in Menlo Park, California, or remote within the United States.
Company
An AI and physical sciences lab building state-of-the-art models to accelerate novel scientific discoveries.
What you will do
- Optimize, operate, and develop large-scale distributed LLM training systems.
- Collaborate with researchers to bring up, debug, and maintain training and reinforcement learning workflows.
- Build tools to support frontier-scale experiments in physics and materials science.
- Contribute to open-source large-scale LLM training frameworks.
- Maintain system performance for massive-scale model development.
Requirements
- Experience training models on clusters with 5,000 or more GPUs.
- Proficiency with 5D parallel LLM training.
- Expertise in distributed training frameworks like Megatron-LM, FSDP, DeepSpeed, or TorchTitan.
- Ability to optimize training throughput for large-scale Mixture-of-Experts models.
- Must be based in the United States.
Culture & Benefits
- Work in a well-funded, rapidly growing lab environment.
- Ownership-based culture with minimal bureaucracy.
- Opportunities to learn new tools at the intersection of AI and physical sciences.
- Direct contribution to groundbreaking scientific research.