Distributed Training Engineer (AI)
Описание вакансии
TL;DR
Distributed Training Engineer (AI): Develop and optimize large-scale distributed LLM training systems for scientific research, with an emphasis on distributed training frameworks and high-throughput GPU cluster performance. Focus on debugging complex training workflows, contributing to open-source frameworks, and supporting frontier-scale experiments in a high-impact lab environment.
Location: Based in Menlo Park, California, or remote within the United States.
Company
An AI and physical sciences lab building state-of-the-art models to accelerate novel scientific discoveries.
What you will do
- Optimize, operate, and develop large-scale distributed LLM training systems.
- Collaborate with researchers to bring up, debug, and maintain training and reinforcement learning workflows.
- Build tools to support frontier-scale experiments in physics and materials science.
- Contribute to open-source large-scale LLM training frameworks.
- Maintain system performance for massive-scale model development.
Requirements
- Experience training models on clusters with 5,000 or more GPUs.
- Proficiency with 5D parallel LLM training.
- Expertise in distributed training frameworks like Megatron-LM, FSDP, DeepSpeed, or TorchTitan.
- Ability to optimize training throughput for large-scale Mixture-of-Experts models.
- Must be based in the United States.
Culture & Benefits
- Work in a well-funded, rapidly growing lab environment.
- Ownership-based culture with minimal bureaucracy.
- Opportunities to learn new tools at the intersection of AI and physical sciences.
- Direct contribution to groundbreaking scientific research.