Software Engineer (ML Infrastructure)
Job description
TL;DR
Software Engineer (ML Infrastructure): Architecting and scaling a high-performance training platform for large-scale GPU clusters, with an emphasis on multi-tenant orchestration and scheduling primitives. Focus on optimizing GPU utilization, ensuring system reliability for multi-thousand-GPU workloads, and integrating CNCF ecosystem tools.
Location: Must be based in the United States (inferred from the US Department of Labor compliance and commuter benefits).
Company
develops reliable AI systems and high-quality data technologies that power the world's leading models for enterprises and governments.
What you will do
- Architect and scale a multi-tenant orchestration layer for GPU clusters to ensure high utilization and seamless job recovery.
- Design and implement scheduling primitives to optimize the lifecycle of training jobs.
- Build deep observability and automated health checking into the training stack to isolate hardware failures.
- Evaluate and integrate emerging technologies in the CNCF and AI ecosystem, such as Ray and Kueue.
- Collaborate with Finance and Procurement teams to drive the capacity planning process.
- Own projects end-to-end, from requirements and scoping to implementation.
Requirements
- 5+ years of experience in backend or infrastructure engineering.
- At least 2 years of experience orchestrating ML workloads at scale (100+ GPU nodes).
- Strong programming skills in Python, Go, Rust, or C++.
- Expert-level knowledge of Kubernetes internals, including Custom Resources, Operators, and Admission Controllers.
- Experience with distributed training infrastructure (EFA, InfiniBand) and distributed storage (Lustre, S3).
- Must have valid work authorization for the United States.
Nice to have
- Experience with distributed training techniques such as DeepSpeed or FSDP.
- Experience with the NVIDIA software and hardware stack (CUDA, NCCL).
- Experience with PyTorch.
- Familiarity with reinforcement learning and post-training algorithms such as GRPO.
Culture & Benefits
- Comprehensive health, dental, and vision coverage.
- Retirement benefits and a learning and development stipend.
- Generous PTO and a potential commuter stipend.
- Inclusive and equal opportunity workplace committed to diversity.