Staff Software Engineer (AI Runtime)
Job description
TL;DR
Staff Software Engineer (AI Runtime): Building and scaling the managed GPU training platform for large-scale AI models, with an emphasis on distributed training performance and fault tolerance. The role focuses on designing multi-node orchestration, optimizing GPU efficiency, and building resilience foundations for training frontier-scale foundation models.
Location: Mountain View, California or San Francisco, California
Salary: $190,000 to $265,000 USD
Company
A data and AI company that provides a Data Intelligence Platform unifying data, analytics, and AI for over 10,000 organizations worldwide.
What you will do
- Drive the architecture and evolution of the AI Runtime (AIR) managed GPU training platform for scalable, high-throughput training.
- Solve complex problems in multi-node orchestration, distributed parallelism strategies, and GPU scheduling.
- Optimize GPU efficiency and training performance to raise utilization and lower cost per training run.
- Build resilience and observability foundations to detect and recover from hardware and software failures.
- Partner with product and research teams to shape APIs, CLI, and the developer experience for production training jobs.
- Mentor senior engineers and champion engineering excellence to shape the long-term technical direction of AI training infrastructure.
Requirements
- 10+ years of experience building and operating large-scale distributed systems, GPU training infrastructure, or ML systems.
- Hands-on experience with distributed training frameworks such as PyTorch, FSDP, DeepSpeed, or Megatron.
- Deep understanding of parallelism strategies (data, tensor, pipeline, and sequence parallelism).
- Strong grasp of GPU performance fundamentals, including NVLink, InfiniBand, and collective communication.
- Experience building managed, multi-tenant cloud platform products with clear SLAs and SLOs.
- BS in Computer Science or a related field (MS or PhD preferred).
Culture & Benefits
- Comprehensive benefits and perks tailored to the employee's region.
- Opportunity to work on the most demanding workloads in computing, including frontier-scale foundation models.
- Collaborative environment partnering across product, research, and platform teams.
- Commitment to diversity, inclusion, and equal employment opportunity standards.