Job description
TL;DR
Senior Software Engineer (AI Runtime): Build and scale a managed GPU training platform for large-scale AI model training and fine-tuning, with an emphasis on multi-node orchestration and distributed parallelism. The role focuses on optimizing training throughput, keeping the system resilient through failure detection, and maximizing GPU utilization across diverse hardware.
Location: Must be based in Mountain View or San Francisco, California
Salary: $160,000 to $225,000 USD
Company
Databricks is a data and AI company providing a Data Intelligence Platform used by over 10,000 organizations to unify data, analytics, and AI.
What you will do
- Drive the architecture and evolution of the AI Runtime (AIR) managed GPU training platform for fleets of thousands of accelerators.
- Solve complex challenges in multi-node orchestration, distributed parallelism strategies, and GPU scheduling.
- Optimize GPU efficiency, increasing model FLOPs utilization and overall end-to-end throughput.
- Develop resilience and observability foundations to detect and recover from hardware and software failures automatically.
- Collaborate with product and research teams to design the APIs, CLI, and developer experience for production training jobs.
- Mentor other engineers and lead end-to-end engineering efforts from design to production rollout.
Requirements
- 5+ years of experience building large-scale distributed systems, GPU training infrastructure, or ML systems.
- Proficiency with distributed training frameworks such as PyTorch, FSDP, DeepSpeed, or Megatron.
- Deep understanding of GPU performance, including accelerator architecture, NVLink, InfiniBand, or RoCE.
- Experience operating managed multi-tenant cloud platform products with strict SLAs and SLOs.
- Strong foundation in algorithms, data structures, and performance-sensitive system design.
- BS in Computer Science or a related field (MS or PhD preferred).
Culture & Benefits
- Comprehensive benefits and perks tailored to the region.
- Opportunity to work on frontier-scale foundation models and cutting-edge AI infrastructure.
- Collaborative environment partnering across product, research, and platform teams.
- Commitment to diversity, inclusion, and equal employment opportunity.