Staff Technical Program Manager (AI)
Job description
TL;DR
Staff Technical Program Manager (AI): Lead cross-functional programs for cluster orchestration and applied training within an AI/ML platform, with an emphasis on workload scheduling, reliability, and scalability. The role focuses on improving how AI training and evaluation workflows run across clusters and on integrating orchestration systems such as Slurm-on-Kubernetes and Kueue.
Location: Hybrid in Bellevue, WA. Remote may be considered for candidates based in the US. Must be a U.S. person (citizen, lawful permanent resident, etc.) for export control compliance.
Salary: $237,000 – $261,000
Company
The company is a specialized cloud provider (The Essential Cloud for AI™) delivering high-performance infrastructure for AI labs and enterprises.
What you will do
- Drive end-to-end execution for cluster orchestration, including scheduling, self-service provisioning, and migration flows.
- Lead programs to optimize AI training, evaluation, and reinforcement learning workloads across clusters.
- Partner with engineering and product leaders to define roadmaps and improve cluster utilization, reliability, and scalability.
- Coordinate dependencies across platform engineering, infrastructure, and ecosystem partners for successful launches.
- Establish success metrics, dashboards, and operating cadences for cluster efficiency and workload performance.
- Align stakeholders and resolve technical tradeoffs on ambiguous, high-impact programs.
Requirements
- Bachelor’s degree in Computer Science, Engineering, or equivalent practical experience.
- 8+ years of TPM experience in cloud infrastructure, distributed systems, or AI/ML platforms.
- Technical fluency in Kubernetes, Slurm, and distributed systems.
- Experience leading large-scale cross-functional programs for ML platform capabilities.
- Must be a U.S. person for export control compliance.
Nice to have
- Experience with orchestration technologies such as Kueue, Ray, or similar systems.
- Familiarity with GPU infrastructure, cluster capacity planning, and multi-tenant execution.
- Experience with AI research tooling such as W&B or SkyPilot.
Culture & Benefits
- 100% company-paid medical, dental, and vision insurance.
- 401(k) with generous employer match and Employee Stock Purchase Program (ESPP).
- Flexible PTO and paid parental leave.
- Mental wellness benefits through Spring Health and family support via Carrot.
- Catered lunch daily in office and data center locations.