Staff HPC Systems Software Engineer (AI)
Мэтч & Сопровод
Для мэтча с этой вакансией нужен Plus
Описание вакансии
TL;DR
Staff HPC Systems Software Engineer (Go/Python): Defining the technical direction and evolution of a Slurm-based HPC platform for a GPU cloud with an accent on cloud-native integration and automation. Focus on designing scalable cluster lifecycle models, optimizing GPU scheduling, and ensuring high-performance networking using InfiniBand and RDMA.
Location: Must be based in the US
Salary: $225,000 - $275,000 USD
Company
is a high-performance GPU cloud provider engineered specifically to support AI startups and large enterprise customers.
What you will do
- Own and evolve the technical direction for HPC systems domains, including Slurm platform architecture and scheduler integrations.
- Define architectural patterns for packaging, automating, and exposing Slurm implementations as a service.
- Drive cross-team design for integrations between Slurm, Kubernetes-adjacent systems, and infrastructure APIs.
- Establish shared standards for automation, observability, reliability, and supportability across the HPC platform.
- Lead technically critical initiatives spanning multiple engineering teams to reduce technical divergence.
- Contribute hands-on code to de-risk critical work and provide technical escalation for complex domain issues.
Requirements
- Extensive experience designing production software for HPC systems and Slurm-based environments.
- Strong proficiency in Go, Python, or similar languages for building resilient and testable software.
- Deep practical understanding of GPU-backed infrastructure and HPC networking (InfiniBand, RoCE, RDMA).
- Proven ability to define technical direction across multiple teams or complex system domains.
- Experience integrating HPC systems with cloud-native platforms and service delivery models.
- Must be based in the United States
Nice to have
- Experience with Kueue or other batch system schedulers.
Culture & Benefits
- Highly competitive US compensation package including base salary, bonus, and equity.
- Performance reviews conducted every 12 months with a dynamic progression plan.
- Human-First Flexibility policy providing autonomy to shape your own workday.
- Comprehensive benefits package including medical, dental, vision, and retirement plan participation.
- Opportunity to directly shape how global AI capacity is planned and deployed.
Будьте осторожны: если работодатель просит войти в их систему, используя iCloud/Google, прислать код/пароль, запустить код/ПО, не делайте этого - это мошенники. Обязательно жмите "Пожаловаться" или пишите в поддержку. Подробнее в гайде →