Staff Engineer (Kubernetes/AI)
Job description
TL;DR
Staff Engineer (Kubernetes/AI): Building and scaling a managed Kubernetes platform purpose-built for AI workloads, with an emphasis on bare-metal orchestration, GPU-aware scheduling, and high-performance networking. Focus on designing holistic infrastructure solutions that integrate compute, storage, and security to power next-generation AI training and inference at scale.
Location: Must be based in or able to commute to Bellevue, WA, San Francisco, CA, or San Jose, CA (4 days per week in-office)
Salary: $314,000 – $465,000
Company
The company is a leader in AI cloud infrastructure, providing high-performance GPU compute to researchers and enterprises to make superintelligence ubiquitous.
What you will do
- Drive the technical vision for a managed Kubernetes bare-metal platform, focusing on scalability, multi-tenancy, and lifecycle management.
- Integrate and extend NVIDIA's open-source ecosystem, including GPU Operator, DCGM, and topology-aware scheduling.
- Design and build higher-level platform services for inference, including autoscaling and multi-model deployment patterns.
- Collaborate across infrastructure teams to define networking (RDMA, InfiniBand) and storage requirements for AI workloads.
- Lead technical design sessions, mentor engineers, and establish best practices for distributed systems and Cloud Native engineering.
- Build self-healing systems and automation for incident response and platform resilience at scale.
Requirements
- 10+ years of experience in software engineering, platform engineering, or SRE, including 5+ years focused on Kubernetes at scale.
- Expert-level understanding of Kubernetes internals (API machinery, controllers, CNI, CSI).
- Strong software engineering skills in Go (required) and Python.
- Deep experience with GPU orchestration (NVIDIA GPU Operator, DCGM, MIG).
- Holistic infrastructure expertise spanning compute, networking, storage, and security.
- Must be able to work from the Bellevue, San Francisco, or San Jose office 4 days per week.
Nice to have
- Experience building managed Kubernetes services (GKE, EKS, AKS).
- Familiarity with HPC job schedulers like Slurm.
- Contributions to CNCF projects or NVIDIA open-source projects.
- Background in ML infrastructure, training clusters, or inference serving.
Culture & Benefits
- Generous cash and equity compensation packages.
- Comprehensive health, dental, and vision coverage for employees and dependents.
- 401(k) plan with 2% company match.
- Flexible paid time off policy.
- Opportunity to work with cutting-edge AI infrastructure and NVIDIA's latest technology.