Lead Engineer - HPC Operations
Job description
TL;DR
Lead Engineer - HPC Operations (AI/ML): Oversee daily operations and support of high-performance computing clusters powering large-scale AI and ML workloads, with an emphasis on infrastructure optimization, automation, and resource utilization. Focus on managing Slurm and Kubernetes environments, resolving incidents, tuning GPU performance, and leading root cause analysis.
Location: United States
Salary: US$133,200 to US$199,800 per year
Company
The company is a leader in AI-powered cloud and digital infrastructure, enabling sovereign AI solutions for regulated sectors globally.
What you will do
- Oversee operational management of HPC infrastructure including compute, storage, networking, and schedulers like Slurm and Kubernetes.
- Optimize system efficiency, performance, and resource utilization while minimizing downtime.
- Serve as the escalation point for L2 support, conduct root cause analysis, and drive improvements.
- Monitor system health using Prometheus, Grafana, and DCGM.
- Manage AI/ML user environments with Docker, Kubernetes, MLflow, and Kubeflow.
- Define job scheduling policies, mentor junior engineers, and participate in the on-call rotation.
Requirements
- Bachelor’s or Master’s in Computer Science, Engineering, or related field.
- 8+ years in HPC operations, systems engineering, or DevOps; 2+ years in leadership.
- Expertise in HPC environments, Slurm/Kubernetes for AI/ML, GPU management, and performance tuning.
- Proficiency in Prometheus, Grafana, DCGM, Python, Bash, Ansible, Terraform.
- Strong Linux, networking (RDMA, InfiniBand, RoCE), and storage (NFS, Lustre, Ceph) knowledge.
Culture & Benefits
- Diverse team of 1,100+ from 68 nationalities in an inclusive, innovative environment.
- Values of Grit, Passion, and Impact that foster trust, accountability, and high performance.
- Bonus and benefits on top of base salary; equal opportunity employer with ADA accommodations.