Senior Engineer - HPC Operations (AI Infrastructure)
Job description
TL;DR
Senior Engineer - HPC Operations (AI/ML): Oversee daily operations and support of high-performance computing clusters that power large-scale AI and ML workloads, with an emphasis on stability, security, and performance. Focus on optimizing resource utilization, resolving incidents, and automating processes in globally distributed environments.
Location: United States
Salary: US$106,400 to US$159,600 per year
Company
The company is a leader in AI-powered cloud and digital infrastructure, enabling clients in regulated sectors to use sovereign AI infrastructure.
What you will do
- Lead operational support of HPC infrastructure including compute, storage, networking, and schedulers like Slurm and Kubernetes.
- Maximize efficiency and performance of HPC systems for optimal resource utilization and minimal downtime.
- Serve as primary escalation point for L2 support, resolving incidents and service requests.
- Monitor system health using tools like Prometheus, Grafana, and DCGM.
- Manage user environments for AI/ML workloads with container orchestration and workflow tools like MLflow and Kubeflow.
- Implement job scheduling policies in Slurm/Kubernetes and conduct root cause analysis for issues.
Requirements
- Bachelor’s or Master’s in Computer Science, Engineering, or related field.
- 5+ years in HPC operations, systems engineering, or DevOps.
- Advanced expertise in HPC environments, Slurm/Kubernetes for AI/ML, GPU management, and performance tuning.
- Experience with monitoring tools (Prometheus, Grafana, DCGM).
- Strong scripting/automation (Python, Bash, Ansible, Terraform).
- Deep knowledge of Linux, networking (RDMA, InfiniBand, RoCE), and storage (NFS, Lustre, Ceph).
Culture & Benefits
- Diverse team of 1,100+ employees from 68 nationalities in an inclusive, innovative environment.
- Values of Grit, Passion, and Impact driving resilience, excellence, and progress.
- Bonus and benefits on top of base salary; equal opportunity employer compliant with ADA.