Senior Engineer - HPC Operations (AI)
Job description
TL;DR
Senior Engineer - HPC Operations (AI): Oversee the daily operations and support of high-performance computing clusters for large-scale AI and ML workloads, with an emphasis on infrastructure stability and performance. The role focuses on optimizing GPU resource management, automating HPC environments, and implementing cutting-edge MLOps platforms.
Location: Must be based in the United States
Salary: US$106,400 – US$159,600 per year
Company
The company is a leader in AI-powered cloud and digital infrastructure, empowering clients to harness sovereign AI infrastructure globally.
What you will do
- Lead operational support for HPC infrastructure, including compute, storage, networking, and schedulers like Slurm and Kubernetes.
- Maximize efficiency and performance of HPC systems to ensure optimal resource utilization and minimal downtime.
- Act as the primary technical escalation point for L2 support teams to ensure prompt incident resolution.
- Monitor system health and performance using tools such as Prometheus, Grafana, and DCGM.
- Manage AI/ML user environments using container orchestration (Docker, Kubernetes) and workflow tools (MLflow, Kubeflow).
- Perform root cause analysis (RCA) of operational issues and provide mentorship to junior engineers.
Requirements
- Bachelor’s or Master’s degree in Computer Science, Engineering, or a related technical field.
- 5+ years of experience in HPC operations, systems engineering, or DevOps roles.
- Advanced expertise in configuring and maintaining complex HPC environments, including Slurm clusters and Kubernetes.
- Expert knowledge of GPU resource management and performance tuning for AI/ML workloads.
- Strong scripting and automation skills using Python, Bash, Ansible, and Terraform.
- In-depth understanding of Linux (RHEL/CentOS/Ubuntu), networking (RDMA, InfiniBand, RoCE), and storage (NFS, Lustre, Ceph).
Culture & Benefits
- Inclusive and collaborative environment with a diverse team of 1,100+ employees from 68 nationalities.
- Culture based on values of Grit, Passion, and Impact.
- Commitment to diversity, equity, and inclusion as an equal opportunity employer.
- Comprehensive compensation package including base salary, bonus, and benefits.