Company hidden
13 hours ago

Senior Engineer - HPC Operations (AI Infrastructure)

$106,400 - $159,600
Job type
Full-time
Grade
Senior
English
B2
Country
US

Job description

TL;DR

Senior Engineer - HPC Operations (AI/ML): Oversee daily operations and support of high-performance computing clusters that power large-scale AI and ML workloads, with an emphasis on stability, security, and high performance. Focus on optimizing resource utilization, resolving incidents, and automating processes in globally distributed environments.

Location: United States

Salary: US$106,400 to US$159,600 per year

Company

hirify.global is a leader in AI-powered cloud and digital infrastructure, empowering clients to harness sovereign AI infrastructure in regulated sectors.

What you will do

  • Lead operational support of HPC infrastructure including compute, storage, networking, and schedulers like Slurm and Kubernetes.
  • Maximize efficiency and performance of HPC systems for optimal resource utilization and minimal downtime.
  • Serve as primary escalation point for L2 support, resolving incidents and service requests.
  • Monitor system health using tools like Prometheus, Grafana, and DCGM.
  • Manage user environments for AI/ML workloads with container orchestration and workflow tools like MLflow and Kubeflow.
  • Implement job scheduling policies in Slurm/Kubernetes and conduct root cause analysis for issues.

Requirements

  • Bachelor’s or Master’s in Computer Science, Engineering, or related field.
  • 5+ years in HPC operations, systems engineering, or DevOps.
  • Advanced expertise in HPC environments, Slurm/Kubernetes for AI/ML, GPU management, and performance tuning.
  • Experience with monitoring tools (Prometheus, Grafana, DCGM).
  • Strong scripting and automation skills (Python, Bash, Ansible, Terraform).
  • Deep knowledge of Linux, networking (RDMA, InfiniBand, RoCE), and storage (NFS, Lustre, Ceph).

Culture & Benefits

  • Diverse team of 1,100+ employees from 68 nationalities in an inclusive, innovative environment.
  • Values of Grit, Passion, and Impact driving resilience, excellence, and progress.
  • Bonus and benefits on top of base salary; equal opportunity employer compliant with ADA.
