Company hidden
1 day ago

Senior Engineer - HPC Operations (AI)

US$106,400 - US$159,600
Job type
Full-time
Grade
Senior
English
B2
Country
US

Job description

TL;DR

Senior Engineer - HPC Operations (AI): Overseeing the daily operations and support of high-performance computing clusters for large-scale AI and ML workloads, with an emphasis on infrastructure stability and performance. The role focuses on optimizing GPU resource management, automating HPC environments, and implementing cutting-edge MLOps platforms.

Location: Must be based in the United States

Salary: US$106,400 – US$159,600 per year

Company

hirify.global is a leader in AI-powered cloud and digital infrastructure, empowering clients to harness sovereign AI infrastructure globally.

What you will do

  • Lead operational support for HPC infrastructure, including compute, storage, networking, and schedulers like Slurm and Kubernetes.
  • Maximize efficiency and performance of HPC systems to ensure optimal resource utilization and minimal downtime.
  • Act as the primary technical escalation point for L2 support teams to ensure prompt incident resolution.
  • Monitor system health and performance using tools such as Prometheus, Grafana, and DCGM.
  • Manage AI/ML user environments using container orchestration (Docker, Kubernetes) and workflow tools (MLflow, Kubeflow).
  • Perform root cause analysis (RCA) of operational issues and provide mentorship to junior engineers.

Requirements

  • Bachelor’s or Master’s degree in Computer Science, Engineering, or a related technical field.
  • 5+ years of experience in HPC operations, systems engineering, or DevOps roles.
  • Advanced expertise in configuring and maintaining complex HPC environments, including Slurm clusters and Kubernetes.
  • Expert knowledge of GPU resource management and performance tuning for AI/ML workloads.
  • Strong scripting and automation skills using Python, Bash, Ansible, and Terraform.
  • In-depth understanding of Linux (RHEL/CentOS/Ubuntu), networking (RDMA, InfiniBand, RoCE), and storage (NFS, Lustre, Ceph).

Culture & Benefits

  • Inclusive and collaborative environment with a diverse team of 1,100+ employees from 68 nationalities.
  • Culture based on values of Grit, Passion, and Impact.
  • Commitment to diversity, equity, and inclusion as an equal opportunity employer.
  • Comprehensive compensation package including base salary, bonus, and benefits.
