Company hidden
1 month ago

Operations Engineer, Fleet Reliability (AI)

Work format
Remote
Employment type
Full-time
English
B2
Listing from Hirify.Global, a list of international tech companies

Job description

TL;DR

Operations Engineer, Fleet Reliability (AI): Maintain and scale high-performance GPU clusters for a generative media ecosystem, with an emphasis on fleet reliability, hardware troubleshooting, and automation. The role focuses on provisioning B300/H200/H100 nodes, solving complex NVLink/NCCL issues, and building observability pipelines.

Location: Remote

Company

hirify.global is the generative media ecosystem powering the next generation of AI products, providing the infrastructure, tools, and model access needed for high-performance inference and orchestration at scale.

What you will do

  • Provision, validate, and triage GPU nodes across B300, H200, and H100 clusters
  • Troubleshoot hardware and software issues across compute, network, and storage
  • Monitor fleet health, execute remediation actions, and push fixes upstream
  • Automate repetitive operational tasks using scripting to improve efficiency
  • Develop and maintain operational runbooks to standardize system recovery

Requirements

  • Experience administering Linux systems in the critical path
  • Proven track record of troubleshooting GPU node issues (NVLink, NCCL, InfiniBand, driver, and firmware bugs)
  • Experience with observability systems such as Grafana and Prometheus
  • Strong scripting skills in Bash, Python, or Go
  • Comfort with on-call rotations and working in ambiguous environments
