Operations Engineer, Fleet Reliability (AI)

Work format

Remote

Employment type

Full-time

English

Job from Hirify Global, a list of international tech companies

Job description

TL;DR

Operations Engineer, Fleet Reliability (AI): Maintain and scale high-performance GPU clusters for a generative media ecosystem, with an emphasis on fleet reliability, hardware troubleshooting, and automation. The role focuses on provisioning B300/H200/H100 nodes, solving complex NVLink/NCCL issues, and building observability pipelines.

Location: Remote

Company

hirify.global is the generative media ecosystem powering the next generation of AI products, providing the infrastructure, tools, and model access needed for high-performance inference and orchestration at scale.

What you will do

Provision, validate, and triage GPU nodes across B300, H200, and H100 clusters
Troubleshoot hardware and software issues across compute, network, and storage
Monitor fleet health, execute remediation actions, and push fixes upstream
Automate repetitive operational tasks using scripting to improve efficiency
Develop and maintain operational runbooks to standardize system recovery

Requirements

Experience administering Linux systems in the critical path
Proven track record of troubleshooting GPU node issues (NVLink, NCCL, InfiniBand, driver, and firmware bugs)
Experience with observability systems like Grafana and Prometheus
Strong scripting skills in Bash, Python, or Go
Comfort with on-call rotations and working in ambiguous environments
