Operations Engineer, Fleet Reliability (AI)
Job description
TL;DR
Operations Engineer, Fleet Reliability (AI): Maintaining and scaling high-performance GPU clusters for a generative media ecosystem, with an emphasis on fleet reliability, hardware troubleshooting, and automation. The role focuses on provisioning B300/H200/H100 nodes, resolving complex NVLink/NCCL issues, and building observability pipelines.
Location: Remote
Company
The company is the generative media ecosystem powering the next generation of AI products, providing the infrastructure, tools, and model access needed for high-performance inference and orchestration at scale.
What you will do
- Provision, validate, and triage GPU nodes across B300, H200, and H100 clusters
- Troubleshoot hardware and software issues across compute, network, and storage
- Monitor fleet health, execute remediation actions, and push fixes upstream
- Automate repetitive operational tasks using scripting to improve efficiency
- Develop and maintain operational runbooks to standardize system recovery
Requirements
- Experience administering Linux systems in the critical path
- Proven track record of troubleshooting GPU node issues (NVLink, NCCL, InfiniBand, driver, and firmware bugs)
- Experience with observability systems like Grafana and Prometheus
- Strong scripting skills in Bash, Python, or Go
- Comfort with on-call rotations and working in ambiguous environments