Engineering Manager, Fleet Reliability (AI)

Work format

remote

Employment type

full-time

Grade

lead

English

Vacancy from Hirify Global, a list of international tech companies

Job description

TL;DR

Engineering Manager, Fleet Reliability (Infrastructure/AI): Build and manage the fleet reliability function that keeps GPU nodes provisioned and healthy, with an emphasis on automation and SRE fundamentals. The role focuses on driving the automation roadmap, defining production SLAs, and scaling the fleet by 10x.

Location: Remote

Company

hirify.global is a generative media ecosystem providing the infrastructure, tools, and model access necessary for teams to move AI products from idea to production at scale.

What you will do

Build and lead the Fleet Reliability team, including hiring, developing, and retaining talent.
Own 24/7 coverage for node provisioning, validation, and triage.
Drive the automation roadmap, focusing on event-driven remediation and self-healing systems.
Define and enforce SLAs to ensure production GPUs effectively serve traffic.
Establish team culture, communication standards, and growth metrics.

Requirements

7+ years of experience in infrastructure, software, or SRE.
2+ years of experience leading a fleet reliability or hardware ops team in a production environment.
Experience building SRE fundamentals from scratch, including incident management and observability.
A strong preference for automation over manual toil.
Ability to operate as a player-coach, including carrying the pager.
