Engineering Manager, Fleet Reliability (AI)
Job description
TL;DR
Engineering Manager, Fleet Reliability (Infrastructure/AI): Build and manage the fleet reliability function that keeps GPU nodes provisioned and healthy, with an emphasis on automation and SRE fundamentals. The role focuses on driving the automation roadmap, defining production SLAs, and scaling the fleet 10x.
Location: Remote
Company
is a generative media ecosystem providing the infrastructure, tools, and model access necessary for teams to move AI products from idea to production at scale.
What you will do
- Build and lead the Fleet Reliability team, including hiring, developing, and retaining talent.
- Own 24/7 coverage for node provisioning, validation, and triage.
- Drive the automation roadmap, focusing on event-driven remediation and self-healing systems.
- Define and enforce SLAs to ensure production GPUs effectively serve traffic.
- Establish team culture, communication standards, and growth metrics.
Requirements
- 7+ years of experience in infrastructure, software, or SRE.
- 2+ years of experience leading a fleet reliability or hardware ops team in a production environment.
- Experience building SRE fundamentals from scratch, including incident management and observability.
- Strong preference for automation over manual toil.
- Ability to operate as a player-coach, including carrying the pager.