Senior Incident Manager (AI)

125 000 - 195 000$

Формат работы

remote (только USA)/hybrid

Тип работы

fulltime

Грейд

senior

Английский

Страна

Вакансия из Hirify Global, списка международных tech-компаний
Для мэтча и отклика нужен Plus

Мэтч & Сопровод

Для мэтча с этой вакансией нужен Plus

Описание вакансии

Текст:

TL;DR

Senior Incident Manager (AI Infrastructure): Leading the end-to-end lifecycle of operational incidents impacting AI infrastructure and GPU clusters with an accent on rapid triage, cross-team coordination, and structured post-incident analysis. Focus on reducing MTTR, identifying systemic reliability gaps, and implementing corrective actions to improve operational resilience.

Location: Remote (USA) or Hybrid (San Jose, CA)

Salary: $125,000 – $195,000

Company

hirify.global is a leader in AI cloud infrastructure providing high-performance compute power for AI researchers and enterprises.

What you will do

Lead the response to critical (SEV-1 / SEV-2) incidents impacting AI infrastructure, GPU clusters, networking, and data center operations.
Serve as Incident Commander during major outages, coordinating engineering, networking, facilities, and vendor teams.
Own the entire incident response lifecycle, including technical triage, escalation, resolution, and post-incident reviews.
Conduct root cause analysis (RCA) to identify systemic reliability gaps and drive corrective actions.
Track and report critical metrics such as MTTR, MTTD, and incident recurrence rates.
Develop and maintain operational playbooks, runbooks, and reliability frameworks.

Requirements

8+ years of experience in incident management, site reliability engineering (SRE), or infrastructure operations.
Experience managing incidents in large-scale distributed infrastructure environments.
Strong understanding of data center operations, GPU compute clusters, networking, and storage infrastructure.
Proficiency with tools such as PagerDuty, ServiceNow, Jira, Datadog, and Prometheus/Grafana.
Must be based in the USA.

Nice to have

Experience operating AI or HPC infrastructure, specifically NVIDIA clusters and InfiniBand networks.
Background in hyperscale or colocation data center environments.
Familiarity with the Incident Command System (ICS).
Experience building and developing incident command processes from scratch.

Culture & Benefits

Generous cash and equity compensation packages.
Comprehensive health, dental, and vision coverage for employees and dependents.
401k Plan with 2% company match for USA employees.
Flexible paid time off (PTO) plan.
Wellness and commuter stipends for select roles.

Будьте осторожны: если работодатель просит войти в их систему, используя iCloud/Google, прислать код/пароль, запустить код/ПО, не делайте этого - это мошенники. Обязательно жмите "Пожаловаться" или пишите в поддержку. Подробнее в гайде →