Назад
Company hidden
3 дня назад

Senior Incident Manager (AI)

125 000 - 195 000$
Формат работы
remote (только USA)/hybrid
Тип работы
fulltime
Грейд
senior
Английский
b2
Страна
US
Вакансия из списка Hirify.GlobalВакансия из Hirify Global, списка международных tech-компаний
Для мэтча и отклика нужен Plus

Мэтч & Сопровод

Для мэтча с этой вакансией нужен Plus

Описание вакансии

Текст:
/

TL;DR

Senior Incident Manager (AI Infrastructure): Leading the end-to-end lifecycle of operational incidents impacting AI infrastructure and GPU clusters with an accent on rapid triage, cross-team coordination, and structured post-incident analysis. Focus on reducing MTTR, identifying systemic reliability gaps, and implementing corrective actions to improve operational resilience.

Location: Remote (USA) or Hybrid (San Jose, CA)

Salary: $125,000 – $195,000

Company

hirify.global is a leader in AI cloud infrastructure providing high-performance compute power for AI researchers and enterprises.

What you will do

  • Lead the response to critical (SEV-1 / SEV-2) incidents impacting AI infrastructure, GPU clusters, networking, and data center operations.
  • Serve as Incident Commander during major outages, coordinating engineering, networking, facilities, and vendor teams.
  • Own the entire incident response lifecycle, including technical triage, escalation, resolution, and post-incident reviews.
  • Conduct root cause analysis (RCA) to identify systemic reliability gaps and drive corrective actions.
  • Track and report critical metrics such as MTTR, MTTD, and incident recurrence rates.
  • Develop and maintain operational playbooks, runbooks, and reliability frameworks.

Requirements

  • 8+ years of experience in incident management, site reliability engineering (SRE), or infrastructure operations.
  • Experience managing incidents in large-scale distributed infrastructure environments.
  • Strong understanding of data center operations, GPU compute clusters, networking, and storage infrastructure.
  • Proficiency with tools such as PagerDuty, ServiceNow, Jira, Datadog, and Prometheus/Grafana.
  • Must be based in the USA.

Nice to have

  • Experience operating AI or HPC infrastructure, specifically NVIDIA clusters and InfiniBand networks.
  • Background in hyperscale or colocation data center environments.
  • Familiarity with the Incident Command System (ICS).
  • Experience building and developing incident command processes from scratch.

Culture & Benefits

  • Generous cash and equity compensation packages.
  • Comprehensive health, dental, and vision coverage for employees and dependents.
  • 401k Plan with 2% company match for USA employees.
  • Flexible paid time off (PTO) plan.
  • Wellness and commuter stipends for select roles.

Будьте осторожны: если работодатель просит войти в их систему, используя iCloud/Google, прислать код/пароль, запустить код/ПО, не делайте этого - это мошенники. Обязательно жмите "Пожаловаться" или пишите в поддержку. Подробнее в гайде →