Senior Incident Manager (AI)
Job description
TL;DR
Senior Incident Manager (AI Infrastructure): Leading the end-to-end lifecycle of operational incidents impacting AI infrastructure and GPU clusters, with an emphasis on rapid triage, cross-team coordination, and structured post-incident analysis. Focus on reducing MTTR, identifying systemic reliability gaps, and implementing corrective actions to improve operational resilience.
Location: Remote (USA) or Hybrid (San Jose, CA)
Salary: $125,000 – $195,000
Company
The company is a leader in AI cloud infrastructure, providing high-performance compute for AI researchers and enterprises.
What you will do
- Lead the response to critical (SEV-1 / SEV-2) incidents impacting AI infrastructure, GPU clusters, networking, and data center operations.
- Serve as Incident Commander during major outages, coordinating engineering, networking, facilities, and vendor teams.
- Own the entire incident response lifecycle, including technical triage, escalation, resolution, and post-incident reviews.
- Conduct root cause analysis (RCA) to identify systemic reliability gaps and drive corrective actions.
- Track and report critical metrics such as MTTR, MTTD, and incident recurrence rates.
- Develop and maintain operational playbooks, runbooks, and reliability frameworks.
Requirements
- 8+ years of experience in incident management, site reliability engineering (SRE), or infrastructure operations.
- Experience managing incidents in large-scale distributed infrastructure environments.
- Strong understanding of data center operations, GPU compute clusters, networking, and storage infrastructure.
- Proficiency with tools such as PagerDuty, ServiceNow, Jira, Datadog, and Prometheus/Grafana.
- Must be based in the USA.
Nice to have
- Experience operating AI or HPC infrastructure, specifically NVIDIA clusters and InfiniBand networks.
- Background in hyperscale or colocation data center environments.
- Familiarity with the Incident Command System (ICS).
- Experience building incident command processes from scratch.
Culture & Benefits
- Generous cash and equity compensation packages.
- Comprehensive health, dental, and vision coverage for employees and dependents.
- 401(k) plan with a 2% company match for US employees.
- Flexible paid time off (PTO) plan.
- Wellness and commuter stipends for select roles.