Incident & Change Champion (AI Infrastructure)
Job description
TL;DR
Incident & Change Champion (AI Infrastructure): Own and optimize the Incident and Change Management processes for a GPU cloud platform, with an emphasis on operational discipline and tooling implementation. The role focuses on reducing system downtime through disciplined major incident coordination, CAB leadership, and a blameless postmortem culture.
Location: Remote (Global)
Company
The company operates a GPU cloud engineered for AI, providing cost-effective, high-performance infrastructure for AI start-ups and large enterprise customers.
What you will do
- Develop and refine Incident and Change Management processes to v1.0, including severity declarations, SLA/SLO tables, and communication ladders.
- Lead the migration and implementation of incident and change workflows within Jira Service Management.
- Act as Incident Commander or Major Incident Manager for SEV-1 and complex SEV-2 events, coordinating internal and external communications.
- Chair the Change Advisory Board (CAB) and manage the change calendar, including freeze windows for critical periods.
- Train and certify a pool of Incident Commanders across Support and SRE teams and run quarterly tabletop exercises.
- Define and report key operational metrics (MTTA, MTTR, change success rate) to the senior leadership team.
Requirements
- 5+ years in ITSM / Service Management roles with direct ownership of Incident and Change Management processes.
- Hands-on experience facilitating major incidents end-to-end as an Incident Commander in a 24/7 production environment.
- Demonstrable experience running a Change Advisory Board or equivalent change-review forum.
- Proven track record configuring Jira Service Management, ServiceNow, or equivalent ITSM tooling.
- Strong technical writing skills for process documents, postmortems, and executive reports.
- Comfort holding the room under pressure with senior stakeholders, engineers, and customers concurrently.
Nice to have
- Experience in cloud, hyperscaler, AI infrastructure, or HPC environments.
- Familiarity with SRE concepts, including SLOs, error budgets, and runbook discipline.
- Experience designing and running tabletop exercises and game days.
- Experience operating processes for regulated or sovereign customer workloads.
- Familiarity with Jira automation and JSM portals.
Culture & Benefits
- Competitive compensation package including base salary and equity with annual reviews.
- Remote-first work environment with high autonomy and human-first flexibility.
- Opportunity to join a fast-growing tech startup pushing the boundaries of AI infrastructure.
- Dynamic progression plan tailored to individual professional ambitions.
- Collaborative and supportive environment focused on ownership, transparency, and accountability.