Назад
Company hidden
4 часа назад

Staff Cloud SRE (AI/ML)

Формат работы
hybrid
Тип работы
fulltime
Грейд
senior
Английский
b2
Страна
UK
Вакансия из списка Hirify.GlobalВакансия из Hirify Global, списка международных tech-компаний
Для мэтча и отклика нужен Plus

Мэтч & Сопровод

Для мэтча с этой вакансией нужен Plus

Описание вакансии

Текст:
/

TL;DR

Staff Cloud SRE (AI/ML): Building and scaling the reliability foundations of an AI cloud platform and GPU compute infrastructure with an accent on SLO/SLI operationalization, GPU cluster efficiency, and production readiness. Focus on designing self-healing patterns, reducing operational burden, and establishing SRE standards from the ground up.

Location: Hybrid in London, United Kingdom (minimum 2 days a week in the office)

Company

hirify.global is a leading developer of Embodied AI technology creating mapless and hardware-agnostic AI products for automated driving systems.

What you will do

  • Own the reliability, availability, and performance of the Model Development Platform and GPU Compute environments.
  • Define and operationalize SLOs, SLIs, and error budgets across platform services.
  • Lead incident triage, escalation, and root cause analysis as part of a 24/7 on-call rotation.
  • Design and operate monitoring, logging, and tracing systems to enable rapid detection and recovery.
  • Build automation for cluster operations, training workflows, and self-healing recovery patterns.
  • Harden CI/CD and release processes to improve deployment safety and velocity.

Requirements

  • Proven experience as an SRE or Production Engineer supporting large-scale cloud systems.
  • Experience operating GPU-backed environments or large-scale ML infrastructure.
  • Strong production experience with Kubernetes and major cloud providers (AWS, GCP, or Azure).
  • Proficiency in Linux fundamentals and at least one systems language such as Python, Go, or C++.
  • Deep troubleshooting skills across networking, storage, and distributed systems.
  • Must be based in London to support a hybrid work schedule

Nice to have

  • Familiarity with Infrastructure-as-Code tools like Terraform.
  • Experience as a founding SRE hire establishing processes from scratch.
  • Experience defining and running reliability programs across multiple teams.

Culture & Benefits

  • Hybrid working policy combining office-based innovation with remote flexibility.
  • Opportunity to be a founding SRE shaping a new function within a high-growth AI company.
  • Inclusive work environment that values diversity and new perspectives.
  • Exposure to cutting-edge Embodied AI and autonomous driving technology.

Будьте осторожны: если работодатель просит войти в их систему, используя iCloud/Google, прислать код/пароль, запустить код/ПО, не делайте этого - это мошенники. Обязательно жмите "Пожаловаться" или пишите в поддержку. Подробнее в гайде →