Staff Cloud SRE (AI/ML)
Job description
TL;DR
Staff Cloud SRE (AI/ML): Building and scaling the reliability foundations of an AI cloud platform and GPU compute infrastructure, with an emphasis on SLO/SLI operationalization, GPU cluster efficiency, and production readiness. The role focuses on designing self-healing patterns, reducing operational burden, and establishing SRE standards from the ground up.
Location: Hybrid in London, United Kingdom (minimum 2 days a week in the office)
Company
is a leading developer of Embodied AI technology creating mapless and hardware-agnostic AI products for automated driving systems.
What you will do
- Own the reliability, availability, and performance of the Model Development Platform and GPU Compute environments.
- Define and operationalize SLOs, SLIs, and error budgets across platform services.
- Lead incident triage, escalation, and root cause analysis as part of a 24/7 on-call rotation.
- Design and operate monitoring, logging, and tracing systems to enable rapid detection and recovery.
- Build automation for cluster operations, training workflows, and self-healing recovery patterns.
- Harden CI/CD and release processes to improve deployment safety and velocity.
Requirements
- Proven experience as an SRE or Production Engineer supporting large-scale cloud systems.
- Experience operating GPU-backed environments or large-scale ML infrastructure.
- Strong production experience with Kubernetes and major cloud providers (AWS, GCP, or Azure).
- Proficiency in Linux fundamentals and at least one systems language such as Python, Go, or C++.
- Deep troubleshooting skills across networking, storage, and distributed systems.
- Must be based in London to support a hybrid work schedule.
Nice to have
- Familiarity with Infrastructure-as-Code tools like Terraform.
- Experience as a founding SRE hire establishing processes from scratch.
- Experience defining and running reliability programs across multiple teams.
Culture & Benefits
- Hybrid working policy combining office-based innovation with remote flexibility.
- Opportunity to be a founding SRE shaping a new function within a high-growth AI company.
- Inclusive work environment that values diversity and new perspectives.
- Exposure to cutting-edge Embodied AI and autonomous driving technology.