Company hidden
14 hours ago

Staff Cloud Site Reliability Engineer (AI)

Work format
hybrid
Employment type
full-time
Grade
lead
English
B2
Country
UK

Job description

TL;DR

Staff Cloud Site Reliability Engineer (AI): Build and scale the reliability foundations of hirify.global's AI cloud platform, including the Model Development Platform and GPU Compute platform, with an emphasis on predictable, efficient, and scalable operation of model development infrastructure, distributed systems, and large compute clusters. The goal is faster model training, reliable experimentation, and scalable AI deployment on resilient, performant cloud infrastructure.

Location: Must be in London, United Kingdom with hybrid work (2 days a week in the office).

Company

hirify.global is the leading developer of Embodied AI technology that enables vehicles to perceive, understand, and navigate complex environments.

What you will do

  • Own the reliability, availability, and performance of the Model Development Platform and GPU Compute environments.
  • Define and operationalise SLOs, SLIs, and error budgets across platform services.
  • Improve capacity planning, scaling strategies, and resource efficiency across large GPU-backed clusters.
  • Participate in a 24/7 on-call rotation as first-line response for cloud and cluster-related incidents.
  • Design and operate monitoring, logging, tracing, and alerting systems that enable rapid detection and recovery.
  • Build automation for cluster operations, training workflows, remediation, and scaling tasks.

Requirements

  • Proven experience in an SRE, Production Engineer, or Cloud Reliability role supporting large-scale cloud systems.
  • Strong Kubernetes experience, including operating production clusters.
  • Hands-on experience running production workloads in AWS, GCP, or Azure.
  • Experience operating complex distributed systems in production, ideally including compute-heavy or high-performance workloads.
  • Experience working with large compute clusters; exposure to AI/ML training or inference workloads strongly preferred.
  • Strong Linux fundamentals and proficiency in at least one scripting or systems language (e.g. Python, Go, C++) with a bias toward automation.

Nice to have

  • Experience operating GPU-backed environments or large-scale ML infrastructure.
  • Experience running model training or inference pipelines in production (MLOps).
  • Familiarity with infrastructure-as-code (e.g. Terraform) and secure cloud production environments.
  • Experience defining and running SLOs/SLIs and building reliability programs across multiple teams.
  • Experience as an early or founding SRE hire establishing processes from scratch.

Culture & Benefits

  • A hybrid working policy that combines time together in our offices and workshops, to fuel innovation, culture, relationships and learning, with time spent working from home.
  • Diverse, fair and respectful culture that is inclusive of everyone based on their unique skills and perspectives.
