Staff Cloud Site Reliability Engineer (AI)

Work format

hybrid

Employment type

full-time

Grade

lead

English

Country

This vacancy is from Hirify Global, a list of international tech companies.

Job description

TL;DR

Staff Cloud Site Reliability Engineer (AI): Build and scale the reliability foundations of hirify.global's AI cloud platform, including the Model Development Platform and GPU Compute platform, with an emphasis on predictable, efficient, and scalable operation of model development infrastructure, distributed systems, and large compute clusters. The role supports faster model training, reliable experimentation, and scalable AI deployment by keeping cloud infrastructure resilient and performant.

Location: Must be based in London, United Kingdom. Hybrid work (2 days a week in the office).

Company

hirify.global is the leading developer of Embodied AI technology that enables vehicles to perceive, understand, and navigate complex environments.

What you will do

Own the reliability, availability, and performance of the Model Development Platform and GPU Compute environments.
Define and operationalise SLOs, SLIs, and error budgets across platform services.
Improve capacity planning, scaling strategies, and resource efficiency across large GPU-backed clusters.
Participate in a 24/7 on-call rotation as first-line response for cloud and cluster-related incidents.
Design and operate monitoring, logging, tracing, and alerting systems that enable rapid detection and recovery.
Build automation for cluster operations, training workflows, remediation, and scaling tasks.

Requirements

Proven experience in an SRE, Production Engineer, or Cloud Reliability role supporting large-scale cloud systems.
Strong Kubernetes experience, including operating production clusters.
Hands-on experience running production workloads in AWS, GCP, or Azure.
Experience operating complex distributed systems in production, ideally including compute-heavy or high-performance workloads.
Experience working with large compute clusters; exposure to AI/ML training or inference workloads strongly preferred.
Strong Linux fundamentals and proficiency in at least one scripting or systems language (e.g. Python, Go, C++) with a bias toward automation.

Nice to have

Experience operating GPU-backed environments or large-scale ML infrastructure.
Experience running model training or inference pipelines in production (MLOps).
Familiarity with infrastructure-as-code (e.g. Terraform) and secure cloud production environments.
Experience defining and running SLOs/SLIs and building reliability programs across multiple teams.
Experience as an early or founding SRE hire establishing processes from scratch.

Culture & Benefits

A hybrid working policy that combines time together in our offices and workshops, to fuel innovation, culture, relationships and learning, with time spent working from home.
Diverse, fair and respectful culture that is inclusive of everyone based on their unique skills and perspectives.
