Company hidden
13 hours ago

Senior Site Reliability Engineer (AI)

Work format
hybrid
Employment type
full-time
Grade
senior
English
B2
Country
UK

Job description

TL;DR

Senior Site Reliability Engineer (AI): Keep hirify.global’s autonomous driving fleet reliable, observable, and safe while it operates on public roads, with an emphasis on turning real-world incidents and performance bottlenecks into lasting engineering improvements. The role focuses on designing and delivering automation for fleet operations, deployments, and repetitive workflows to reduce manual intervention and harden the production environment.

Location: Hybrid (London, United Kingdom), requiring 2-3 days a week in the office.

Company

hirify.global is a leading developer of Embodied AI technology, founded in 2017, focused on creating intelligent, mapless, and hardware-agnostic AI products for automakers to accelerate the transition to automated driving.

What you will do

  • Improve reliability, availability, and performance of vehicle software systems across the dev fleet.
  • Participate in the on-call rotation to provide out-of-hours support for live systems.
  • Build and operate monitoring, logging, alerting, and on-call tooling for fast detection and recovery.
  • Drive incident response and post-incident learning, translating root causes into durable fixes.
  • Design and deliver automation for fleet operations, deployments, and repetitive workflows.
  • Partner with Vehicle SW, operations, and platform teams to define SLOs, reliability metrics, and release readiness.

Requirements

  • Proven SRE, production reliability, or platform operations experience in complex distributed systems.
  • Strong Linux fundamentals and hands-on experience with CI/CD, containers (Docker), and orchestration (Kubernetes).
  • Proficiency in at least one systems or scripting language (Python, C++, or Rust) with a bias for automation.
  • Deep troubleshooting skills across networking, distributed systems, and databases, including performance and availability issues.
  • Experience designing observability stacks and using tools such as Datadog, Prometheus, Grafana, OpenTelemetry, Splunk, or Humio.
  • Clear communication skills, including incident leadership, writing postmortems, and influencing engineering priorities.
  • Work from the London office at least 2-3 days a week.

Nice to have

  • Cloud platform experience (AWS, GCP, or Azure), including infrastructure-as-code and secure production operations.
  • Experience with real-time or safety-critical systems, hardware-in-the-loop, or embedded/robotics environments.
  • Familiarity with fleet operations, telemetry pipelines, and operating software on edge devices at scale.
  • Experience defining and running SLOs/SLIs and reliability programs across multiple teams.

Culture & Benefits

  • Commitment to creating a diverse, fair, and respectful culture that is inclusive of everyone.
  • Emphasis on embracing uncertainty, leaning into complex challenges, and continuous learning and evolving.
  • A culture that values diversity, embraces new perspectives, and fosters an inclusive work environment.
  • Hybrid working policy combining office time for innovation, culture, relationships, and learning with working from home.
  • Commitment to an inclusive interview experience, with accommodations or adjustments provided if required.

The vacancy text is reproduced unchanged.