Company hidden
2 days ago

Senior Software Engineer (AI Reliability Engineering)

255,000 - 325,000 GBP
Work format: hybrid
Employment type: full-time
Grade: senior
English: B2
Country: UK
Job from Hirify Global, a list of international tech companies

Job description

TL;DR

Senior Software Engineer (AI Reliability Engineering): Improve the reliability of hirify.global’s token path from client to inference servers for large language models, with an emphasis on designing and implementing high-availability infrastructure. The role focuses on monitoring systems, automated failover, incident response, and cost optimization for large-scale AI infrastructure.

Location: London, UK (Hybrid - expected in office 25% of the time). Visa sponsorship available.

Salary: £255,000 - £325,000 per year

Company

hirify.global is a public benefit corporation focused on creating reliable, interpretable, and steerable AI systems for society.

What you will do

  • Develop Service Level Objectives for large language model serving and training systems.
  • Design and implement monitoring systems including availability, latency, and other salient metrics.
  • Assist in the design and implementation of high-availability language model serving infrastructure.
  • Develop and manage automated failover and recovery systems across multiple regions and cloud providers.
  • Lead incident response for critical AI services, ensuring rapid recovery and systemic improvements.
  • Build and maintain cost optimization systems for large-scale AI infrastructure.

Requirements

  • Extensive experience with distributed systems observability and monitoring at scale.
  • Understanding of the unique challenges of operating AI infrastructure, including model serving, batch inference, and training pipelines.
  • Proven experience implementing and maintaining SLO/SLA frameworks for business-critical services.
  • Comfortable working with both traditional metrics and AI-specific metrics.
  • Experience with chaos engineering and systematic resilience testing.
  • Ability to effectively bridge the gap between ML engineers and infrastructure teams.
  • Excellent communication skills.
  • At least a Bachelor's degree in a related field or equivalent experience is required.
  • Ability to work from the London, UK office at least 25% of the time.

Nice to have

  • Experience operating large-scale model training or serving infrastructure (>1000 GPUs).
  • Experience with one or more ML hardware accelerators (GPUs, TPUs, Trainium) or ML-specific networking optimizations (RDMA, InfiniBand).
  • Expertise in AI-specific observability tools and frameworks.
  • Understanding of ML model deployment strategies and their reliability implications.
  • Contribution to open-source infrastructure or ML tooling.

Culture & Benefits

  • Work as a single cohesive team on a few large-scale AI research efforts.
  • Focus on impact, advancing the long-term goals of steerable, trustworthy AI.
  • Extremely collaborative group with frequent research discussions.
  • Competitive compensation and benefits, with optional equity donation matching.
  • Generous vacation and parental leave, plus flexible working hours.
  • A lovely office space for collaborating with colleagues.
