Senior Software Engineer (AI Reliability Engineering)

255,000 - 325,000 GBP

Work format

hybrid

Employment type

full-time

Grade

senior

English

Country

Vacancy from Hirify Global, a list of international tech companies

Job description

TL;DR

Senior Software Engineer (AI Reliability Engineering): Improve the reliability of hirify.global’s token path from client to inference servers for large language models, with an emphasis on designing and implementing high-availability infrastructure. The role focuses on monitoring systems, automated failover, incident response, and cost optimization for large-scale AI infrastructure.

Location: London, UK (Hybrid - expected in office 25% of the time). Visa sponsorship available.

Salary: £255,000 - £325,000 GBP (Annual)

Company

hirify.global is a public benefit corporation focused on creating reliable, interpretable, and steerable AI systems for society.

What you will do

Develop Service Level Objectives for large language model serving and training systems.
Design and implement monitoring systems including availability, latency, and other salient metrics.
Assist in the design and implementation of high-availability language model serving infrastructure.
Develop and manage automated failover and recovery systems across multiple regions and cloud providers.
Lead incident response for critical AI services, ensuring rapid recovery and systemic improvements.
Build and maintain cost optimization systems for large-scale AI infrastructure.

Requirements

Extensive experience with distributed systems observability and monitoring at scale.
Understanding of the unique challenges of operating AI infrastructure, including model serving, batch inference, and training pipelines.
Proven experience implementing and maintaining SLO/SLA frameworks for business-critical services.
Comfort working with both traditional metrics and AI-specific metrics.
Experience with chaos engineering and systematic resilience testing.
Ability to effectively bridge the gap between ML engineers and infrastructure teams.
Excellent communication skills.
At least a Bachelor's degree in a related field or equivalent experience is required.
Ability to work from the London, UK office at least 25% of the time.

Nice to have

Experience operating large-scale model training or serving infrastructure (>1000 GPUs).
Experience with one or more ML hardware accelerators (GPUs, TPUs, Trainium) or ML-specific networking optimizations (RDMA, InfiniBand).
Expertise in AI-specific observability tools and frameworks.
Understanding of ML model deployment strategies and their reliability implications.
Contribution to open-source infrastructure or ML tooling.

Culture & Benefits

Work as a single cohesive team on a few large-scale research efforts in AI.
A focus on impact: advancing the long-term goals of steerable, trustworthy AI.
An extremely collaborative group with frequent research discussions.
Competitive compensation and benefits, with optional equity donation matching.
Generous vacation and parental leave, and flexible working hours.
A lovely office space for collaborating with colleagues.

The vacancy text is reproduced without changes.
