TL;DR
Senior Software Engineer (AI Reliability Engineering): Improve the reliability of hirify.global’s token path from client to inference servers for large language models, with an emphasis on designing and implementing high-availability infrastructure. The role focuses on monitoring systems, automated failover, incident response, and cost optimization for large-scale AI infrastructure.
Location: London, UK (Hybrid - expected in office 25% of the time). Visa sponsorship available.
Salary: £255,000 - £325,000 (Annual)
Company
hirify.global is a public benefit corporation focused on creating reliable, interpretable, and steerable AI systems for society.
What you will do
- Develop Service Level Objectives for large language model serving and training systems.
- Design and implement monitoring systems covering availability, latency, and other salient metrics.
- Assist in the design and implementation of high-availability language model serving infrastructure.
- Develop and manage automated failover and recovery systems across multiple regions and cloud providers.
- Lead incident response for critical AI services, ensuring rapid recovery and systemic improvements.
- Build and maintain cost optimization systems for large-scale AI infrastructure.
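The SLO work above can be illustrated with a minimal sketch of an availability error-budget calculation. The function name, SLO target, and request counts here are illustrative assumptions, not hirify.global internals:

```python
# Minimal sketch of an availability SLO check for a model-serving endpoint.
# All names and numbers are illustrative, not actual hirify.global systems.

def error_budget_remaining(total_requests: int, failed_requests: int,
                           slo_target: float = 0.999) -> float:
    """Return the fraction of the error budget still unspent.

    The error budget is the allowed failure fraction (1 - SLO target).
    A negative result means the SLO has been breached.
    """
    if total_requests == 0:
        return 1.0  # no traffic observed, budget untouched
    budget = 1.0 - slo_target              # allowed failure rate
    observed = failed_requests / total_requests
    return (budget - observed) / budget

# Example: 10M requests with 5,000 failures against a 99.9% SLO
# consumes half of the allowed failure budget.
remaining = error_budget_remaining(10_000_000, 5_000)
print(f"{remaining:.0%} of the error budget remains")
```

In practice such a calculation would typically run as a recording rule or alert in a metrics system rather than application code, with burn-rate alerts firing as the budget is consumed.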
Requirements
- Extensive experience with distributed systems observability and monitoring at scale.
- Understanding of the unique challenges of operating AI infrastructure, including model serving, batch inference, and training pipelines.
- Proven experience implementing and maintaining SLO/SLA frameworks for business-critical services.
- Comfortable working with both traditional metrics and AI-specific metrics.
- Experience with chaos engineering and systematic resilience testing.
- Ability to effectively bridge the gap between ML engineers and infrastructure teams.
- Excellent communication skills.
- Bachelor's degree in a related field, or equivalent experience.
- Ability to work from the London, UK office at least 25% of the time.
Nice to have
- Experience operating large-scale model training or serving infrastructure (>1000 GPUs).
- Experience with one or more ML hardware accelerators (GPUs, TPUs, Trainium) or ML-specific networking optimizations (RDMA, InfiniBand).
- Expertise in AI-specific observability tools and frameworks.
- Understanding of ML model deployment strategies and their reliability implications.
- Contribution to open-source infrastructure or ML tooling.
Culture & Benefits
- Work as a single cohesive team on a few large-scale research efforts in AI.
- Value impact, advancing the long-term goals of steerable, trustworthy AI.
- Extremely collaborative group with frequent research discussions.
- Competitive compensation and benefits, including optional equity donation matching.
- Generous vacation and parental leave, flexible working hours.
- Lovely office space to collaborate with colleagues.