Staff Software Engineer (AI Infrastructure)
Описание вакансии
TL;DR
Staff Software Engineer (AI Infrastructure): Building and scaling compute infrastructure for AI models, with an emphasis on node lifecycle management, automated hardware repair, and cluster orchestration. Focus on optimizing accelerator capacity (GPU/TPU), designing high-availability distributed systems, and scaling infrastructure to hundreds of thousands of hosts.
Location: Hybrid in London, UK (must be in office at least 25% of the time)
Salary: £325,000 – £485,000
Company
is a public benefit corporation dedicated to creating reliable, interpretable, and steerable AI systems for the benefit of society.
What you will do
- Own the technical strategy and roadmap for node lifecycle management, including ingestion, bring-up, health checking, and automated repair.
- Drive cross-team initiatives to scale AI clusters across multiple clouds and accelerator families.
- Design and operate systems that automatically detect and remediate unhealthy hardware to minimize capacity loss.
- Define high-level infrastructure architecture and solve complex technical challenges directly or through other engineers.
- Collaborate with cloud providers and internal research/product teams to shape long-term compute and data strategy.
- Provide technical mentorship and coaching to support the growth of other engineers.
Requirements
- Deep expertise in distributed systems, reliability, and cloud platforms (Kubernetes, IaC, AWS/GCP/Azure).
- Strong proficiency in Rust, Go, or Python and expertise with Terraform.
- Hands-on experience with machine learning accelerators such as GPUs, TPUs, or Trainium.
- Track record of leading complex, multi-quarter technical initiatives spanning multiple teams.
- Must be based in or able to work from the London office at least 25% of the time.
- Bachelor’s degree or equivalent professional experience in a relevant field.
Nice to have
- Experience managing hyperscale compute infrastructure (10K+ nodes).
- Deep knowledge of Kubernetes internals (scheduler, autoscaler, kubelet, Karpenter) or orchestration systems like Borg or Mesos.
- Low-level systems experience with kernel, virtualization, device drivers, or firmware.
- Familiarity with high-performance networking (EFA, RDMA, InfiniBand) for distributed ML workloads.
- Contributions to relevant open-source projects (e.g., Kubernetes, Linux kernel).
Culture & Benefits
- Competitive compensation package including optional equity donation matching.
- Generous vacation and parental leave.
- Flexible working hours and a collaborative office environment.
- Strong focus on AI safety and commitment to team diversity and representation.
- Visa sponsorship available for qualified candidates.