Staff Software Engineer (AI Infrastructure)
Описание вакансии
TL;DR
Staff Software Engineer (AI Infrastructure): Building and scaling compute infrastructure for AI models, with an emphasis on node lifecycle management, automated hardware repair, and cluster orchestration. Focus on optimizing accelerator capacity (GPU/TPU), designing high-availability distributed systems, and scaling infrastructure to hundreds of thousands of hosts.
Location: Hybrid in London, UK (must be in office at least 25% of the time)
Salary: £325,000 – £485,000
Company
is a public benefit corporation dedicated to creating reliable, interpretable, and steerable AI systems for the benefit of society.
What you will do
- Own the technical strategy and roadmap for node lifecycle management, including ingestion, bring-up, health checking, and automated repair.
- Drive cross-team initiatives to scale AI clusters across multiple clouds and accelerator families.
- Design and operate systems that automatically detect and remediate unhealthy hardware to minimize capacity loss.
- Define high-level infrastructure architecture and solve complex technical challenges directly or through other engineers.
- Collaborate with cloud providers and internal research/product teams to shape long-term compute and data strategy.
- Provide technical mentorship and coaching to support the growth of other engineers.
Requirements
- Deep expertise in distributed systems, reliability, and cloud platforms (Kubernetes, IaC, AWS/GCP/Azure).
- Strong proficiency in Rust, Go, or Python and expertise with Terraform.
- Hands-on experience with machine learning accelerators such as GPUs, TPUs, or Trainium.
- Track record of leading complex, multi-quarter technical initiatives spanning multiple teams.
- Must be based in or able to work from the London office at least 25% of the time.
- Bachelor’s degree or equivalent professional experience in a relevant field.
Nice to have
- Experience managing hyperscale compute infrastructure (10K+ nodes).
- Deep knowledge of Kubernetes internals (scheduler, autoscaler, kubelet, Karpenter) or orchestration systems like Borg or Mesos.
- Low-level systems experience with kernel, virtualization, device drivers, or firmware.
- Familiarity with high-performance networking (EFA, RDMA, InfiniBand) for distributed ML workloads.
- Contributions to relevant open-source projects (e.g., Kubernetes, Linux kernel).
Culture & Benefits
- Competitive compensation package including optional equity donation matching.
- Generous vacation and parental leave.
- Flexible working hours and a collaborative office environment.
- Strong focus on AI safety and commitment to team diversity and representation.
- Visa sponsorship available for qualified candidates.