TL;DR
Training Process Management Engineer (Backend): Develop and optimize distributed operating system software that orchestrates and supervises large-scale machine learning training workloads across thousands of machines, with an emphasis on performance, correctness, scalability, and reliability. The role focuses on designing and debugging high-performance asynchronous systems and on solving complex distributed systems problems at frontier AI scale.
Location: London, UK. Hybrid work model (3 days in office per week); relocation assistance available.
Company
hirify.global is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity by pushing the boundaries of AI capabilities and safely deploying AI products.
What you will do
- Work across Python and Rust stacks to build and maintain software for orchestration and monitoring of ML workloads on supercomputers
- Profile and optimize the software stack for computation orchestration at frontier scale
- Improve reliability, observability, and fault tolerance of long-running jobs
- Debug complex distributed system issues across large clusters
- Adapt to evolving ML system requirements to support researchers
Requirements
- Location: Must be based in or willing to relocate to London, UK
- Experience developing distributed systems and strong software engineering skills
- Proficiency in Python and in Rust or another systems programming language (e.g., C++)
- Solid Linux knowledge with systems-level debugging, performance analysis, and memory profiling
- Experience with asynchronous and concurrent systems development
- Strong focus on performance, correctness, and reliability
Culture & Benefits
- Hybrid work model with 3 days in office per week
- Relocation assistance for new employees
- Equal opportunity employer with commitment to diversity and inclusion
- Supportive environment valuing engineering ownership and agency