Senior ML Systems Engineer (AI)
Job description
TL;DR
Senior ML Systems Engineer (AI): Build, maintain, and evolve the training framework that powers large-scale language model training, with an emphasis on distributed training, HPC infrastructure, and tooling development. The role focuses on designing scalable training abstractions, improving throughput on multi-node clusters, and building robust systems for reproducible large-scale runs.
Location: Remote with offices in London, Paris, New York, Toronto, Montreal, and San Francisco
Company
A leading AI company focused on training and deploying frontier models to power advanced AI systems for developers and enterprises.
What you will do
- Build and own the training framework for large-scale LLM training
- Design distributed training abstractions including data, tensor, and pipeline parallelism
- Improve training throughput and stability on multi-node HPC clusters
- Develop tooling for monitoring, logging, debugging, and developer ergonomics
- Collaborate with infrastructure teams to support high-performance training environments
- Investigate and resolve performance bottlenecks across the ML systems stack
Requirements
- Location: Remote with presence in London, Paris, New York, Toronto, Montreal, or San Francisco
- Strong experience in large-scale distributed training or HPC systems
- Familiarity with JAX internals, distributed training libraries, and multi-node cluster orchestration (Slurm, Ray, Kubernetes)
- Experience debugging performance issues across CUDA/NCCL, networking, IO, and data pipelines
- Experience with containerized environments such as Docker and Singularity/Apptainer
- Strong collaboration skills to work with infra, research, and deployment teams
Nice to have
- Experience training LLMs or large transformer architectures
- Contributions to ML frameworks such as PyTorch, JAX, DeepSpeed, or Megatron
- Familiarity with evaluation and serving frameworks such as vLLM and TensorRT-LLM
- Background in performance engineering, profiling, or low-level systems
- Publications at top-tier ML conferences
Culture & Benefits
- Inclusive and open culture with a world-class AI research team
- Weekly lunch stipend, in-office lunches, and snacks
- Full health and dental benefits, including a mental health budget
- 100% parental leave top-up for up to 6 months
- Personal enrichment benefits for arts, fitness, and workspace improvement
- Remote-flexible, with offices in multiple major cities and a co-working stipend
- 6 weeks of vacation (30 working days)