Senior ML Systems Engineer (AI)

Work format

remote (Global)

Employment type

full-time

Grade

senior

English

Country

France/UK/US +1 more

Listing from Hirify Global, a list of international tech companies

Job description

TL;DR

Senior ML Systems Engineer (AI): Build, maintain, and evolve the training framework that powers large-scale language model training, with an emphasis on distributed training, HPC infrastructure, and tooling development. Focus on designing scalable training abstractions, improving throughput on multi-node clusters, and building robust systems for reproducible large-scale runs.

Location: Remote with offices in London, Paris, New York, Toronto, Montreal, and San Francisco

Company

hirify.global is a leading AI company focused on training and deploying frontier models to power advanced AI systems for developers and enterprises.

What you will do

Build and own the training framework for large-scale LLM training
Design distributed training abstractions including data, tensor, and pipeline parallelism
Improve training throughput and stability on multi-node HPC clusters
Develop tooling for monitoring, logging, debugging, and developer ergonomics
Collaborate with infrastructure teams to support high-performance training environments
Investigate and resolve performance bottlenecks across the ML systems stack

Requirements

Location: Remote with presence in London, Paris, New York, Toronto, Montreal, or San Francisco
Strong experience in large-scale distributed training or HPC systems
Familiarity with JAX internals, distributed training libraries, and multi-node cluster orchestration (Slurm, Ray, Kubernetes)
Experience debugging performance issues across CUDA/NCCL, networking, IO, and data pipelines
Experience with containerized environments such as Docker and Singularity/Apptainer
Strong collaboration skills to work with infra, research, and deployment teams

Nice to have

Experience training LLMs or large transformer architectures
Contributions to ML frameworks like PyTorch, JAX, DeepSpeed, Megatron
Familiarity with evaluation and serving frameworks such as vLLM and TensorRT-LLM
Background in performance engineering, profiling, or low-level systems
Publications at top-tier ML conferences

Culture & Benefits

Inclusive and open culture with a world-class AI research team
Weekly lunch stipend, in-office lunches, and snacks
Full health and dental benefits including mental health budget
100% parental leave top-up for up to 6 months
Personal enrichment benefits for arts, fitness, and workspace improvement
Remote-flexible with offices in multiple major cities and co-working stipend
6 weeks of vacation (30 working days)
