Company hidden
7 months ago

Senior ML Systems Engineer (AI)

Work format: remote (Global)
Employment type: full-time
Grade: senior
English: B2
Country: France/UK/US +1 more

Job description

TL;DR

Senior ML Systems Engineer (AI): Build, maintain, and evolve the training framework powering large-scale language model training, with an emphasis on distributed training, HPC infrastructure, and tooling development. Focus on designing scalable training abstractions, improving throughput on multi-node clusters, and building robust systems for reproducible large-scale runs.

Location: Remote with offices in London, Paris, New York, Toronto, Montreal, and San Francisco

Company

hirify.global is a leading AI company focused on training and deploying frontier models to power advanced AI systems for developers and enterprises.

What you will do

  • Build and own the training framework for large-scale LLM training
  • Design distributed training abstractions including data, tensor, and pipeline parallelism
  • Improve training throughput and stability on multi-node HPC clusters
  • Develop tooling for monitoring, logging, debugging, and developer ergonomics
  • Collaborate with infrastructure teams to support high-performance training environments
  • Investigate and resolve performance bottlenecks across the ML systems stack

Requirements

  • Location: Remote with presence in London, Paris, New York, Toronto, Montreal, or San Francisco
  • Strong experience in large-scale distributed training or HPC systems
  • Familiarity with JAX internals, distributed training libraries, and multi-node cluster orchestration (Slurm, Ray, Kubernetes)
  • Experience debugging performance issues across CUDA/NCCL, networking, IO, and data pipelines
  • Experience with containerized environments such as Docker and Singularity/Apptainer
  • Strong collaboration skills to work with infra, research, and deployment teams

Nice to have

  • Experience training LLMs or large transformer architectures
  • Contributions to ML frameworks like PyTorch, JAX, DeepSpeed, Megatron
  • Familiarity with evaluation and serving frameworks such as vLLM and TensorRT-LLM
  • Background in performance engineering, profiling, or low-level systems
  • Publications at top-tier ML conferences

Culture & Benefits

  • Inclusive and open culture with a world-class AI research team
  • Weekly lunch stipend, in-office lunches, and snacks
  • Full health and dental benefits, including a mental health budget
  • 100% parental leave top-up for up to 6 months
  • Personal enrichment benefits for arts, fitness, and workspace improvement
  • Remote-flexible with offices in multiple major cities and co-working stipend
  • 6 weeks of vacation (30 working days)
