Company hidden
19 hours ago

Software Engineer (ML Infrastructure)

Job type
Full-time
Level
Senior
English
B2
Country
US
Job from Hirify Global, a list of international tech companies

Job description

TL;DR

Software Engineer (ML Infrastructure): Architecting and scaling a high-performance training platform for large-scale GPU clusters, with an emphasis on multi-tenant orchestration and scheduling primitives. Focus on optimizing GPU utilization, ensuring system reliability for multi-thousand-GPU workloads, and integrating CNCF ecosystem tools.

Location: Must be based in the United States (inferred from the posting's US Department of Labor compliance and commuter benefits).

Company

hirify.global develops reliable AI systems and high-quality data technologies that power the world's leading models for enterprises and governments.

What you will do

  • Architect and scale a multi-tenant orchestration layer for GPU clusters to ensure high utilization and seamless job recovery.
  • Design and implement scheduling primitives to optimize the lifecycle of training jobs.
  • Develop deep observability and automated health-checking into the training stack to isolate hardware failures.
  • Evaluate and integrate emerging technologies in the CNCF and AI ecosystem, such as Ray and Kueue.
  • Collaborate with Finance and Procurement teams to drive the capacity planning process.
  • Own projects end-to-end, from requirements and scoping to implementation.

Requirements

  • 5+ years of experience in backend or infrastructure engineering.
  • At least 2 years of experience orchestrating ML workloads at scale (100+ GPU nodes).
  • Strong programming skills in Python, Go, Rust, or C++.
  • Expert-level knowledge of Kubernetes internals, including Custom Resources, Operators, and Admission Controllers.
  • Experience with distributed training infrastructure (EFA, InfiniBand) and distributed storage (Lustre, S3).
  • Must have valid work authorization for the United States.

Nice to have

  • Experience with distributed training techniques such as DeepSpeed or FSDP.
  • Experience with the NVIDIA software and hardware stack (CUDA, NCCL).
  • Experience with PyTorch.
  • Familiarity with post-training algorithms such as GRPO and other reinforcement learning methods.

Culture & Benefits

  • Comprehensive health, dental, and vision coverage.
  • Retirement benefits and a learning and development stipend.
  • Generous PTO and potential commuter stipend.
  • Inclusive and equal opportunity workplace committed to diversity.
