2 months ago

Software Engineer (ML Infrastructure)

Employment type

Full-time

Grade

Senior

English

Country

Vacancy from Hirify Global, a list of international tech companies

Job description

TL;DR

Software Engineer (ML Infrastructure): Architecting and scaling a high-performance training platform for large-scale GPU clusters, with an emphasis on multi-tenant orchestration and scheduling primitives. Focus on optimizing GPU utilization, ensuring system reliability for multi-thousand-GPU workloads, and integrating CNCF ecosystem tools.

Location: Must be based in the United States (per US Department of Labor compliance and commuter benefits).

Company

Scale AI develops reliable AI systems and high-quality data technologies that power the world's leading models for enterprises and governments.

What you will do

Architect and scale a multi-tenant orchestration layer for GPU clusters to ensure high utilization and seamless job recovery.
Design and implement scheduling primitives to optimize the lifecycle of training jobs.
Build deep observability and automated health checking into the training stack to isolate hardware failures.
Evaluate and integrate emerging technologies in the CNCF and AI ecosystem, such as Ray and Kueue.
Collaborate with Finance and Procurement teams to drive the capacity planning process.
Own projects end-to-end, from requirements and scoping to implementation.

Requirements

5+ years of experience in backend or infrastructure engineering.
At least 2 years of experience orchestrating ML workloads at scale (100+ GPU nodes).
Strong programming skills in Python, Go, Rust, or C++.
Expert-level knowledge of Kubernetes internals, including Custom Resources, Operators, and Admission Controllers.
Experience with distributed training infrastructure (EFA, InfiniBand) and distributed storage (Lustre, S3).
Must have valid work authorization for the United States.

Nice to have

Experience with distributed training techniques such as DeepSpeed or FSDP.
Experience with the NVIDIA software and hardware stack (CUDA, NCCL).
Experience with PyTorch.
Familiarity with post-training algorithms like GRPO and Reinforcement Learning.

Culture & Benefits

Comprehensive health, dental, and vision coverage.
Retirement benefits and a learning and development stipend.
Generous PTO and potential commuter stipend.
Inclusive and equal opportunity workplace committed to diversity.
