Company hidden
21 hours ago

Staff Software Engineer (AI Runtime)

$190,000 - $265,000
Work format
onsite
Employment type
fulltime
Grade
senior
English
B2
Country
US
Vacancy from Hirify RU Global, a list of companies with Eastern European roots

Job description

TL;DR

Staff Software Engineer (AI Runtime): Building and scaling the managed GPU training platform for large-scale AI models, with an emphasis on distributed training performance and fault tolerance. Focus on designing multi-node orchestration, optimizing GPU efficiency, and developing resilience foundations for frontier-scale foundation models.

Location: Mountain View, California or San Francisco, California

Salary: $190,000 — $265,000 USD

Company

hirify.global is a data and AI company providing a Data Intelligence Platform that unifies data, analytics, and AI for over 10,000 organizations worldwide.

What you will do

  • Drive the architecture and evolution of the AI Runtime (AIR) managed GPU training platform for scalable, high-throughput training.
  • Solve complex problems in multi-node orchestration, distributed parallelism strategies, and GPU scheduling.
  • Optimize GPU efficiency and training performance to raise utilization and lower cost per training run.
  • Build resilience and observability foundations to detect and recover from hardware and software failures.
  • Partner with product and research teams to shape APIs, CLI, and the developer experience for production training jobs.
  • Mentor senior engineers and champion engineering excellence to shape the long-term technical direction of AI training infrastructure.

Requirements

  • 10+ years of experience building and operating large-scale distributed systems, GPU training infrastructure, or ML systems.
  • Hands-on experience with distributed training frameworks such as PyTorch, FSDP, DeepSpeed, or Megatron.
  • Deep understanding of parallelism strategies (data, tensor, pipeline, and sequence parallelism).
  • Strong grasp of GPU performance fundamentals, including NVLink, InfiniBand, and collective communication.
  • Experience building managed, multi-tenant cloud platform products with clear SLAs and SLOs.
  • BS in Computer Science or a related field (MS or PhD preferred).

Culture & Benefits

  • Comprehensive benefits and perks tailored to the employee's region.
  • Opportunity to work on the most demanding workloads in computing, including frontier-scale foundation models.
  • Collaborative environment partnering across product, research, and platform teams.
  • Commitment to diversity, inclusion, and equal employment opportunity standards.
