Назад
Company hidden
19 часов назад

Staff SRE (AI)

Формат работы
remote (только USA)
Тип работы
fulltime
Грейд
lead
Английский
b2
Страна
US
Вакансия из списка Hirify.GlobalВакансия из Hirify Global, списка международных tech-компаний
Для мэтча и отклика нужен Plus

Мэтч & Сопровод

Для мэтча с этой вакансией нужен Plus

Описание вакансии

Текст:
/

TL;DR

Staff SRE (AI): Owning the reliability of Andromeda's infrastructure end to end, from node racking to customer-facing training runs with an accent on GPU infrastructure at scale and multi-thousand-GPU run health. Focus on leading incident responses, production operations of GPU fleets, and building observability & health systems.

Location: North America Remote / San Francisco, CA

Company

hirify.global gives early-stage startups access to scaled AI infrastructure.

What you will do

  • Carry the pager and lead incident responses for top-customer training runs and multi-cluster incidents.
  • Own the day-to-day health of thousands of GPUs across providers and generations, including node lifecycle, validation, and repair workflows.
  • Build and own telemetry, GPU health checks, fabric monitoring, and automated remediation.
  • Define on-call practices, including rotations, escalation, runbooks, and incident command.
  • Be the senior reliability voice with sophisticated AI infra customers and providers.
  • Partner with the product team to design with SLOs, error budgets, and failure modes in mind.

Requirements

  • Multiple years building and operating large-scale GPU infrastructure.
  • A clear history of owning the reliability of load-bearing infrastructure.
  • Deep, hands-on experience with NVIDIA H100/H200/B200/GB200 at scale.
  • Real production experience with InfiniBand, RoCE, and NVLink fabrics for distributed training.
  • Working knowledge of how large training jobs run, including NCCL, CUDA, and PyTorch distributed.
  • Strong Go, Python, or Rust skills for building production tooling and automation.
  • Expert-level Linux and systems internals knowledge.
  • Comfortable being the senior engineer on a P0 bridge with customers and providers.
  • Comfortable being the senior technical voice with AI infra customers, providers, and prospects.

Nice to have

  • Experience building custom GPU health systems, fleet managers, or fabric controllers.
  • Distributed storage depth and opinions on checkpoint I/O patterns at scale.
  • Experience profiling and diagnosing distributed training.
  • Experience as the senior SRE partner in enterprise relationships for AI infrastructure or HPC.
  • Open-source contributions in the GPU/AI infra stack.
  • Public talks, writing, or community presence in the GPU/AI infra industry.

Culture & Benefits

  • Significant autonomy and leverage working on infrastructure depended on by ambitious AI labs.
  • Opportunity to stay hands-on in code, customer interactions, and critical incidents.
  • Equal opportunity employer committed to diversity and inclusion.

Будьте осторожны: если работодатель просит войти в их систему, используя iCloud/Google, прислать код/пароль, запустить код/ПО, не делайте этого - это мошенники. Обязательно жмите "Пожаловаться" или пишите в поддержку. Подробнее в гайде →