Company hidden
2 days ago

Senior Site Reliability Engineer (AI Infrastructure)

Work format
remote (Global)
Employment type
fulltime
Grade
senior
English
b2
Country
US
Vacancy from Hirify Global, a list of international tech companies

Job description

TL;DR

Senior Site Reliability Engineer (AI Infrastructure): Designing and operating large-scale GPU infrastructure for distributed training and inference, with an emphasis on high-performance networking and hardware reliability. Focus on optimizing GPU cluster architecture, diagnosing fabric-level issues, and building production-grade automation for AI workloads.

Location: Global Remote / San Francisco, CA

Company

hirify.global builds the liquidity layer for global AI compute, providing scaled infrastructure for early-stage startups and leading AI labs.

What you will do

  • Design and evolve multi-provider, multi-region GPU compute clusters optimized for large-scale training.
  • Act as the primary technical partner for customers, onboarding and optimizing their large-scale training workloads.
  • Define SLOs and error budgets tailored to GPU-specific failure modes such as ECC errors and NVLink degradation.
  • Manage the health and performance of high-speed interconnects including InfiniBand, RoCE, and NVLink.
  • Build deep observability for GPU utilization, memory pressure, and interconnect throughput.
  • Develop production-grade automation for cluster provisioning, health checks, and firmware lifecycle management.

Requirements

  • Deep hands-on experience operating large-scale GPU clusters (NVIDIA A100/H100/B200).
  • Production experience with InfiniBand, RoCE, or NVLink fabrics for distributed training.
  • Knowledge of ML frameworks and systems-level training operations (NCCL, CUDA, PyTorch, DeepSpeed, FSDP).
  • Expert-level Linux skills, including kernel tuning and NVIDIA driver management.
  • Strong experience running Kubernetes with GPU workloads or using HPC schedulers like Slurm.
  • Software engineering proficiency in Python, Go, or Bash, and experience with Infrastructure-as-Code.

Nice to have

  • Experience with high-performance parallel file systems such as VAST, Weka, or Lustre.
  • Proven track record in profiling and optimizing distributed training performance (for example, Model FLOPs Utilization, MFU).
  • Experience in physical cluster design, including rack layout and network topology.
  • Previous experience leading or mentoring other infrastructure engineers.

Culture & Benefits

  • High-impact role with significant ownership and autonomy to shape foundational AI systems.
  • Opportunity to architect the infrastructure backbone for reliable, scalable AI compute.
  • Collaboration with world-class AI labs and data center providers.
  • Inclusive, equal-opportunity work environment.
