Senior Site Reliability Engineer (AI Infrastructure)
Job description
TL;DR
Senior Site Reliability Engineer (AI Infrastructure): Designing and operating large-scale GPU infrastructure for distributed training and inference, with an emphasis on high-performance networking and hardware reliability. The role focuses on optimizing GPU cluster architecture, diagnosing fabric-level issues, and building production-grade automation for AI workloads.
Location: Global Remote / San Francisco, CA
Company
builds the liquidity layer for global AI compute, providing scaled infrastructure for early-stage startups and leading AI labs.
What you will do
- Design and evolve multi-provider, multi-region GPU compute clusters optimized for large-scale training.
- Act as the primary technical partner for customers, onboarding and optimizing their large-scale training workloads.
- Define SLOs and error budgets tailored to GPU-specific failure modes such as ECC errors and NVLink degradation.
- Manage the health and performance of high-speed interconnects including InfiniBand, RoCE, and NVLink.
- Build deep observability for GPU utilization, memory pressure, and interconnect throughput.
- Develop production-grade automation for cluster provisioning, health checks, and firmware lifecycle management.
Requirements
- Deep hands-on experience operating large-scale GPU clusters (NVIDIA A100/H100/B200).
- Production experience with InfiniBand, RoCE, or NVLink fabrics for distributed training.
- Knowledge of ML frameworks and systems-level training operations (NCCL, CUDA, PyTorch, DeepSpeed, FSDP).
- Expert-level Linux skills, including kernel tuning and NVIDIA driver management.
- Strong experience running Kubernetes with GPU workloads or using HPC schedulers like Slurm.
- Software engineering proficiency in Python, Go, or Bash, and experience with Infrastructure-as-Code.
Nice to have
- Experience with high-performance parallel file systems such as VAST, Weka, or Lustre.
- Proven track record in profiling and optimizing distributed training performance, measured by Model FLOPs Utilization (MFU).
- Experience in physical cluster design, including rack layout and network topology.
- Previous experience leading or mentoring other infrastructure engineers.
Culture & Benefits
- High-impact role with significant ownership and autonomy to shape foundational AI systems.
- Opportunity to architect the infrastructure backbone for reliable, scalable AI compute.
- Collaboration with world-class AI labs and data center providers.
- Inclusive, equal-opportunity work environment.