19 часов назад
Staff SRE (AI)
Мэтч & Сопровод
Для мэтча с этой вакансией нужен Plus
Описание вакансии
Текст:
TL;DR
Staff SRE (AI): Owning the reliability of Andromeda's infrastructure end to end, from node racking to customer-facing training runs with an accent on GPU infrastructure at scale and multi-thousand-GPU run health. Focus on leading incident responses, production operations of GPU fleets, and building observability & health systems.
Location: North America Remote / San Francisco, CA
Company
gives early-stage startups access to scaled AI infrastructure.
What you will do
- Carry the pager and lead incident responses for top-customer training runs and multi-cluster incidents.
- Own the day-to-day health of thousands of GPUs across providers and generations, including node lifecycle, validation, and repair workflows.
- Build and own telemetry, GPU health checks, fabric monitoring, and automated remediation.
- Define on-call practices, including rotations, escalation, runbooks, and incident command.
- Be the senior reliability voice with sophisticated AI infra customers and providers.
- Partner with the product team to design with SLOs, error budgets, and failure modes in mind.
Requirements
- Multiple years building and operating large-scale GPU infrastructure.
- A clear history of owning the reliability of load-bearing infrastructure.
- Deep, hands-on experience with NVIDIA H100/H200/B200/GB200 at scale.
- Real production experience with InfiniBand, RoCE, and NVLink fabrics for distributed training.
- Working knowledge of how large training jobs run, including NCCL, CUDA, and PyTorch distributed.
- Strong Go, Python, or Rust skills for building production tooling and automation.
- Expert-level Linux and systems internals knowledge.
- Comfortable being the senior engineer on a P0 bridge with customers and providers.
- Comfortable being the senior technical voice with AI infra customers, providers, and prospects.
Nice to have
- Experience building custom GPU health systems, fleet managers, or fabric controllers.
- Distributed storage depth and opinions on checkpoint I/O patterns at scale.
- Experience profiling and diagnosing distributed training.
- Experience as the senior SRE partner in enterprise relationships for AI infrastructure or HPC.
- Open-source contributions in the GPU/AI infra stack.
- Public talks, writing, or community presence in the GPU/AI infra industry.
Culture & Benefits
- Significant autonomy and leverage working on infrastructure depended on by ambitious AI labs.
- Opportunity to stay hands-on in code, customer interactions, and critical incidents.
- Equal opportunity employer committed to diversity and inclusion.
Будьте осторожны: если работодатель просит войти в их систему, используя iCloud/Google, прислать код/пароль, запустить код/ПО, не делайте этого - это мошенники. Обязательно жмите "Пожаловаться" или пишите в поддержку. Подробнее в гайде →