Senior Manager of Cloud Platform and Site Reliability (AI)
Job description
TL;DR
Senior Manager of Cloud Platform and Site Reliability (AI Infrastructure): Lead and grow the infrastructure organization that powers a machine learning platform, with an emphasis on multi-cloud capacity, GPU inference infrastructure, and reliability standards. The role focuses on managing team leads, establishing org-wide SLOs/SLIs, and scaling cloud infrastructure to support frontier AI models.
Location: Hybrid (San Francisco)
Salary: $165,000 – $330,000 + Equity
Company
The company powers mission-critical inference for the world's most dynamic AI companies by uniting applied AI research, flexible infrastructure, and seamless developer tooling.
What you will do
- Lead, grow, and develop team leads across the Cloud Platform and Site Reliability Engineering organizations.
- Set the technical direction and multi-year roadmap for infrastructure, reliability, and platform engineering.
- Own the platform's reliability posture, establishing standards for SLOs/SLIs, incident response, and observability-as-code.
- Collaborate with product and engineering teams to align infrastructure capabilities with enterprise customer requirements.
- Oversee high-severity incident management and escalation processes to ensure rapid resolution and systemic follow-through.
- Ensure the consistent adoption of best practices for Kubernetes, IaC, GitOps, and cloud resource management.
Requirements
- Proven experience managing managers and leading multiple high-performing infrastructure or SRE teams in a high-growth environment.
- Deep technical expertise in Kubernetes (multi-cloud across EKS, GKE) and distributed systems.
- Hands-on background with infrastructure-as-code (Terraform, Pulumi) and CI/CD tooling (GitHub Actions, GitLab CI, Jenkins).
- Strong foundation in observability tooling (Prometheus, VictoriaMetrics, Grafana, OpenTelemetry).
- Experience owning incident management and enterprise SLAs at scale, including executive-level communication.
- Bachelor's, Master's, or Ph.D. degree in Computer Science, Engineering, Mathematics, or a related field.
Nice to have
- Experience with GPU infrastructure, including fractional GPU provisioning and multi-node model serving (H100s, B200s).
- Familiarity with running high-performance AI models and troubleshooting ML pipelines.
- Experience with incident management platforms like incident.io or PagerDuty.
- Proven track record of scaling an SRE practice and building self-healing automations.
Culture & Benefits
- Competitive compensation with meaningful equity grants.
- 100% coverage of medical, dental, and vision insurance for employees and dependents.
- Flexible PTO policy, including a company-wide Winter Break from Christmas Eve to New Year's Day.
- Company-facilitated 401(k) and paid parental leave.
- Fertility and family-building stipend through Carrot.
- Opportunity to work with a variety of cutting-edge ML startups.