Senior Manager of Cloud Platform and Site Reliability (AI)

$165,000 – $330,000

Work format

hybrid

Employment type

full-time

Grade

head

English level

Country

Listing from Hirify Global, a list of international tech companies

Job description

TL;DR

Senior Manager of Cloud Platform and Site Reliability (AI Infrastructure): Lead and grow the infrastructure organization that powers a machine learning platform, with an emphasis on multi-cloud capacity, GPU inference infrastructure, and reliability standards. The role focuses on managing team leads, establishing org-wide SLOs/SLIs, and scaling cloud infrastructure to support frontier AI models.

Location: Hybrid (San Francisco)

Salary: $165,000 – $330,000 + Equity

Company

hirify.global powers mission-critical inference for the world's most dynamic AI companies by uniting applied AI research, flexible infrastructure, and seamless developer tooling.

What you will do

Lead, grow, and develop team leads across the Cloud Platform and Site Reliability Engineering organizations.
Set the technical direction and multi-year roadmap for infrastructure, reliability, and platform engineering.
Own the platform's reliability posture, establishing standards for SLOs/SLIs, incident response, and observability-as-code.
Collaborate with product and engineering teams to align infrastructure capabilities with enterprise customer requirements.
Oversee high-severity incident management and escalation processes to ensure rapid resolution and systemic follow-through.
Ensure the consistent adoption of best practices for Kubernetes, IaC, GitOps, and cloud resource management.

Requirements

Proven experience managing managers and leading multiple high-performing infrastructure or SRE teams in a high-growth environment.
Deep technical expertise in Kubernetes (multi-cloud across EKS, GKE) and distributed systems.
Hands-on background with infrastructure-as-code (Terraform, Pulumi) and CI/CD tooling (GitHub Actions, GitLab CI, Jenkins).
Strong foundation in observability tooling (Prometheus, VictoriaMetrics, Grafana, OpenTelemetry).
Experience owning incident management and enterprise SLAs at scale, including executive-level communication.
Bachelor's, Master's, or Ph.D. degree in Computer Science, Engineering, Mathematics, or a related field.

Nice to have

Experience with GPU infrastructure (H100s, B200s), including fractional GPU provisioning and multi-node model serving.
Familiarity with running high-performance AI models and troubleshooting ML pipelines.
Experience with incident management platforms like incident.io or PagerDuty.
Proven track record of scaling an SRE practice and building self-healing automations.

Culture & Benefits

Competitive compensation with meaningful equity grants.
100% coverage of medical, dental, and vision insurance for employees and dependents.
Flexible PTO policy, including a company-wide Winter Break from Christmas Eve to New Year's Day.
Company-facilitated 401(k) and paid parental leave.
Fertility and family-building stipend through Carrot.
Opportunity to work with a variety of cutting-edge ML startups.
