Company hidden
2 days ago

Senior Manager of Cloud Platform and Site Reliability (AI)

$165,000 – $330,000
Work format
hybrid
Employment type
full-time
Grade
head
English
B2
Country
US

Job description

TL;DR

Senior Manager of Cloud Platform and Site Reliability (AI Infrastructure): Leading and growing the infrastructure organization that powers a machine learning platform, with an emphasis on multi-cloud capacity, GPU inference infrastructure, and reliability standards. The role focuses on managing team leads, establishing org-wide SLOs/SLIs, and scaling cloud infrastructure to support frontier AI models.

Location: Hybrid (San Francisco)

Salary: $165,000 – $330,000 + Equity

Company

hirify.global powers mission-critical inference for the world's most dynamic AI companies by uniting applied AI research, flexible infrastructure, and seamless developer tooling.

What you will do

  • Lead, grow, and develop team leads across the Cloud Platform and Site Reliability Engineering organizations.
  • Set the technical direction and multi-year roadmap for infrastructure, reliability, and platform engineering.
  • Own the platform's reliability posture, establishing standards for SLOs/SLIs, incident response, and observability-as-code.
  • Collaborate with product and engineering teams to align infrastructure capabilities with enterprise customer requirements.
  • Oversee high-severity incident management and escalation processes to ensure rapid resolution and systemic follow-through.
  • Ensure the consistent adoption of best practices for Kubernetes, IaC, GitOps, and cloud resource management.

Requirements

  • Proven experience managing managers and leading multiple high-performing infrastructure or SRE teams in a high-growth environment.
  • Deep technical expertise in Kubernetes (multi-cloud across EKS, GKE) and distributed systems.
  • Hands-on background with infrastructure-as-code (Terraform, Pulumi) and CI/CD tooling (GitHub Actions, GitLab CI, Jenkins).
  • Strong foundation in observability tooling (Prometheus, VictoriaMetrics, Grafana, OpenTelemetry).
  • Experience owning incident management and enterprise SLAs at scale, including executive-level communication.
  • Bachelor's, Master's, or Ph.D. degree in Computer Science, Engineering, Mathematics, or a related field.

Nice to have

  • Experience with GPU infrastructure (H100s, B200s), including fractional GPU provisioning and multi-node model serving.
  • Familiarity with running high-performance AI models and troubleshooting ML pipelines.
  • Experience with incident management platforms like incident.io or PagerDuty.
  • Proven track record of scaling an SRE practice and building self-healing automations.

Culture & Benefits

  • Competitive compensation with meaningful equity grants.
  • 100% coverage of medical, dental, and vision insurance for employees and dependents.
  • Flexible PTO policy, including a company-wide Winter Break from Christmas Eve to New Year's Day.
  • Company-facilitated 401(k) and paid parental leave.
  • Fertility and family-building stipend through Carrot.
  • Opportunity to work with a variety of cutting-edge ML startups.
