TL;DR
Senior Staff Technical Program Manager (Reliability): Lead strategy, execution, and continuous improvement of critical reliability initiatives across infrastructure and product engineering teams for a data and AI infrastructure platform, with an emphasis on multi-cloud infrastructure, operational excellence, and distributed systems. The role focuses on defining reliability strategy, setting long-term goals, executing multi-quarter programs, and driving adoption of best practices.
Location: Mountain View, California; San Francisco, California
Salary: $189,800–$256,160 USD
Company
hirify.global is the data and AI company, empowering data teams to solve complex challenges by building and operating the world's best data and AI infrastructure platform.
What you will do
- Define and lead the long-term Reliability roadmap and strategy with senior engineering leadership.
- Drive end-to-end execution of critical Reliability programs, including planning, risk management, and delivery.
- Partner with engineering teams to influence technical direction and make sound design decisions for infrastructure and distributed systems.
- Identify process and architecture gaps, driving improvements in scalability, fault tolerance, and automation.
- Elevate reliability culture by driving adoption of best practices (error budgets, incident reviews) and implementing program governance.
Requirements
- 10+ years of experience managing and delivering large-scale technical programs in cloud infrastructure, distributed systems, SRE, or platform engineering environments.
- Experience developing infrastructure at two or more hyperscale cloud providers (e.g., AWS, Azure, GCP).
- Demonstrated success leading Reliability Programs at scale (availability, failover, incident reduction).
- Strong understanding of infrastructure, distributed systems, or SRE practices; previous engineering or SRE experience preferred.
- Ability to translate ambiguous goals into actionable program plans with clear milestones and KPIs.
- Experience managing complex cross-organizational dependencies and multi-quarter timelines.
Nice to have
- Background in distributed systems engineering, SRE, platform infrastructure, or cloud services.
- Experience with large-scale compute fleets, container orchestration, or autoscaling.
- Familiarity with reliability methodologies such as SLOs, error budgets, chaos engineering, and incident management.
Culture & Benefits
- Comprehensive benefits and perks offered.
- Commitment to fostering a diverse and inclusive culture.
- Opportunities to work on a world-class data and AI infrastructure platform.