TL;DR
Fleet Reliability Technical Program Manager (AI): Lead fleet reliability initiatives across the full system lifecycle, from provisioning to steady-state operations, with an emphasis on identifying systemic reliability gaps and defining success metrics. The role focuses on driving complex, cross-functional programs that improve fleet delivery, readiness, and operational stability at scale.
Location: Hybrid in Livingston, NJ; New York, NY; Sunnyvale, CA; or Bellevue, WA. Remote work may be considered for candidates located more than 30 miles from an office, based on role requirements for specialized skill sets. Must be a U.S. person (U.S. citizen, lawful permanent resident, refugee, or asylee) due to export control regulations.
Salary: $188,000–$275,000
Company
hirify.global is a publicly traded cloud company specializing in AI infrastructure, delivering a platform of technology and tools to build and scale AI with confidence.
What you will do
- Own end-to-end fleet reliability outcomes across Day 0 (provisioning, validation, bring-up) through Day 2 (steady-state operations, incident reduction, lifecycle management).
- Identify systemic reliability gaps across hardware, firmware, software, networking, storage, and operational processes.
- Drive alignment on reliability goals, SLAs/SLOs, and error budgets across engineering and operations teams.
- Lead complex, cross-functional programs to improve fleet delivery, readiness, and operational stability.
- Establish and own fleet reliability metrics and dashboards (e.g., failure rates, MTTR, provisioning success, incident trends).
- Use data and post-incident learnings to prioritize reliability investments and drive corrective actions.
Requirements
- Bachelor's degree in Computer Engineering or a related technical field.
- 10+ years of experience in technical program management in large-scale compute infrastructure, cloud, or platform environments.
- Background in observability, monitoring, or telemetry systems (e.g., Prometheus, Grafana, OpenTelemetry).
- Proven experience driving cross-functional engineering programs focused on reliability, availability, or operational excellence.
- Strong technical aptitude across infrastructure domains (compute, storage, networking, hardware, or SRE).
- Demonstrated ability to use data and metrics to drive prioritization, execution, and decision-making.
- Must be a U.S. person (U.S. citizen, lawful permanent resident, refugee, or asylee) for Export Control Compliance.
Nice to have
- Experience operating at scale in data center, cloud infrastructure, or hyperscale environments.
- Familiarity with reliability frameworks such as SLIs/SLOs, error budgets, and incident management practices.
Culture & Benefits
- Medical, dental, and vision insurance (100% paid for by hirify.global).
- Company-paid life insurance and voluntary supplemental life insurance.
- Flexible Spending Account, Health Savings Account, and tuition reimbursement.
- Ability to participate in the Employee Stock Purchase Program (ESPP).
- Mental Wellness Benefits through Spring Health and family-forming support by Carrot.
- Paid parental leave and flexible, full-service childcare support with Kinside.
- 401(k) with a generous employer match.
- Flexible PTO.
- Catered lunch each day in our office and data center locations.