Senior Director, Fleet Reliability Operations

$212,000–$311,000

Work format

hybrid

Employment type

full-time

Grade

director

English

Country

Job description

Text:

TL;DR

Senior Director Fleet Reliability Operations (System Engineering): Lead the evolution and management of a global GPU server fleet with an emphasis on automation, resilience, and scale. Focus on architecting scalable, reliable, and automated infrastructure systems for supercomputing clusters and leading a high-performing global operations team.

Location: Hybrid in Livingston, NJ / New York, NY / Sunnyvale, CA / Bellevue, WA, USA

Salary: $212,000–$311,000

Company

hirify.global is a publicly traded cloud infrastructure company specializing in AI-focused supercomputing platforms, delivering high-performance GPU server fleets for AI labs, startups, and enterprises.

What you will do

Lead and grow a global management team for fleet reliability operations.
Develop and drive the Fleet Operations roadmap prioritizing automation, resilience, and scale.
Collaborate cross-functionally with hardware, platform, network, data center, and vendor teams.
Champion operational excellence, metrics, and blameless incident response.
Drive an automation-first strategy to reduce toil and increase innovation.
Cultivate a culture of reliability, mentorship, and continuous improvement.

Requirements

Must be a U.S. person or eligible to access export-controlled information per U.S. Government regulations.
10+ years of experience in infrastructure, platform engineering, SRE, or DevOps.
5+ years of leadership experience managing mission-critical global production environments.
Deep technical knowledge of data center operations, fleet provisioning, lifecycle management, and observability tooling.
Strong automation, monitoring, and scalable fleet management skills.
Effective communicator and collaborator across complex cross-functional teams.

Nice to have

Experience managing global GPU or HPC dense compute infrastructure fleets.
Background in architecture or development of infrastructure management platforms and workflows.
Prior roles owning uptime, incident response, or reliability engineering in hyperscale environments.

Culture & Benefits

Comprehensive medical, dental, and vision insurance fully paid by employer.
Company-paid life insurance and disability coverage.
Flexible spending and health savings accounts.
Tuition reimbursement and employee stock purchase program participation.
Paid parental leave, flexible PTO, and childcare support.
401(k) with employer match; casual, innovative work environment.

Hiring process

Onboarding at one of the company hubs within the first month.
Quarterly team gatherings to support collaboration.
Reasonable accommodations provided for candidates with disabilities upon request.