This job posting is archived (updated 2 months ago)
Senior Director Fleet Reliability Operations
$212,000–$311,000
Job description
TL;DR
Senior Director Fleet Reliability Operations (System Engineering): Lead the evolution and management of a global GPU server fleet with an emphasis on automation, resilience, and scale. The role focuses on architecting scalable, reliable, and automated infrastructure systems for supercomputing clusters and on leading a high-performing global operations team.
Location: Hybrid in Livingston, NJ / New York, NY / Sunnyvale, CA / Bellevue, WA, USA
Salary: $212,000–$311,000
Company
A publicly traded cloud infrastructure company specializing in AI-focused supercomputing platforms, delivering high-performance GPU server fleets for AI labs, startups, and enterprises.
What you will do
- Lead and grow a global management team for fleet reliability operations.
- Develop and drive the Fleet Operations roadmap prioritizing automation, resilience, and scale.
- Collaborate cross-functionally with hardware, platform, network, data center, and vendor teams.
- Champion operational excellence, metrics, and blameless incident response.
- Drive an automation-first strategy to reduce toil and increase innovation.
- Cultivate a culture of reliability, mentorship, and continuous improvement.
Requirements
- Must be a U.S. person or eligible to access export controlled information per U.S. Government regulations.
- 10+ years of experience in infrastructure, platform engineering, SRE, or DevOps.
- 5+ years of leadership experience managing mission-critical global production environments.
- Deep technical knowledge of data center operations, fleet provisioning, lifecycle management, and observability tooling.
- Strong automation, monitoring, and scalable fleet management skills.
- Effective communicator and collaborator across complex cross-functional teams.
Nice to have
- Experience managing global GPU or HPC dense compute infrastructure fleets.
- Background in architecture or development of infrastructure management platforms and workflows.
- Prior roles owning uptime, incident response, or reliability engineering in hyperscale environments.
Culture & Benefits
- Comprehensive medical, dental, and vision insurance, fully paid by the employer.
- Company-paid life insurance and disability coverage.
- Flexible spending and health savings accounts.
- Tuition reimbursement and participation in the employee stock purchase program.
- Paid parental leave, flexible PTO, and childcare support.
- 401(k) with employer match and casual, innovative work environment.
Hiring process
- Onboarding at one of the company hubs within the first month.
- Quarterly team gatherings to support collaboration.
- Reasonable accommodations provided for candidates with disabilities upon request.