Operations Manager, Fleet Reliability (AI)
Job description
TL;DR
Operations Manager, Fleet Reliability (Infrastructure): Lead a 24/7 team that provisions, updates, and triages server nodes for a high-scale AI cloud platform, with an emphasis on fleet reliability and observability. The role focuses on driving automation for node delivery, improving incident management processes, and scaling the fleet lifecycle.
Location: Hybrid in Bellevue, WA. Remote work may be considered for US-based candidates located more than 30 miles from an office. Must be a U.S. person (citizen, green card holder, etc.) for export control compliance.
Salary: $143,000 – $210,000
Company
A specialized cloud provider delivering high-performance infrastructure that enables innovators to build and scale AI.
What you will do
- Build and lead a 24/7 team of reliability and observability-focused engineers.
- Develop and document consistent processes for provisioning, validating, and troubleshooting server nodes in the fleet.
- Advocate for process and automation improvements, prioritizing event-driven automated remediation.
- Provide 24/7 engineering support for high-criticality, time-sensitive node delivery and maintenance.
- Drive onboarding, documentation, enablement, and performance management to foster team growth.
- Cultivate a culture of collaboration and effective communication within the team and across the organization.
Requirements
- 7+ years of experience in software or infrastructure engineering.
- 2+ years of experience in a leadership capacity.
- Background in SRE fundamentals, incident management, observability, and change management.
- Strong commitment to automation and adoption of cross-team tooling.
- Must be a U.S. person (U.S. citizen, national, lawful permanent resident, refugee, or asylee) due to export control regulations.
Culture & Benefits
- 100% company-paid medical, dental, and vision insurance.
- 401(k) with generous employer match and Employee Stock Purchase Program (ESPP).
- Flexible PTO and paid parental leave.
- Mental wellness benefits through Spring Health and family-forming support via Carrot.
- Flexible, full-service childcare support with Kinside.
- Catered daily lunch in office and data center locations.