Senior Manager, Production Engineering (AI)
Job description
TL;DR
Senior Manager, Production Engineering (AI): Leading and expanding the SRE team to ensure the reliability and performance of a large-scale AI cloud platform, with an emphasis on operational excellence and automation. The role focuses on designing incident management processes, scaling distributed infrastructure, and implementing self-healing systems.
Location: Must be based in the US (Livingston, NJ; New York, NY; San Francisco, CA; Sunnyvale, CA; or Bellevue, WA). Must be a U.S. person (citizen, green card holder, etc.) for export control compliance.
Salary: $207,000 – $275,000
Company
The company is the Essential Cloud for AI, providing high-performance infrastructure and technical expertise for AI labs, startups, and global enterprises.
What you will do
- Execute the SRE vision and roadmap for large-scale, distributed cloud infrastructure.
- Lead and mentor a high-performing team of SREs, promoting a culture of ownership and continuous learning.
- Champion automation-first practices using AI, Terraform, Kubernetes, and Infrastructure-as-Code.
- Establish and evolve Operational Excellence best practices so that platform operations stay proactive rather than reactive.
- Drive initiatives for incident management, root cause analysis, and system hardening.
- Collaborate with engineering and product teams to build scalable, resilient, and self-healing systems.
Requirements
- Bachelor’s degree in Computer Science, Engineering, or related field.
- 10+ years in leadership or senior management at a cloud provider, hyperscaler, or high-growth tech company.
- Experience managing geographically distributed 24x7 engineering teams.
- Expertise in designing incident management processes (on-call rotations, SLO/SLA frameworks, postmortems).
- Deep understanding of distributed systems, networking, and storage architecture.
- Must be a U.S. person (U.S. citizen, national, lawful permanent resident, refugee, or asylee) to comply with export control regulations.
Nice to have
- Experience with GPU-accelerated workloads, resource isolation, and performance tuning.
- Prior leadership in bare metal infrastructure environments (custom data centers, HPC clusters).
- Working knowledge of DPUs, service mesh architectures, and multi-tenant security models.
- Experience in AI infrastructure supporting training or inference at scale.
Culture & Benefits
- Comprehensive health, dental, and vision insurance (100% paid by the company).
- 401(k) with a generous employer match and ESPP participation.
- Equity awards and discretionary bonuses.
- Flexible PTO and paid parental leave.
- Daily catered lunch at office and data center locations.
- Support for mental wellness and family-forming (Spring Health, Carrot).