Senior Production Engineer (AI)
Job description
TL;DR
Senior Production Engineer (AI/Cloud): Build and operate high-reliability cloud infrastructure for AI workloads, with an emphasis on operational safety, delivery velocity, and service availability. The role focuses on designing automated remediation, scaling Kubernetes clusters, and eliminating single points of failure in distributed systems.
Location: Hybrid (Livingston, NJ; New York, NY; Sunnyvale, CA; Bellevue, WA). Remote may be considered for candidates located more than 30 miles from an office. Must be a U.S. person (Citizen, Green Card, etc.) due to export control regulations.
Salary: $139,000 – $204,000
Company
The company is a specialized cloud provider designed to accelerate AI development through high-performance GPU infrastructure.
What you will do
- Take hands-on ownership of critical systems and frameworks, driving architecture, implementation, and long-term evolution.
- Lead end-to-end delivery of engineering projects to improve availability, scalability, and operational automation.
- Build and maintain observability, alerting, and resilience testing for supported systems.
- Drive deep root-cause investigations during incident response and implement lasting technical fixes.
- Ship production code regularly using Python, Go, or similar languages and participate in on-call rotations.
- Collaborate with platform teams to integrate reliability best practices into new features and services.
Requirements
- 7+ years of engineering experience building and operating distributed systems or cloud platforms.
- Strong proficiency in Python, Go, or similar languages for shipping production services.
- Deep knowledge of Kubernetes and cloud-native distributed system patterns.
- Experience with modern observability stacks (metrics, tracing, logs, SLOs/SLIs).
- Must be a U.S. person (Citizen, National, Lawful Permanent Resident, Refugee, or Asylee) to comply with export control regulations.
Nice to have
- Experience building internal tooling and frameworks for high-availability cloud operations.
- Background in operating large-scale AI or GPU-accelerated infrastructure.
- Familiarity with Chaos Engineering, DR/BCP, or capacity planning.
Culture & Benefits
- 100% company-paid medical, dental, and vision insurance.
- 401(k) with a generous employer match and Employee Stock Purchase Program (ESPP).
- Flexible PTO and paid parental leave.
- Catered lunch daily in office and data center locations.
- Comprehensive wellness benefits, including mental health support and family-forming assistance.