Senior Manager, Observability (AI)
Job description
TL;DR
Senior Manager, Observability (AI): Leads the team responsible for building and scaling observability platforms for AI infrastructure, with an emphasis on metrics, logs, traces, and telemetry pipelines. The role centers on driving platform reliability, guiding architectural decisions, and ensuring system scalability in a hyper-growth cloud environment.
Location: Hybrid (Sunnyvale, CA); remote work is considered for candidates located more than 30 miles from an office. Must be a U.S. person (citizen, green card holder, etc.) for export control compliance.
Salary: $188,000 – $275,000
Company
A specialized cloud provider built to accelerate AI breakthroughs with high-performance infrastructure.
What you will do
- Lead the team responsible for building and operating observability systems across metrics, logs, traces, and telemetry pipelines.
- Define the observability strategy and roadmap to support rapid scaling of AI infrastructure.
- Guide architectural decisions and drive improvements in platform reliability and performance.
- Partner with infrastructure, platform, security, and application engineering teams to improve production visibility.
- Manage and scale the engineering team through strategic hiring and mentorship.
Requirements
- 8+ years of software engineering experience with production systems at scale.
- 4+ years of engineering management experience leading senior engineers and technical leads.
- Experience building observability platforms (logs, metrics, traces, alerting) in distributed systems.
- Knowledge of reliability engineering concepts including SLOs, SLIs, incident management, and error budgets.
- Experience scaling telemetry collection pipelines, storage backends, and query layers.
- Must be a U.S. person as defined by U.S. Government export regulations.
Nice to have
- Experience with OpenTelemetry, Grafana, and Prometheus-compatible systems.
- Experience operating cloud-native infrastructure and Kubernetes environments.
- Background in supporting large-scale cloud, developer platforms, or AI/ML infrastructure.
- Familiarity with capacity planning for high-ingest telemetry systems.
Culture & Benefits
- Medical, dental, and vision insurance 100% paid by the company.
- 401(k) with a generous employer match and Employee Stock Purchase Program (ESPP).
- Flexible PTO and mental wellness benefits through Spring Health.
- Comprehensive family support including paid parental leave and childcare support via Kinside.
- Catered lunch daily at office and data center locations.