Senior Staff Production Engineer (AI)
Job description
TL;DR
Senior Staff Production Engineer (AI): Drive automation and observability across a multi-cloud infrastructure, with an emphasis on reducing Mean Time to Mitigate (MTTM) and improving scalability. The role focuses on implementing self-healing systems, defining SLIs/SLOs, and leading incident response to ensure the reliability of a global platform.
Location: Hybrid in San Jose, CA (3 days a week) or remote within the US.
Salary: $140,000 - $200,000 USD
Company
The company accelerates digital transformation by providing a cloud-native Zero Trust Exchange platform that secures connections between users, devices, and applications.
What you will do
- Design and implement highly available, scalable infrastructure across AWS, Azure, GCP, and bare-metal environments.
- Drive an "automation-first" culture by writing code (Python/Go) to eliminate manual toil and build self-healing systems.
- Implement and maintain sophisticated observability (Prometheus, Grafana, OpenTelemetry), define SLIs/SLOs, and establish error budgets.
- Act as a lead Incident Commander (TDO on-call), develop response playbooks, and conduct deep-dive post-incident analyses.
- Collaborate with Engineering and partner teams to conduct operability reviews.
Requirements
- 8+ years of experience managing reliability, scalability, and availability for large-scale production services.
- Deep expertise in programming (e.g., Python, Go, or C/C++).
- Strong background in networking protocols, Linux/FreeBSD systems, and distributed architecture.
- Experience in high-stakes incident management and participation in a 24/7 on-call rotation.
- Proficiency in leveraging ITIL frameworks and incident data to drive service maturity through systematic problem management and technical operability reviews.
Nice to have
- Extensive experience with public cloud (AWS, Azure, GCP) and Infrastructure-as-Code (Ansible, Terraform).
- Experience with chaos engineering and disaster recovery planning at scale.
- Expertise in global routing (BGP) and traffic tunneling (GRE, IPsec), with a deep understanding of L7 proxy architectures (HAProxy), DNS at scale, and OS networking stack internals.
Culture & Benefits
- Impact in your role matters more than title, and trust is built on results.
- Value constructive, honest debate and focus on getting to the best ideas faster.
- Build high-performing teams that can make an impact quickly and with high quality.
- Committed to building a team that reflects the communities served and the customers worked with.
- Foster an inclusive environment that values all backgrounds and perspectives, emphasizing collaboration and belonging.
- Offer comprehensive and inclusive benefits to meet the diverse needs of employees and their families throughout their life stages.