Site Reliability Engineer (AI)
Job description
TL;DR
Site Reliability Engineer (AI): Own production reliability and platform engineering for user-facing AI products, including Devin, with an emphasis on SLOs, incident response, and CI/CD pipelines. Focus on building monitoring and observability systems, reducing toil through automation, and ensuring infrastructure scales with hundreds of thousands of daily users.
Location: On-site in San Francisco Bay Area
Company
Applied AI lab building end-to-end software agents like Devin, the first AI software engineer, and an AI-native IDE.
What you will do
- Define and own SLOs, SLIs, error budgets, monitoring, alerting, and observability for Devin and the AI-native IDE.
- Lead incident response, run blameless postmortems, and build runbooks and tooling for sustainable on-call.
- Own deployment pipelines, release infrastructure, CI/CD, and internal developer tooling to enable fast shipping.
- Manage cloud infrastructure as code with reproducible, version-controlled environments.
- Perform capacity planning, performance profiling, and growth modeling.
- Integrate security into reliability practices and foster reliability culture across product and engineering teams.
Requirements
- Deep experience running production systems at scale: SLOs, error budgets, on-call rotations, incident command.
- Strong software engineering fundamentals; you write real code.
- Proficiency with cloud infrastructure (AWS, GCP, or Azure), Kubernetes, Terraform or equivalent IaC.
- Experience building and owning CI/CD pipelines and deployment infrastructure.
- Strong observability skills: instrumentation, dashboards, effective alerting.
- Track record of systematic toil reduction through automation.
- Comfort owning incidents end-to-end and product empathy for user-facing reliability.
Nice to have
- Experience with developer-facing products or platforms.
Culture & Benefits
- Small, talent-dense team of competitive programmers, founders, and AI researchers from Scale AI, Palantir, Cursor, and Google DeepMind.
- High ownership and trust: set your own reliability standards.
- Proactive, systematic environment treating reliability as a craft.
- Ship products used by hundreds of thousands of developers daily.