Site Reliability Engineer (AI)
Мэтч & Сопровод
Для мэтча с этой вакансией нужен Plus
Описание вакансии
TL;DR
Site Reliability Engineer (AI/Infrastructure): Ensuring the stability, resilience, and observability of a global AI infrastructure platform with an accent on SLI/SLO adoption and incident response. Focus on automating operational workflows, reducing toil, and improving GPU performance visibility in high-scale distributed systems.
Location: Remote, USA
Salary: $150,000 – $200,000 USD
Company
is a foundational platform for developers to build and run custom AI systems that scale, focusing on high-performance infrastructure for modern AI workloads.
What you will do
- Define and enforce reliability standards, SLIs, and SLOs across critical services.
- Lead incident response, coordinate mitigation, and conduct blameless postmortems.
- Design and improve observability systems using Prometheus, Grafana, and custom tooling.
- Automate recurring operational workflows using Python, Go, and Bash to reduce toil.
- Partner with engineering teams to enhance fault tolerance and system resilience.
- Perform production readiness reviews for new features and identify systemic risks.
Requirements
- 5+ years of experience in SRE, Reliability Engineering, or Production Engineering.
- Must be based in the USA.
- Strong expertise in Linux systems, networking, and containerized production environments.
- Deep understanding of distributed systems and failure modes.
- Proven experience in incident response leadership and managing SLIs/SLOs.
- Strong scripting/programming skills and excellent written communication.
Nice to have
- Experience with GPU infrastructure or AI/ML platforms.
- Background in high-growth, large-scale environments or startup experience.
- Familiarity with Infrastructure as Code (IaC) and GPU observability tooling.
- Experience building internal reliability platforms or frameworks.
Culture & Benefits
- Meaningful equity and stock options in a fast-growing company.
- Generous medical, dental, and vision plans.
- Flexible PTO policy to support recharge and work-life balance.
- Remote-first culture with collaborative teams utilizing Slack for communication.
Будьте осторожны: если работодатель просит войти в их систему, используя iCloud/Google, прислать код/пароль, запустить код/ПО, не делайте этого - это мошенники. Обязательно жмите "Пожаловаться" или пишите в поддержку. Подробнее в гайде →