Site Reliability Engineer (AI)
Job description
TL;DR
Site Reliability Engineer (AI): Establish and manage reliability, observability, and incident response for AI services, with an emphasis on SLOs, error budgets, and operational readiness reviews. The role focuses on automating toil, monitoring AI-specific failure modes, and ensuring production scalability.
Location: Remote (100% telework). Candidates must be US citizens and hold a DoD Secret security clearance at time of hire.
Salary: $142,696 – $158,303 per year
Company
The company develops a diverse portfolio of high-technology solutions and products for defense and scientific missions.
What you will do
- Define and manage SLOs and error budgets for every AI service moving to production.
- Build and maintain monitoring, logging, and alerting infrastructure to detect degradations.
- Establish incident management procedures, lead post-incident reviews, and drive corrective actions.
- Perform operational readiness reviews to ensure services meet reliability, security, and operational standards.
- Forecast capacity needs and monitor costs associated with tokens, compute, and storage.
- Identify and automate repetitive operational tasks to eliminate toil.
Requirements
- Bachelor's or Master's degree in Software Engineering, Computer Science, or a related STEM field.
- 5-8+ years of production SRE or DevOps experience owning system reliability.
- Hands-on experience with Prometheus, Grafana, Datadog, ELK, or CloudWatch.
- Strong scripting and automation skills using Python, Bash, and IaC (Terraform or CloudFormation).
- Experience with Docker and Kubernetes orchestration at scale.
- US citizenship and Department of Defense Secret security clearance are required at time of hire.
Nice to have
- Experience with AI/ML production systems, including model serving and inference monitoring.
- Multi-cloud experience across AWS, Azure, and GCP.
- Familiarity with Google SRE principles and their practical application.
- Experience in high-compliance environments such as defense, healthcare, or finance.
Culture & Benefits
- 100% telework (fully remote) flexibility.
- 9/80 work schedule.
- Highly competitive benefits package.
- Collaborative environment valuing trust, honesty, and transparency.
- Opportunity to define SRE practices from scratch for an AI platform.