Senior Site Reliability Engineer (AI)
Job description
TL;DR
Senior Site Reliability Engineer (AWS/AI): Building and scaling the reliability and AI operations foundation for a semiconductor intelligence platform, with an emphasis on LLM observability, blast radius containment, and automated recovery. The role focuses on designing reliability patterns for AI agent pipelines, maturing active-active architectures, and establishing a comprehensive Internal Developer Platform (IDP).
Location: Remote for candidates based in the United Kingdom
Salary: £77,600 – £82,200 GBP
Company
The company is an information platform providing in-depth intelligence and reverse engineering analysis for the semiconductor industry.
What you will do
- Own SLOs, SLIs, and error budgets for production services and drive error budget discipline across engineering.
- Design reliability patterns for AI agent pipelines, including LLM observability, tool-use tracking, and failure detection.
- Architect blast radius containment and mature the Canada Central/West active-active architecture toward a 24-hour RTO.
- Lead CI/CD pipeline strategy using Bitbucket Pipelines and GitHub Actions to optimize deployment frequency.
- Operate Datadog for service health and extend observability to AI workloads like token consumption and agent completion rates.
- Mentor junior and intermediate SRE engineers and drive IDP adoption via Backstage or Atlassian Compass.
Requirements
- 6–8 years of progressive experience in SRE, platform engineering, or DevOps with technical leadership.
- Deep expertise in AWS (EKS, Lambda, CloudWatch) and multi-region architecture patterns.
- Proficiency with Terraform, GitOps, and policy-as-code (Sentinel, OPA/Rego).
- Hands-on operational depth in Datadog, including dashboards, SLO tracking, and distributed tracing.
- Strong containerization expertise with Docker and Kubernetes (EKS preferred).
- Must be based in the United Kingdom.
Nice to have
- Experience designing reliability architecture for agentic AI systems and LLM-dependent services.
- AWS Professional certifications (Solutions Architect or DevOps Engineer).
- FinOps Certified Practitioner or cloud cost management experience at scale.
- Experience in semiconductor, SaaS, or data-intensive platform environments.
Culture & Benefits
- Company-sponsored training and development opportunities.
- Comprehensive benefits package including health, dental, vision, wellness, and retirement.
- Flexible vacation policy and annual fitness reimbursement.
- Inclusive environment prioritizing diversity, equity, and accessibility.
- Community involvement opportunities through charitable alliances.