Senior Site Reliability Engineer (AI)
Job description
TL;DR
Senior Site Reliability Engineer (AI): Building and optimizing the reliability and AI operations foundation for a semiconductor intelligence platform, with an emphasis on LLM observability, agentic pipeline resilience, and multi-region AWS architecture. Focus on designing blast radius containment for AI agents, automating observability via Datadog, and scaling an Internal Developer Platform (IDP).
Location: Remote for candidates based in Canada
Salary: $125,200 - $132,500 CAD
Company
is the leading information platform providing in-depth intelligence and reverse engineering analysis for the semiconductor industry.
What you will do
- Own SLOs, SLIs, and error budgets for all production services, driving error budget discipline across engineering.
- Design reliability patterns for AI agent pipelines, including LLM observability, tool-use tracking, and graceful degradation.
- Architect blast radius containment through isolation and circuit breaking to bound customer impact from agent failures.
- Lead incident response and post-incident reviews, maturing the Canada Central/West active-active architecture toward a 24-hour RTO.
- Partner with AI Engineering on compute provisioning, model serving, inference latency, and workload isolation.
- Manage infrastructure as code via Terraform and GitOps, and oversee FinOps visibility for AWS cost segments.
Requirements
- Must be based in Canada
- Bachelor's degree in Computer Science, Engineering, or equivalent experience.
- 6–8 years of experience in SRE, platform engineering, or DevOps with demonstrated technical leadership.
- Deep expertise in AWS (EKS, Lambda, CloudWatch) and multi-region architecture patterns.
- Proficiency with Terraform, GitOps, and operational depth in Datadog.
- Strong skills in Docker, Kubernetes, Python, and Bash; understanding of Java/Spring Boot microservices.
Nice to have
- Experience designing reliability architecture for agentic AI systems and LLM-dependent services.
- AWS Professional certifications (Solutions Architect or DevOps Engineer).
- FinOps Certified Practitioner or cloud cost management experience at scale.
- Experience in semiconductor, SaaS, or data-intensive platform environments.
Culture & Benefits
- Comprehensive benefits package including health, dental, vision, and wellness.
- Financial perks such as RRSP matching and an annual fitness reimbursement.
- Flexible vacation policy and company-sponsored training and development.
- Inclusive environment prioritizing diversity, equity, and accessibility.
- High-growth environment focused on high performance and innovation.