Site Reliability Engineer
Job description
TL;DR
Site Reliability Engineer: Maintain and scale production systems for a global IT management platform, with an emphasis on AWS infrastructure, SLO enforcement, and incident response. The role focuses on building automation, managing IaC, and improving observability to ensure high availability for thousands of MSPs.
Location: Must be based in Markham, Ontario
Salary: CAD $115,000–$130,000
Company
A leading provider of AI-powered IT management and cybersecurity software serving MSPs and internal IT organizations worldwide.
What you will do
- Set, monitor, and enforce SLOs, SLIs, and error budgets to maintain system reliability.
- Lead incident response, troubleshooting, and blameless postmortems.
- Build and maintain automated deployment and infrastructure provisioning using Infrastructure as Code.
- Manage cloud and hybrid infrastructure with Terraform or CloudFormation.
- Improve observability through proactive monitoring, alerting, and dashboards.
- Partner with development teams to integrate reliability into the SDLC.
Requirements
- 4 to 5 years of AWS production experience
- Experience with Terraform or CloudFormation
- AWS ECS production experience
- Active on-call rotation experience with incident management
- Working fluency with SLOs, SLIs, and error budgets
Nice to have
- Kubernetes production experience
- Observability tooling (Datadog, Dynatrace, CloudWatch, ELK)
- Chaos engineering experience
- AWS Lambda or serverless workloads
- DevSecOps experience (SOC2, ISO 27001)
Culture & Benefits
- High-growth, high-performance environment
- Focus on innovation and results
- Opportunities to work with large-scale global infrastructure