Staff Site Reliability Engineer (Incident Management)
Job description
TL;DR
Staff Site Reliability Engineer (SRE/Incident Management): Drive proactive reliability improvements and incident response strategy for a multi-cloud streaming platform, with an emphasis on systemic failure analysis and automation. The role focuses on building reliability tooling, defining SLO/SLA frameworks, and coaching teams through post-mortems to reduce incident recurrence.
Location: Remote (Canada). Must be able to work in Canada without sponsorship.
Salary: $133,700 – $248,300 per year
Company
Software builds AI-powered, cloud-native products that drive digital transformation for global businesses.
What you will do
- Analyze systemic failure patterns and design reliability improvements to prevent incident recurrence.
- Own and optimize incident management tooling, including Rootly, PagerDuty, Jira, and Slack integrations.
- Define and maintain SLO/SLA frameworks, utilizing error budgets to prioritize reliability investments.
- Lead the evolution of incident response standards and practices across the engineering organization.
- Review and edit customer-facing incident documents (CRCAs) to ensure clarity and quality.
- Develop training programs and coach engineering teams through the post-mortem process.
Requirements
- 10+ years of relevant experience in SRE, incident management, or reliability engineering.
- Professional experience with at least one major cloud provider: AWS, GCP, or Azure.
- Experience managing reliability programs within organizations of 500+ engineers.
- Deep expertise with incident management tools such as Rootly or PagerDuty.
- Strong understanding of distributed systems and failure modes at scale.
- Must have the ability to work in Canada without sponsorship.
Nice to have
- Expertise in Kafka or event streaming technologies.
- Advanced knowledge of cloud-based infrastructure and resiliency engineering.
- Proficiency in scripting languages and automation tools to optimize system performance.
Culture & Benefits
- Global team structure with follow-the-sun coverage to ensure sustainable working hours.
- Culture of curiosity, collaboration, and continuous learning.
- Environment that encourages experimentation and professional growth.
- Commitment to equal opportunity and inclusivity.