Company hidden
2 days ago

Staff Site Reliability Engineer (Incident Management)

133,700 - 248,300 CAD
Work format
remote (Canada only)
Employment type
full-time
Grade
senior
English
B2
Country
Canada
Vacancy from Hirify Global, a list of international tech companies

Job description

TL;DR

Staff Site Reliability Engineer (SRE/Incident Management): Driving proactive reliability improvements and incident response strategies for a multi-cloud streaming platform, with an emphasis on systemic failure analysis and automation. Focus on building reliability tooling, defining SLO/SLA frameworks, and coaching teams through post-mortems to reduce incident recurrence.

Location: Remote (Canada). Must have the ability to work in Canada without sponsorship.

Salary: $133,700 – $248,300 CAD per year

Company

hirify.global Software builds AI-powered, cloud-native products that drive digital transformation for global businesses.

What you will do

  • Analyze systemic failure patterns and design reliability improvements to prevent incident recurrence.
  • Own and optimize incident management tooling, including Rootly, PagerDuty, Jira, and Slack integrations.
  • Define and maintain SLO/SLA frameworks, utilizing error budgets to prioritize reliability investments.
  • Lead the evolution of incident response standards and practices across the engineering organization.
  • Review and edit customer-facing incident documents (CRCAs) to ensure clarity and quality.
  • Develop training programs and coach engineering teams through the post-mortem process.

Requirements

  • 10+ years of relevant experience in SRE, incident management, or reliability engineering.
  • Professional experience with at least one major cloud provider: AWS, GCP, or Azure.
  • Experience managing reliability programs within organizations of 500+ engineers.
  • Deep expertise with incident management tools such as Rootly or PagerDuty.
  • Strong understanding of distributed systems and failure modes at scale.
  • Must have the ability to work in Canada without sponsorship.

Nice to have

  • Expertise in Kafka or event streaming technologies.
  • Advanced knowledge of cloud-based infrastructure and resiliency engineering.
  • Proficiency in scripting languages and automation tools to optimize system performance.

Culture & Benefits

  • Global team structure with follow-the-sun coverage to ensure sustainable working hours.
  • Culture of curiosity, collaboration, and continuous learning.
  • Environment that encourages experimentation and professional growth.
  • Commitment to equal opportunity and inclusivity.
