TL;DR
Staff Site Reliability Engineer (AWS): Improving and protecting the reliability, performance, and operability of production systems within an AWS-based infrastructure, with an emphasis on modern SRE practices, end-to-end observability, and incident management. Focus on leading multi-engineer reliability initiatives, designing reliable services, and building tooling and automation for high-quality delivery.
Location: Hybrid in Toronto, Canada (Tuesday-Thursday in-office, Monday/Friday WFH). Also open to remote applicants from specific US states (excluding Alabama, Alaska, Connecticut, Hawaii, Kentucky, Mississippi, Nebraska, New Mexico, North Dakota, Rhode Island, South Dakota, West Virginia, and Wyoming).
Salary: $170,000–$220,000 CAD base salary range + annual bonus
Company
hirify.global is building a software platform that empowers commercial contractors. The company recently reached a $1 billion valuation and has raised over $275M in funding.
What you will do
- Own end-to-end reliability domains, including strategy, roadmap, and execution.
- Drive modern SRE practices across services (SLIs/SLOs, error budgets, reliability reviews).
- Lead multi-sprint, multi-engineer reliability or performance initiatives.
- Design and maintain end-to-end observability (metrics, logs, traces, dashboards, alerts).
- Partner with product and engineering teams to design reliable services and influence system design.
- Participate in production on-call rotations and incident response for high-severity issues.
Requirements
- 8+ years of experience operating complex, user-facing SaaS systems and driving reliability initiatives.
- Proven experience leading multi-sprint, multi-engineer projects with clear business impact.
- Thorough understanding of, and hands-on experience with, modern SRE practices (SLIs/SLOs, toil reduction, safe deployment, post-incident reviews).
- Strong software engineering skills with production-quality code in Python or Node.js/TypeScript.
- Deep expertise in observability (designing metrics, logging, tracing, dashboards, alerts) and with tools such as Datadog, Prometheus, and Grafana.
- Experience with AWS in production, container orchestration (ECS, EKS, Kubernetes), and Infrastructure as Code workflows (Terraform).
- Incident management experience, including participating in or coordinating response and using tools like incident.io or PagerDuty.
- Ability and willingness to participate in a production on-call rotation.
Culture & Benefits
- Generous equity grant and comprehensive benefits package.
- Flexible PTO and hybrid work schedules.
- Work from home stipend.
- Company events like BBQs and team-building activities.
- Opportunities for growth and career advancement.
- Work with cutting-edge technology and innovative solutions.