Senior Site Reliability Engineer

190 000 - 206 000$

Формат работы

remote (только USA)/onsite

Тип работы

fulltime

Грейд

senior

Английский

Страна

Вакансия из Hirify Global, списка международных tech-компаний
Для мэтча и отклика нужен Plus

Мэтч & Сопровод

Для мэтча с этой вакансией нужен Plus

Описание вакансии

Текст:

TL;DR

Senior Site Reliability Engineer: Design and operate highly scalable, fault-tolerant systems in a distributed cloud environment with an accent on reliability, observability, and automation. Focus on defining SLOs/SLIs, building observability platforms, automating toil reduction, and driving incident response improvements for production workloads.

San Francisco, CA or Remote (USA)

Base Salary $190K – $206K • Offers Equity

Company

hirify.global builds software to automate and streamline assurance and audit work in cybersecurity, privacy, and financial audit, enabling trust in global commerce and capital markets.

What you will do

Design and operate scalable, fault-tolerant systems supporting production workloads in distributed cloud environments.
Define and implement SLOs, SLIs, and error budgets to guide reliability decisions.
Build and improve observability systems including metrics, logs, and tracing.
Lead reliability improvements through capacity planning, load testing, performance tuning, and automation of operational processes.
Partner with engineering teams to embed reliability and scalability in system design from the start.
Participate in incident response, on-call rotations, postmortems, disaster recovery, and chaos testing.
Establish best practices for monitoring, alerting, and incident management.

Requirements

5+ years in site reliability engineering, infrastructure, or related software engineering.
Strong experience operating and scaling distributed systems in cloud environments, AWS preferred.
Hands-on with observability platforms (Datadog, Prometheus, Grafana, CloudWatch).
Experience defining SLOs/SLIs to drive engineering priorities.
Proficiency with Infrastructure as Code (Terraform or equivalent).
Deep knowledge of system performance, reliability patterns, and distributed failure modes.
Experience with production support, on-call rotations, and incident response.
Proficiency in at least one programming/scripting language for automation.
Strong communication and collaboration skills across teams.

Nice to have

Experience with distributed tracing (OpenTelemetry).
Capacity planning and performance benchmarking at scale.
Database performance tuning and observability in high-traffic systems.
Exposure to regulated environments (SOC 2, FedRAMP).
Chaos engineering practices.

Culture & Benefits

Remote-first company with flexible PTO and work schedules.
Competitive compensation with equity and ownership.
401k and wellness benefits including free therapy sessions.
Technology and work-from-home reimbursements.
Values: Fearless, Fast, Lovable, Owners, Win-win, Inclusive.
Inclusive, driven, humble, and supportive team culture focused on growth.

Будьте осторожны: если работодатель просит войти в их систему, используя iCloud/Google, прислать код/пароль, запустить код/ПО, не делайте этого - это мошенники. Обязательно жмите "Пожаловаться" или пишите в поддержку. Подробнее в гайде →