Senior Site Reliability Engineer
Job description
TL;DR
Senior Site Reliability Engineer: Design and operate highly scalable, fault-tolerant systems in a distributed cloud environment, with an emphasis on reliability, observability, and automation. Focus on defining SLOs/SLIs, building observability platforms, automating away toil, and improving incident response for production workloads.
San Francisco, CA or Remote (USA)
Base Salary $190K – $206K • Offers Equity
Company
The company builds software to automate and streamline assurance and audit work in cybersecurity, privacy, and financial audit, enabling trust in global commerce and capital markets.
What you will do
- Design and operate scalable, fault-tolerant systems supporting production workloads in distributed cloud environments.
- Define and implement SLOs, SLIs, and error budgets to guide reliability decisions.
- Build and improve observability systems including metrics, logs, and tracing.
- Lead reliability improvements through capacity planning, load testing, performance tuning, and automation of operational processes.
- Partner with engineering teams to embed reliability and scalability in system design from the start.
- Participate in incident response, on-call rotations, postmortems, disaster recovery, and chaos testing.
- Establish best practices for monitoring, alerting, and incident management.
Requirements
- 5+ years of experience in site reliability engineering, infrastructure, or a related software engineering field.
- Strong experience operating and scaling distributed systems in cloud environments, AWS preferred.
- Hands-on experience with observability platforms (Datadog, Prometheus, Grafana, CloudWatch).
- Experience defining SLOs/SLIs to drive engineering priorities.
- Proficiency with Infrastructure as Code (Terraform or equivalent).
- Deep knowledge of system performance, reliability patterns, and distributed failure modes.
- Experience with production support, on-call rotations, and incident response.
- Proficiency in at least one programming/scripting language for automation.
- Strong communication and collaboration skills across teams.
Nice to have
- Experience with distributed tracing (OpenTelemetry).
- Capacity planning and performance benchmarking at scale.
- Database performance tuning and observability in high-traffic systems.
- Exposure to regulated environments (SOC 2, FedRAMP).
- Chaos engineering practices.
Culture & Benefits
- Remote-first company with flexible PTO and work schedules.
- Competitive compensation with equity and ownership.
- 401k and wellness benefits including free therapy sessions.
- Technology and work-from-home reimbursements.
- Values: Fearless, Fast, Lovable, Owners, Win-win, Inclusive.
- Inclusive, driven, humble, and supportive team culture focused on growth.