Senior Site Reliability Engineer
Job description
TL;DR
Senior Site Reliability Engineer: Build tools, automation, and observability for resilient, high-scale systems that support fan engagement platforms, with an emphasis on metrics, alerting, and incident response. Focus on defining SLIs/SLOs, streamlining CI/CD pipelines, automating reliability checks, and driving operational excellence through blameless postmortems and capacity planning.
Location: Remote (US-based, US work authorization required). Hybrid/flexible work environment.
Company
Growth-stage company providing fan engagement platforms for high school sports, including ticketing, streaming, fundraising, and more, trusted by thousands of US schools.
What you will do
- Assess and improve system visibility by reviewing dashboards, metrics, and logs and implementing targeted enhancements.
- Refine monitoring, alerting, and dashboards for critical services to enable faster issue detection and response.
- Integrate observability and telemetry into build, deploy, and release processes.
- Define SLIs/SLOs for core user flows and align teams on reliability standards.
- Streamline incident response, automate routine tasks, and participate in on-call rotations.
- Partner with engineering teams to implement reliability best practices, release automation, and proactive incident prevention.
Requirements
- Solid experience in Python for automation and operational tasks
- Proficiency in at least one of Java, C++, or Go
- Strong knowledge of Linux, cloud infrastructure (AWS, GCP, Azure), Docker, Kubernetes, Terraform
- Experience with CI/CD pipelines, version control, automated testing, and observability tools (Prometheus, Grafana, ELK, Datadog)
- Proven experience with SLAs/SLOs, critical user journeys, incident facilitation, and cross-functional collaboration
- Problem-solving mindset that treats reliability as a shared responsibility
Nice to have
- Experience with end-to-end/integration tests, performance testing, chaos engineering
- Contributions to developer tooling or reliability frameworks
- Exposure to security, compliance, change management
- Relevant certifications
Culture & Benefits
- Culture focused on accountability, collaboration, growth, and fairness
- Multiple medical, dental, vision, life, and disability insurance plans
- 401(k) with company match, company equity (stock options), Employee Emergency Fund
- Open PTO policy
- Must be a full-time employee to be eligible for health benefits