Site Reliability Engineer

63 000 - 85 000€

Work format

remote (USA only)

Employment type

full-time

English

Vacancy from Hirify Global, a list of international tech companies

Job description

TL;DR

Site Reliability Engineer (DevOps): Own and evolve observability and reliability systems for a large-scale distributed platform, with an emphasis on metrics, logs, tracing, and incident response. Focus on designing SLIs, SLOs, and error budgets, and on building self-service tooling to improve system reliability and visibility.

Company

hirify.global builds a safe and sustainable marketplace for gamers with over 20 million active users, focusing on trust, safety, and market accessibility.

What you will do

Own and improve the observability stack using Prometheus, Thanos, Alertmanager, Loki, Sentry, Grafana, and AWS services.
Design and maintain SLIs, SLOs, and error budgets to meet reliability objectives.
Enhance system visibility to reduce MTTR and improve incident response.
Build self-service capabilities for metrics, alerts, dashboards, and instrumentation.
Collaborate with Backend, DevOps, and Platform teams to embed reliability and observability from the design phase.
Support incident investigations and contribute to blameless postmortems.

Requirements

Good English proficiency required.
Hands-on experience with Prometheus, Alertmanager, Grafana, Loki, Sentry, or equivalents.
Experience with Thanos or large-scale metrics systems and tuning.
Strong understanding of SLIs, SLOs, error budgets, MTTR, and incident response workflows.
Experience with Kubernetes production monitoring and Infrastructure as Code (Terraform preferred).
Proficiency in scripting/programming (Go, Python, Bash) and AWS monitoring.

Nice to have

Experience designing or operating Thanos at scale.
Building self-service observability tooling or dashboards-as-code.
Knowledge of alert fatigue reduction and high-quality alerting patterns.
Experience with resilience testing, fault injection, and chaos engineering.
Familiarity with service meshes and service-level reliability patterns.
Background in multi-region or global-scale systems telemetry.

Culture & Benefits

Employee Stock Options program.
Performance-based bonuses, referral bonuses, additional paid leave, personal learning budget.
Paid volunteering opportunities.
Flexible work location: office, remote, or work and travel.
Strong focus on personal and professional growth with feedback and promotion processes.