Company hidden
2 days ago

Site Reliability Engineer

63 000 - 85 000
Work format
remote (USA only)
Employment type
full-time
English
B2
Listing from Hirify Global, a list of international tech companies

Job description

TL;DR

Site Reliability Engineer (DevOps): Own and evolve observability and reliability systems for a large-scale distributed platform, with an emphasis on metrics, logs, tracing, and incident response. Focus on designing SLIs, SLOs, and error budgets, and on building self-service tooling to improve system reliability and visibility.

Company

hirify.global builds a safe and sustainable marketplace for gamers with over 20 million active users, focusing on trust, safety, and market accessibility.

What you will do

  • Own and improve the observability stack using Prometheus, Thanos, Alertmanager, Loki, Sentry, Grafana, and AWS services.
  • Design and maintain SLIs, SLOs, and error budgets to meet reliability objectives.
  • Enhance system visibility to reduce MTTR and improve incident response.
  • Build self-service capabilities for metrics, alerts, dashboards, and instrumentation.
  • Collaborate with Backend, DevOps, and Platform teams to embed reliability and observability from the design phase.
  • Support incident investigations and contribute to blameless postmortems.

Requirements

  • Good English proficiency (B2 or higher) required.
  • Hands-on experience with Prometheus, Alertmanager, Grafana, Loki, and Sentry, or equivalent tools.
  • Experience operating and tuning Thanos or comparable large-scale metrics systems.
  • Strong understanding of SLIs, SLOs, error budgets, MTTR, and incident response workflows.
  • Experience with Kubernetes production monitoring and Infrastructure as Code (Terraform preferred).
  • Proficiency in scripting/programming (Go, Python, Bash) and AWS monitoring.

Nice to have

  • Experience designing or operating Thanos at scale.
  • Building self-service observability tooling or dashboards-as-code.
  • Knowledge of alert fatigue reduction and high-quality alerting patterns.
  • Experience with resilience testing, fault injection, or chaos engineering.
  • Familiarity with service meshes and service-level reliability patterns.
  • Background in multi-region or global-scale systems telemetry.

Culture & Benefits

  • Employee Stock Options program.
  • Performance-based bonuses, referral bonuses, additional paid leave, personal learning budget.
  • Paid volunteering opportunities.
  • Flexible work location: office, remote, or work and travel.
  • Strong focus on personal and professional growth with feedback and promotion processes.
