TL;DR
Site Reliability Engineer (DevOps): Own and evolve observability and reliability systems for a large-scale distributed platform, with an emphasis on metrics, logs, tracing, and incident response. Focus on designing SLIs, SLOs, and error budgets, and on building self-service tooling to improve system reliability and visibility.
Company
hirify.global builds a safe and sustainable marketplace for gamers with over 20 million active users, focusing on trust, safety, and market accessibility.
What you will do
- Own and improve the observability stack using Prometheus, Thanos, Alertmanager, Loki, Sentry, Grafana, and AWS services.
- Design and maintain SLIs, SLOs, and error budgets to meet reliability objectives.
- Enhance system visibility to reduce MTTR and improve incident response.
- Build self-service capabilities for metrics, alerts, dashboards, and instrumentation.
- Collaborate with Backend, DevOps, and Platform teams to embed reliability and observability from the design phase.
- Support incident investigations and contribute to blameless postmortems.
Requirements
- Good English proficiency required.
- Hands-on experience with Prometheus, Alertmanager, Grafana, Loki, and Sentry, or equivalent tools.
- Experience running and tuning Thanos or other large-scale metrics systems.
- Strong understanding of SLIs, SLOs, error budgets, MTTR, and incident response workflows.
- Experience with Kubernetes production monitoring and Infrastructure as Code (Terraform preferred).
- Proficiency in scripting/programming (Go, Python, Bash) and AWS monitoring.
Nice to have
- Experience designing or operating Thanos at scale.
- Building self-service observability tooling or dashboards-as-code.
- Knowledge of alert fatigue reduction and high-quality alerting patterns.
- Experience with resilience testing, fault injection, and chaos engineering.
- Familiarity with service meshes and service-level reliability patterns.
- Background in multi-region or global-scale systems telemetry.
Culture & Benefits
- Employee Stock Options program.
- Performance-based bonuses, referral bonuses, additional paid leave, personal learning budget.
- Paid volunteering opportunities.
- Flexible work location: office, remote, or work and travel.
- Strong focus on personal and professional growth with feedback and promotion processes.