Company hidden
3 days ago

Senior Site Reliability Engineer (Observability)

Work format
remote (USA only)
Employment type
full-time
Grade
senior
English
B2
Country
US/Argentina/Canada
Job from Hirify RU Global, a list of companies with Eastern European roots

Job description

TL;DR

Senior Site Reliability Engineer (Observability): Own and evolve hirify.global's observability stack, including OpenTelemetry and Datadog, to provide reliable, actionable metrics, traces, and logs across services, with an emphasis on AI-powered agents and automation. Drive adoption of SLOs, distributed tracing, and structured logging, and improve incident response processes.

Location: Remote-first (United States; BC & ON, Canada; Argentina)

Company

Building the world’s leading AI-native Digital Experience Platform as a remote-first company.

What you will do

  • Join the Observability team to ensure engineers have tools, data, and practices for application and hosting health.
  • Dive into the main hirify.global application in TypeScript, Node, or Go to debug and fix production issues.
  • Build and maintain AI-powered agents to surface insights, reduce alert fatigue, and accelerate incident resolution.
  • Guide teams on instrumentation, SLOs, tracing, and logging for confident production deployments.
  • Participate in on-call, incident response, and automate workflows to reduce toil.
  • Partner with engineering teams to define and improve observability practices.

Requirements

  • Business-level fluency in reading, writing, and speaking English
  • BS/BA or relevant experience
  • 5+ years building, maintaining, and debugging distributed systems in customer-facing environments
  • Hands-on with observability tools like Datadog, Grafana, Prometheus, Elasticsearch
  • Experience with OpenTelemetry or similar for metrics, traces, profiles, logs
  • SLOs/SLIs at scale, AWS/GCP, container architectures (Docker, Kubernetes), IaC (Terraform, Pulumi)
  • Full-stack contributions with React, Node.js, MongoDB/PostgreSQL

Nice to have

  • AI agents for observability data (root cause analysis, alerting, querying)
  • Specific experience with OpenTelemetry, Kubernetes, and Pulumi
  • Improving on-call and incident response

Culture & Benefits

  • Equity (RSUs) for permanent employees
  • Comprehensive health, dental, vision coverage
  • Paid parental leave (12 weeks), family planning support
  • Flexible vacation, holidays, sabbatical
  • Mental health resources, wellness stipends
  • 401(k) match (US), retirement support globally, annual bonus
