Senior Site Reliability Engineer (Observability)

Work format

hybrid

Employment type

full-time

Grade

senior

English

Country

Vacancy from Hirify Global, a list of international tech companies

Job description

TL;DR

Senior Site Reliability Engineer (Observability): Own the reliability, scalability, and continued evolution of the observability platforms, with an emphasis on ELK (Elasticsearch, Logstash, Kibana), Grafana, and incident-driven operations. Focus on maintaining SLOs, reducing toil through automation, and modernizing platform components using infrastructure-as-code.

Location: Austin

Company

hirify.global builds and operates platforms that support engineering visibility and reliability.

What you will do

Act as the primary escalation point for production support across the ELK Stack, Grafana, and New Relic.
Own platform health, capacity planning, and performance tuning for on-premises observability infrastructure (Elasticsearch cluster management, index lifecycle, retention).
Monitor and maintain SLOs, and support engineering onboarding with instrumentation, dashboards, and alert definitions.
Manage patching, upgrades, and configuration management across the observability stack, including collaboration with security on hardening and vulnerability management.
Contribute to platform engineering by designing and building automation/tooling to reduce toil and improve platform experience.
Develop and maintain infrastructure-as-code (Terraform, Helm, Ansible, etc.) and help standardize logging/metrics/alerting practices at scale.

Requirements

5+ years of experience in SRE, DevOps, or platform engineering roles.
Deep hands-on experience with the ELK Stack, including Elasticsearch cluster operations, Logstash pipeline development, Kibana, and index lifecycle management.
Strong experience with Grafana, including data source integrations, dashboard design, and alerting.
Solid understanding of observability principles and experience operating on-premises infrastructure (capacity planning and operational tradeoffs vs managed cloud).
Proficiency in Python for automation and tooling, plus familiarity with shell scripting.
Strong Linux systems knowledge and comfort with configuration management tools (e.g., Ansible, Chef, Puppet).

Culture & Benefits

Hybrid role with roughly half the time on steady-state operations and platform support, and half on engineering projects.
Benefits and educational initiatives, plus special celebrations of company history, culture, and growth.
Equal opportunity employer.

Hiring process

Interviews focused on SRE/observability ownership, incident resolution, and platform modernization/automation experience.
Discussion of how experience maps to ELK/Grafana operations and infrastructure-as-code practices.
