Company hidden
6 days ago

Senior Site Reliability Engineer (Observability)

Work format: hybrid
Employment type: full-time
Grade: senior
English: B2
Country: US
Listing from Hirify.Global, a list of international tech companies

Job description

TL;DR

Senior Site Reliability Engineer (Observability): Responsible for the reliability, scalability, and evolution of on-premises observability platforms, including the ELK Stack and Grafana, with an emphasis on operations ownership and platform engineering. The role focuses on managing cluster health, capacity planning, building automation tooling, and enforcing observability best practices across engineering teams.

Location: Hybrid in Austin or Charlotte (US)

Company

Asset management firm with comprehensive benefits, educational initiatives, and a focus on equal opportunity.

What you will do

  • Serve as primary escalation point for production support on ELK Stack, Grafana, and New Relic.
  • Own platform health, capacity planning, performance tuning, and SLO monitoring for observability infrastructure.
  • Support engineering teams with onboarding, instrumentation, dashboards, and alerts.
  • Design and build automation tooling to reduce toil and improve platform experience.
  • Lead modernization initiatives such as improving ingestion pipelines, scaling the platform, and standardizing dashboards and alerts.
  • Develop infrastructure-as-code with Terraform, Helm, and Ansible, and contribute to the platform roadmap.

Requirements

  • Bachelor’s degree in a technical field or equivalent experience.
  • 5+ years in SRE, DevOps, or platform engineering.
  • Deep hands-on experience with the ELK Stack: Elasticsearch operations, Logstash pipelines, Kibana, and index lifecycle management.
  • Strong Grafana experience: data sources, dashboards, alerting.
  • Knowledge of observability principles, on-premises infrastructure operations, and capacity planning.
  • Proficiency in Python for automation, Linux systems, and configuration management (e.g., Ansible).
  • Incident resolution and clear communication under pressure; bias toward automation.

Nice to have

  • Prometheus experience.
  • New Relic administration or APM.
  • Log shipping agents (Beats, Fluentd, Fluent Bit).
  • Distributed tracing (OpenTelemetry).
  • Cloud observability and hybrid strategies.
  • Governing observability standards in large organizations.

Culture & Benefits

  • Hybrid work model balancing operations and engineering projects.
  • On-call rotations with runbooks and escalation procedures.
  • Comprehensive benefits, educational initiatives, and career support programs.
  • Focus on excellence, automation, and relentless platform improvement.
