Senior Manager, Site Reliability Engineering (SRE)

$143,000 - $191,000

Work format

remote (USA only)

Employment type

full-time

Grade

senior/lead

English

Country

Job from Hirify Global, a list of international tech companies

Job description

Text:

TL;DR

Senior Manager, Site Reliability Engineering (SRE): Lead the SRE organization to deliver reliable, scalable, and resilient platforms and services, with an emphasis on owning the strategy, implementation, and continuous improvement of a unified observability platform. The role drives practices around SLIs, SLOs, SLAs, Error Budgets, incident management, and automation, in close collaboration with other teams.

Location: Office Location or Remote - USA

Salary: $143,000 - $191,000 plus bonus

Company

hirify.global is a healthcare business and data automation company that empowers healthcare organizations to enable better patient care and maximize industry savings using its cloud-based supply chain technology exchange platform, solutions, analytics, and services.

What you will do

Hire, lead, and mentor a high-performing SRE team across geographies.
Define and execute the SRE vision, roadmap, and strategy in alignment with business and engineering objectives.
Build and manage a unified observability platform leveraging tools such as New Relic, Datadog, CloudWatch, Prometheus, Grafana, Graylog, and OpenTelemetry.
Define and manage SLIs, SLOs, SLAs, and Error Budgets across services.
Lead major incident response, coordinating communications with executives and stakeholders.
Collaborate with Engineering, Product, Security, Cloud, and DevOps teams to embed SRE practices.

Requirements

12+ years of experience in SRE, Operations, or Infrastructure Engineering, with 5+ years in leadership roles.
Proven expertise in unified observability, monitoring, and alerting across infrastructure, applications, APM, and databases.
Strong knowledge of observability tools including New Relic, Datadog, Prometheus, Grafana, Graylog, CloudWatch, OpenTelemetry, and SolarWinds.
Hands-on experience with incident response, RCA, MTTR/MTTD reduction, and on-call management.
Deep understanding of SLIs, SLOs, SLAs, and Error Budgets.
Strong AWS experience (EC2, ECS, EKS, networking, scaling groups) and hands-on experience with Docker and Kubernetes.
Proficiency in Python, Java, C#, and shell scripting for automation.
Strong leadership, stakeholder management, and communication skills.

Nice to have

Experience in large-scale SaaS or product-driven environments.
Hands-on experience with databases: MongoDB, Elasticsearch, SQL Server, Oracle.
Experience with chaos engineering, resiliency testing, and disaster recovery planning.
Certifications: AWS Solutions Architect / DevOps Engineer, CKAD/CKA.
Experience managing global SRE teams across time zones.

Culture & Benefits

Establish a healthy 24x7 on-call model while promoting team well-being.
Drive a blameless culture through structured postmortems and RCA follow-up actions.
Health, vision, and dental insurance.
Accident and life insurance.
401k matching.
Paid time off and education reimbursement.
