TL;DR
Senior Manager, Site Reliability Engineering (SRE): Lead the SRE organization to deliver reliable, scalable, and resilient platforms and services, with an emphasis on owning the strategy, implementation, and continuous improvement of a unified observability platform. Drive practices around SLIs, SLOs, SLAs, error budgets, incident management, and automation while collaborating closely across teams.
Location: Office Location or Remote - USA
Salary: $143,000 - $191,000 plus bonus
Company
hirify.global is a healthcare business and data automation company that empowers healthcare organizations to enable better patient care and maximize industry savings using its cloud-based supply chain technology exchange platform, solutions, analytics, and services.
What you will do
- Hire, lead, and mentor a high-performing SRE team across geographies.
- Define and execute the SRE vision, roadmap, and strategy in alignment with business and engineering objectives.
- Build and manage a unified observability platform leveraging tools such as New Relic, Datadog, CloudWatch, Prometheus, Grafana, Graylog, and OpenTelemetry.
- Define and manage SLIs, SLOs, SLAs, and Error Budgets across services.
- Lead major incident response, coordinating communications with executives and stakeholders.
- Collaborate with Engineering, Product, Security, Cloud, and DevOps teams to embed SRE practices.
Requirements
- 12+ years of experience in SRE, Operations, or Infrastructure Engineering, with 5+ years in leadership roles.
- Proven expertise in unified observability, monitoring, and alerting across infrastructure, applications, APM, and databases.
- Strong knowledge of observability tools including New Relic, Datadog, Prometheus, Grafana, Graylog, CloudWatch, OpenTelemetry, and SolarWinds.
- Hands-on experience with incident response, RCA, MTTR/MTTD reduction, and on-call management.
- Deep understanding of SLIs, SLOs, SLAs, and Error Budgets.
- Strong AWS experience (EC2, ECS, EKS, networking, Auto Scaling groups) and hands-on experience with Docker and Kubernetes.
- Proficiency in Python, Java, C#, and shell scripting for automation.
- Strong leadership, stakeholder management, and communication skills.
Nice to have
- Experience in large-scale SaaS or product-driven environments.
- Hands-on experience with databases: MongoDB, Elasticsearch, SQL Server, Oracle.
- Experience with chaos engineering, resiliency testing, and disaster recovery planning.
- Certifications: AWS Solutions Architect / DevOps Engineer, CKAD/CKA.
- Experience managing global SRE teams across time zones.
Culture & Benefits
- Establish a healthy 24x7 on-call model while promoting team well-being.
- Drive a blameless culture through structured postmortems and RCA follow-up actions.
- Health, vision, and dental insurance.
- Accident and life insurance.
- 401(k) matching.
- Paid time off and education reimbursement.