TL;DR

Site Reliability Engineer Lead (DevOps): Establish an enterprise-grade Site Reliability Engineering (SRE) practice, setting the vision, frameworks, and execution model for reliability, observability, and operational excellence across platforms. You will build and lead a small team of SREs, collaborate with DevSecOps, architecture, and infrastructure teams, and ensure platforms achieve best-in-class uptime and resiliency.

Location: Onsite in Bengaluru

Company

%hirify_global% is a company in the software domain.

What you will do

Define and institutionalize the SRE charter, policies, and operating model across business-critical applications.
Design and implement Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets.
Create playbooks for incident response, escalation, and blameless postmortems.
Architect and implement an enterprise observability stack across applications, databases, networks, and cloud/on-prem infrastructure.
Lead initiatives for capacity planning, chaos engineering, failover testing, and resilience validation.
Collaborate with application, DevSecOps, security, and infrastructure teams to embed SRE practices into the SDLC.

Requirements

Strong hands-on experience in hyperscaler services and on-prem workloads.
Expert-level knowledge of leading monitoring and observability tools, including configuration, agent deployment, instrumentation, and dashboard building.
Proficiency in Python, PowerShell, Ansible, Terraform, and CI/CD integration.
Knowledge of microservices, containers (Kubernetes, Docker), message queues, and databases.
Proven ability to lead incident response, perform RCA, and design proactive reliability measures.
Understanding of regulatory requirements and of how to embed compliance into monitoring and observability frameworks.

Culture & Benefits

Always prioritizes stability, resilience, and uptime while balancing innovation and delivery speed.
Makes data-driven decisions, using metrics, dashboards, and SLIs to guide prioritization, escalation, and improvements.
Embraces iterative enhancements, blameless postmortems, and learning from failures.
Works seamlessly with application, DevSecOps, infrastructure, and SI/vendor teams to align goals and drive SRE adoption.
Brings a customer-centric reliability mindset, framing SLOs in terms of customer and business impact, not just system metrics.
Demonstrates a calm, structured approach during incidents and high-severity outages.