Site Reliability Engineer Lead (DevOps)

Work format

remote (India only)

Employment type

fulltime

Grade

lead

English

Country

India

Job description

Text:

TL;DR

Site Reliability Engineer Lead (DevOps): Establish and institutionalize enterprise-grade SRE practices and observability for business-critical applications, with an emphasis on defining SLOs, implementing monitoring stacks, and leading incident response. The role focuses on architecting resilient systems, driving operational excellence, and ensuring best-in-class uptime and customer experience for the Gold and SME platforms.

Location: Remote from India (due to regulatory requirements such as RBI, CERT-IN, and the DPDP Act)

Company

hirify.global is seeking an SRE Lead Engineer to establish an enterprise-grade Site Reliability Engineering practice across IIFL Finance's platforms.

What you will do

Define and institutionalize the SRE charter, policies, and operating model across business-critical applications.
Design and implement service level objectives (SLOs), service level indicators (SLIs), and error budgets.
Architect and implement an enterprise observability stack across applications, databases, networks, and hybrid infrastructure.
Lead initiatives for capacity planning, chaos engineering, failover testing, and resilience validation.
Collaborate with application, DevSecOps, security, and infrastructure teams to embed SRE practices in the SDLC.
Build and lead a small team of SREs.

Requirements

7+ years of hands-on experience with hyperscale cloud services (e.g., AWS, AKS, Azure Monitor) and on-prem workloads.
Expert-level knowledge of logging, metrics (e.g., Datadog, AppDynamics, Prometheus/Grafana), tracing, and incident analytics at scale.
Proficiency in Python, PowerShell, Ansible, Terraform, and CI/CD integration.
Strong knowledge of microservices, containers (Kubernetes, Docker), message queues, and databases.
Proven ability to lead incident response, perform RCA, and design proactive reliability measures.
Understanding of Indian regulatory requirements (e.g., RBI, CERT-IN, and the DPDP Act) is required.

Culture & Benefits

Prioritize stability, resilience, and uptime while balancing innovation and delivery speed.
Embrace data-driven decision making, iterative enhancements, and blameless postmortems.
Work seamlessly with application, DevSecOps, and infrastructure teams to align goals.
Focus on customer-centric reliability, framing SLIs in terms of business impact.
Opportunity to define the roadmap, mentor junior SREs, and drive enterprise-wide adoption of best practices.