TL;DR
Site Reliability Engineer Lead (DevOps): Establishing and institutionalizing enterprise-grade SRE practices and observability for business-critical applications, with an emphasis on defining SLOs, implementing monitoring stacks, and leading incident response. Focus on architecting resilient systems, driving operational excellence, and ensuring best-in-class uptime and customer experience for Gold and SME platforms.
Location: Remote from India (due to regulatory requirements such as those of RBI, CERT-IN, and the DPDP Act)
Company
hirify.global is seeking an SRE Lead Engineer to establish an enterprise-grade Site Reliability Engineering practice within IIFL Finance's platforms.
What you will do
- Define and institutionalize the SRE charter, policies, and operating model across business-critical applications.
- Design and implement service level objectives (SLOs), service level indicators (SLIs), and error budgets.
- Architect and implement an enterprise observability stack across applications, databases, networks, and hybrid infrastructure.
- Lead initiatives for capacity planning, chaos engineering, failover testing, and resilience validation.
- Collaborate with application, DevSecOps, security, and infrastructure teams to embed SRE practices in the SDLC.
- Build and lead a small team of SREs.
Requirements
- 7+ years of hands-on experience with hyperscale cloud services (e.g., AWS, AKS, Azure Monitor) and on-prem workloads.
- Expert-level knowledge of logging, metrics (e.g., Datadog, AppDynamics, Prometheus/Grafana), tracing, and incident analytics at scale.
- Proficiency in Python, PowerShell, Ansible, Terraform, and CI/CD integration.
- Strong knowledge of microservices, containers (Kubernetes, Docker), message queues, and databases.
- Proven ability to lead incident response, perform RCA, and design proactive reliability measures.
- Understanding of Indian regulatory requirements (e.g., RBI, CERT-IN, DPDP Act) is required.
Culture & Benefits
- Prioritize stability, resilience, and uptime while balancing innovation and delivery speed.
- Embrace data-driven decision making, iterative enhancements, and blameless postmortems.
- Work seamlessly with application, DevSecOps, and infrastructure teams to align goals.
- Focus on customer-centric reliability, framing SLIs in terms of business impact.
- Opportunity to define the roadmap, mentor junior SREs, and drive enterprise-wide adoption of best practices.