TL;DR
Site Reliability Engineer Lead (DevOps): Establish an enterprise-grade Site Reliability Engineering (SRE) practice, setting the vision, frameworks, and execution model for reliability, observability, and operational excellence across platforms. The role focuses on building and leading a small team of SREs, collaborating with DevSecOps, architecture, and infrastructure teams, and ensuring platforms achieve best-in-class uptime and resiliency.
Location: Onsite in Bengaluru
Company
hirify.global is a company in the software domain.
What you will do
- Define and institutionalize the SRE charter, policies, and operating model across business-critical applications.
- Design and implement Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets.
- Create playbooks for incident response, escalation, and blameless postmortems.
- Architect and implement an enterprise observability stack across applications, databases, networks, and cloud/on-prem infrastructure.
- Lead initiatives for capacity planning, chaos engineering, failover testing, and resilience validation.
- Collaborate with application, DevSecOps, security, and infrastructure teams to embed SRE practices into the SDLC.
Requirements
- Strong hands-on experience in hyperscaler services and on-prem workloads.
- Expert-level knowledge of leading observability tools, including configuration, agent deployment, instrumentation, and dashboard building.
- Proficiency in Python, PowerShell, Ansible, Terraform, and CI/CD integration.
- Knowledge of microservices, containers (Kubernetes, Docker), message queues, and databases.
- Proven ability to lead incident response, perform RCA, and design proactive reliability measures.
- Understanding of regulatory requirements and of how to embed compliance into monitoring and observability frameworks.
Culture & Benefits
- Always prioritizes stability, resilience, and uptime while balancing innovation and delivery speed.
- Makes data-driven decisions, using metrics, dashboards, and SLIs to guide prioritization, escalation, and improvements.
- Embraces iterative enhancements, blameless postmortems, and learning from failures.
- Works seamlessly with application, DevSecOps, infrastructure, and SI/vendor teams to align goals and drive SRE adoption.
- Brings a customer-centric reliability mindset, framing SLOs in terms of customer and business impact, not just system metrics.
- Demonstrates a calm, structured approach during incidents and high-severity outages.