Company hidden
6 days ago

Site Reliability Engineer Lead (DevOps)

Work format
onsite
Employment type
fulltime
Grade
lead
English
b2
Country
India
Vacancy from Hirify RU Global, a list of companies with Eastern European roots

Job description

TL;DR

Site Reliability Engineer Lead (DevOps): Establishing the enterprise-grade Site Reliability Engineering (SRE) practice, setting the vision, frameworks, and execution model for reliability, observability, and operational excellence across platforms. Focus on building and leading a small team of SRE engineers, collaborating with DevSecOps, architecture, and infrastructure teams, and ensuring platforms achieve best-in-class uptime and resiliency.

Location: Onsite in Bengaluru

Company

hirify.global is a company in the software domain.

What you will do

  • Define and institutionalize the SRE charter, policies, and operating model across business-critical applications.
  • Design and implement Service Level Objectives (SLOs), Service Level Indicators (SLIs), and error budgets.
  • Create playbooks for incident response, escalation, and blameless postmortems.
  • Architect and implement an enterprise observability stack across applications, databases, networks, and cloud/on-prem infrastructure.
  • Lead initiatives for capacity planning, chaos engineering, failover testing, and resilience validation.
  • Collaborate with application, DevSecOps, security, and infrastructure teams to embed SRE practices into the SDLC.

Requirements

  • Strong hands-on experience in hyperscaler services and on-prem workloads.
  • Expert-level knowledge of leading observability tools, including configuration, agent deployment, instrumentation, and dashboard building.
  • Proficiency in Python, PowerShell, Ansible, Terraform, and CI/CD integration.
  • Knowledge of microservices, containers (Kubernetes, Docker), message queues, and databases.
  • Proven ability to lead incident response, perform RCA, and design proactive reliability measures.
  • Understanding of regulatory requirements and embedding compliance into monitoring and observability frameworks.

Culture & Benefits

  • Always prioritizes stability, resilience, and uptime while balancing innovation and delivery speed.
  • Data-driven decision making using metrics, dashboards, and SLIs to guide prioritization, escalation, and improvements.
  • Embraces iterative enhancements, blameless postmortems, and learning from failures.
  • Works seamlessly with application, DevSecOps, infrastructure, and SI/vendor teams to align goals and drive SRE adoption.
  • Customer-centric reliability mindset, framing SLOs in terms of customer/business impact, not just system metrics.
  • Demonstrates a calm, structured approach during incidents and high-severity outages.
