Company hidden
1 day ago

Manager, Site Reliability Engineering

Work format
hybrid
Employment type
full-time
Grade
lead
English
B2

Job description

TL;DR

Manager, Site Reliability Engineering (SRE/DevOps): Lead a team of Site Reliability Engineers responsible for the reliability, scalability, and performance of hirify.global systems, with an emphasis on multi-cloud reliability strategy, incident response, and automation. The role focuses on building SLO/SLI/SLA practices, improving observability and deployment processes, and strengthening infrastructure resilience (high availability and disaster recovery), with continuous learning through RCA and post-mortems.

Location: Sofia, Bulgaria (Hybrid)

Company

hirify.global is a travel technology company powering intelligent offer and revenue optimization for airlines.

What you will do

  • Lead and mentor the SRE team, driving reliability, accountability, and continuous improvement.
  • Develop and implement strategies for multi-cloud reliability, monitoring, and incident response.
  • Drive automation for deployment processes, infrastructure as code (IaC), and operational efficiency.
  • Manage observability tooling for logging, metrics, and alerting; establish SLOs/SLIs/SLAs.
  • Oversee root cause analysis (RCA) and post-mortems to improve systems and processes.
  • Ensure high availability and disaster recovery strategies are in place and regularly tested; optimize cloud infrastructure costs.

Requirements

  • 7+ years of experience in software engineering, SRE, or DevOps, including 3+ years in a managerial or leadership role.
  • Strong cloud platform knowledge (Azure, AWS, IBM Cloud) and containerization (Docker, Kubernetes).
  • Proficiency with automation and configuration management tools (Terraform, Ansible, Puppet, The Foreman).
  • Experience with monitoring and observability tools (e.g., Prometheus, Grafana, PagerDuty, Graylog).
  • Solid programming/scripting skills in Python, Go, Bash, or similar languages.
  • Expertise in CI/CD pipelines and modern deployment strategies; strong analytical and problem-solving skills.

Nice to have

  • Experience with large-scale distributed systems.
  • Knowledge of networking, security, and compliance best practices.
  • Experience with incident response and the ITIL framework.
  • Background in high-availability, customer-facing production environments.

Culture & Benefits

  • Flexible ways of working with a hybrid setup.
  • Culture focused on ownership, innovation, and care.
  • Continuous learning and support to grow and innovate.
  • Collaboration between software development and operations teams.

Hiring process

  • Interviews to assess leadership, SRE/DevOps experience, and technical depth across reliability, automation, and observability.
  • Discussion of collaboration approach and experience improving production reliability through incident management and RCA.
