Company hidden
7 hours ago

Senior Site Reliability Engineer

Work format
remote (Global)
Employment type
full-time
Grade
senior
English
C1
Country
UK, Spain
Vacancy from Hirify.Global, a list of international tech companies

Job description

TL;DR

Senior Site Reliability Engineer: lead the design of scalable, fault-tolerant, and self-healing systems in a multi-region AWS environment, with an emphasis on defining SLOs/SLIs and implementing long-term preventive measures. The role focuses on developing internal automation tools, building deep observability, and proactively mitigating operational risks through chaos engineering.

Location: Remote (global, work-from-anywhere stipend)

Company

hirify.global is the world’s first eSIM store that helps people connect in 200+ countries and regions across the globe, aiming to revolutionize the telecom industry.

What you will do

  • Lead the design of scalable, fault-tolerant, and self-healing systems in a multi-region AWS environment.
  • Define and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to drive architectural decisions and error budget policies.
  • Conduct blameless post-incident reviews to uncover systemic root causes and implement long-term preventive measures.
  • Develop internal tools and automation to permanently eliminate patterns of manual work.
  • Shift from simple monitoring to deep observability, ensuring high-cardinality data leads to proactive, actionable insights.
  • Proactively identify and mitigate operational risks through chaos engineering and architecture reviews.
  • Refine the on-call experience to reduce alert fatigue, improve MTTR, and ensure sustainable rotation health.

Requirements

  • Bachelor’s degree in Computer Engineering or a similar discipline.
  • 5+ years of experience as a Site Reliability Engineer or in a similar role.
  • 3+ years of experience with AWS services, including strong knowledge of container orchestration.
  • 2+ years of Kubernetes experience.
  • Deep understanding of observability principles and tools like Prometheus, Datadog, or OpenTelemetry.
  • Experience leading incident management and complex postmortem analysis.
  • Experience and interest in managing Infrastructure as Code (Terraform) and CI/CD tools such as GitHub Actions.
  • Proficiency in at least one programming language (Python, Go, Java, etc.) for building automation and internal tooling.
  • Event-driven architecture experience (SNS, SQS, etc.).
  • Good communication skills and fluency in English.
  • Participation in the on-call rotation is a core expectation of this role, with no on-call duties for the first 6 months.

Nice to have

  • Prior experience with Scrum and other agile methods.
  • Certification in relevant areas such as AWS Certified DevOps Engineer or Certified Kubernetes Administrator (CKA).
  • Prior experience with Telco Core Networks (e.g., 5G/LTE Packet Core, IMS, Signaling) and low-latency networking.
  • Experience with AI-driven SRE tools for anomaly detection and improvements.
  • Deep understanding of eSIM and GSMA related technologies and services.

Culture & Benefits

  • Remote-first environment with a work-from-anywhere stipend.
  • Health insurance, plus annual wellness and learning credits.
  • Annual all-expenses-paid company retreat in a gorgeous destination.
  • Company values SRE principles, data-driven decisions, and automation.
  • Fosters a blameless culture where everyone is encouraged to learn from mistakes and share knowledge.
  • Paid on-call rotation with standby fees + overtime pay, guaranteed rest periods, and flexible hours following night incidents.
