TL;DR
Senior Site Reliability Engineer: Leading the design of scalable, fault-tolerant, and self-healing systems in a multi-region AWS environment, with an emphasis on defining SLOs/SLIs and implementing long-term preventive measures. Focus on developing internal automation tools, building deep observability, and proactively mitigating operational risks through chaos engineering.
Location: Remote (global, work-from-anywhere stipend)
Company
hirify.global is the world’s first eSIM store that helps people connect in more than 200 countries and regions across the globe, aiming to revolutionize the telecom industry.
What you will do
- Lead the design of scalable, fault-tolerant, and self-healing systems in a multi-region AWS environment.
- Define and track Service Level Objectives (SLOs) and Service Level Indicators (SLIs) to drive architectural decisions and error budget policies.
- Conduct blameless post-incident reviews to uncover systemic root causes and implement long-term preventive measures.
- Develop internal tools and automation to permanently eliminate patterns of manual work.
- Shift from simple monitoring to deep observability, ensuring high-cardinality data leads to proactive, actionable insights.
- Proactively identify and mitigate operational risks through chaos engineering and architecture reviews.
- Refine the on-call experience to reduce alert fatigue, improve MTTR, and ensure sustainable rotation health.
Requirements
- Bachelor’s degree in Computer Engineering or a similar discipline.
- 5+ years of experience as a Site Reliability Engineer or in a similar role.
- 3+ years of experience with AWS services, including strong knowledge of container orchestration.
- 2+ years of Kubernetes experience.
- Deep understanding of observability principles and tools like Prometheus, Datadog, or OpenTelemetry.
- Experience leading incident management and complex postmortem analysis.
- Experience and interest in managing Infrastructure as Code (Terraform) and CI/CD tools such as GitHub Actions.
- Proficiency in at least one programming language (Python, Go, Java, etc.) for building automation and internal tooling.
- Experience with event-driven architecture (SNS, SQS, etc.).
- Good communication skills and fluency in English.
- Participation in the on-call rotation is a core expectation of this role, with no on-call duties for the first 6 months.
Nice to have
- Prior experience with Scrum and other agile methods.
- Certification in relevant areas such as AWS Certified DevOps Engineer or Certified Kubernetes Administrator (CKA).
- Prior experience with Telco Core Networks (e.g., 5G/LTE Packet Core, IMS, Signaling) and low-latency networking.
- Experience with AI-driven SRE tools for anomaly detection and improvements.
- Deep understanding of eSIM and GSMA-related technologies and services.
Culture & Benefits
- Remote-first environment with a work-from-anywhere stipend.
- Health insurance, plus annual wellness and learning credits.
- Annual all-expenses-paid company retreat in a gorgeous destination.
- Company values SRE principles, data-driven decisions, and automation.
- Fosters a blameless culture where everyone is encouraged to learn from mistakes and share knowledge.
- Paid on-call rotation with standby fees + overtime pay, guaranteed rest periods, and flexible hours following night incidents.