Senior Site Reliability Engineer (AWS)
Job description
TL;DR
Senior Site Reliability Engineer (AWS): Designing and scaling fault-tolerant, self-healing systems for a global eSIM platform, with an emphasis on observability, automation, and multi-region AWS infrastructure. Focus on eliminating manual toil, improving system reliability through chaos engineering, and refining the on-call experience.
Location: Fully remote, but must be based in Spain
Company
The company is a global eSIM platform that enables millions of travellers to connect instantly worldwide through high-scale systems and carrier integrations.
What you will do
- Lead the design of scalable, fault-tolerant, and self-healing systems within a multi-region AWS environment.
- Define and track SLOs and SLIs to guide architectural decisions and manage error budget policies.
- Develop internal tools and automation to permanently eliminate manual operational tasks.
- Implement deep observability using Prometheus, Datadog, and OpenTelemetry to generate proactive insights.
- Identify and mitigate operational risks through chaos engineering and rigorous architecture reviews.
- Collaborate with software engineers to ensure reliability, scalability, and maintainability from the early stages of the SDLC.
Requirements
- Bachelor’s degree in Computer Engineering or a similar discipline.
- 5+ years of experience as a Site Reliability Engineer or in a similar role.
- 3+ years of experience with AWS services and 2+ years of experience with Kubernetes.
- Proficiency in at least one programming language such as Python, Go, or Java for automation.
- Experience with Terraform (IaC), GitHub Actions (CI/CD), and event-driven architectures (SNS, SQS).
- Fluency in English is required.
- Must be based in Spain.
Nice to have
- Prior experience with Telco Core Networks (5G/LTE, IMS, Signaling) and eSIM/GSMA technologies.
- Relevant certifications such as AWS Certified DevOps Engineer or CKA.
- Experience with AI-driven SRE tools for anomaly detection.
- Contributions to open-source SRE projects or communities.
- Experience with Scrum and other agile methods.
Culture & Benefits
- Environment emphasizing real ownership, autonomy, and a direct link between shipping and business outcomes.
- Blameless culture focused on learning from mistakes and knowledge sharing.
- Fully remote work arrangement within the Spanish team.
- Collaborative atmosphere working with smart, motivated engineers at global scale.