Senior Site Reliability Engineer (Telecom)
Job description
TL;DR
Senior Site Reliability Engineer (DevOps/Telecom): Strengthen the stability, scalability, and reliability of global infrastructure and services across cloud and on-prem environments, with an emphasis on SLIs/SLOs, redundancy testing, and automated recovery. The role focuses on building self-healing workflows, enhancing observability with OpenTelemetry and Prometheus, and reducing operational toil.
Location: Must be based in Berlin, Germany
Company
A technology-driven global mobile communications provider delivering connectivity solutions via an in-house eSIM platform and core network.
What you will do
- Define, measure, and maintain SLIs and SLOs for core infrastructure and customer-facing services.
- Plan and execute redundancy and resilience testing across service, infrastructure, and networking layers.
- Design and implement automated recovery mechanisms, self-healing workflows, and intelligent alerting systems.
- Drive incident response, root-cause analysis, and blameless post-mortems to ensure continuous improvement.
- Enhance observability using Prometheus, Grafana, Loki, and OpenTelemetry.
- Contribute to cloud cost-optimization and perform capacity planning and resilience audits.
Requirements
- Minimum 5 years of experience in Site Reliability, Systems, or Infrastructure Engineering, with 2+ years in a dedicated SRE role.
- Strong expertise in Linux systems engineering, distributed systems, and networking (BGP, DNS, routing, load balancing).
- Hands-on experience with Kubernetes, container orchestration, and service mesh architectures.
- Proficiency in Python, Go, and Bash for automation and reliability tooling.
- Experience with AWS (EKS, EC2, VPC) and Infrastructure as Code tools like Terraform.
- Must be based in or be able to work from Berlin, Germany.
Nice to have
- Experience in telecom or carrier-grade large-scale distributed systems environments.
- Hands-on experience with chaos engineering and automated failure-scenario validation.
- Background in capacity planning, traffic engineering, and multi-region failover.
- Familiarity with security and resilience standards such as ISO 27001 or NIST SP 800-53.
Culture & Benefits
- Rapid career growth in a company expanding over 100% year-on-year.
- High-impact exposure to transactions shaping the future of the telco industry.
- Collaboration with a talented international team and renowned external advisors.
- Opportunities to work in different offices worldwide.
- Supportive and transparent environment with an open communication culture.