Lead Site Reliability Engineer
Job description
TL;DR
Lead Site Reliability Engineer (SRE): Lead reliability engineering across the company's global SaaS platform as it scales and shifts toward an AI-first operating model, with an emphasis on building an automation-first reliability ecosystem. Focus on designing self-healing systems, advancing observability, modernising deployment practices, and integrating AI-driven operations to improve stability, reduce operational risk, and enable faster, safer delivery.
Location: Romania
Company
The company builds a cloud compliance platform and operates an AI-first approach across workflows, decision-making, and products.
What you will do
- Own and evolve the reliability strategy for distributed SaaS systems across multi-cloud platforms.
- Design and implement AI-driven operations (predictive monitoring, anomaly detection, automated root cause analysis).
- Build and scale observability using Prometheus, Grafana, and OpenTelemetry.
- Create self-healing systems and automation frameworks to reduce manual operational work.
- Improve deployment practices with feature flags, progressive delivery, and safe rollout strategies.
- Ensure reliability and performance of CI/CD pipelines and infrastructure as code; lead incident response and recovery improvements.
Requirements
- 10+ years of experience in SaaS, distributed systems, or site reliability engineering.
- Programming skills in Go, Java, or Python.
- Deep experience with observability tools: Prometheus, Grafana, and OpenTelemetry.
- Hands-on experience with Kubernetes, containerisation, and multi-cloud platforms (AWS, GCP, Azure, or OCI).
- Strong understanding of Linux systems, networking, and cloud-native architectures.
- Proven ability to design automation, improve system reliability, and apply AI or machine learning to operational workflows.
Culture & Benefits
- Paid time off and paid parental leave; bonuses may be available.
- Health & wellness benefits including private medical, life, and disability insurance (varies by location).
- Inclusive culture with employee-run resource groups and senior leadership/executive sponsorship.
- AI-first environment where AI is embedded in workflows, decision-making, and products.
Hiring process
- Application via the careers portal.