Site Reliability Engineer (SRE) (Observability)
Job description
TL;DR
Site Reliability Engineer (Observability): Maintaining the reliability, performance, and operational integrity of enterprise-grade infrastructure and observability pipelines, with an emphasis on Grafana, Loki, Prometheus, and AI-driven automation. The role focuses on building automation and AI workflows for incident analysis, optimizing SLIs/SLOs, and operating large-scale distributed systems.
Location: Sofia, Bulgaria
Company
The company is the first AI-driven digital work platform, providing integrated solutions for Unified Endpoint Management, Virtual Apps, and Security to support flexible, secure work-from-anywhere experiences.
What you will do
- Design, deploy, and maintain observability pipelines using Loki, Grafana, and Prometheus to expand logging, metrics, and tracing coverage.
- Build and refine AI-driven automation workflows for incident analysis and auto-remediation.
- Drive platform reliability through capacity planning, performance optimization, and root cause analysis based on SLIs/SLOs.
- Participate in a global on-call rotation to manage incidents and lead post-mortem reviews.
- Operate and improve internal clouds, including vCF, CloudStack, Proxmox, and Kubernetes clusters.
- Use Atlassian tools (Jira, Confluence, Opsgenie) for task, change, and incident management.
Requirements
- Hands-on expertise with Grafana, Loki, Tempo, and Prometheus.
- Proficiency in at least one scripting or programming language.
- Experience with configuration management tools such as Ansible or SaltStack.
- Strong Linux skills and experience operating large-scale, highly available distributed systems.
- Familiarity with Kubernetes, CI/CD, and Infrastructure as Code (IaC).
- Ability to participate in on-call rotations and take leadership during incidents.
Nice to have
- Exposure to AI orchestration tooling like Ollama or n8n.
- Experience with S3 or open-source object stores such as Ceph or SeaweedFS.
- Knowledge of virtualization stacks including Proxmox, vSphere/vCF, and CloudStack.
- Background in SRE culture, specifically SLIs/SLOs and error budgeting.
Culture & Benefits
- Work within an AI-driven environment focusing on autonomous workspaces and operational efficiency.
- Culture guided by values of trust, inclusiveness, and maximizing customer value.
- Commitment to a diverse and merit-based workforce with equal opportunities for all.
- Exposure to cutting-edge AI tools for incident diagnosis and platform operations.