Site Reliability Engineer (SRE) (Observability)

Employment type

Full-time

English

Country

Bulgaria

Vacancy from Hirify Global, a list of international tech companies

Job description

TL;DR

Site Reliability Engineer (Observability): Maintain the reliability, performance, and operational integrity of enterprise-grade infrastructure and observability pipelines, with an emphasis on Grafana, Loki, Prometheus, and AI-driven automation. The role focuses on building automation and AI workflows for incident analysis, optimizing SLIs/SLOs, and operating large-scale distributed systems.

Location: Sofia, Bulgaria

Company

hirify.global is the first AI-driven digital work platform, providing integrated solutions for Unified Endpoint Management, Virtual Apps, and Security to support flexible, secure work-from-anywhere experiences.

What you will do

Design, deploy, and maintain observability pipelines using Loki, Grafana, and Prometheus to expand logging, metrics, and tracing coverage.
Build and refine AI-driven automation workflows for incident analysis and auto-remediation.
Drive platform reliability through capacity planning, performance optimization, and root cause analysis based on SLIs/SLOs.
Participate in a global on-call rotation to manage incidents and lead post-mortem reviews.
Operate and improve internal clouds, including VCF, CloudStack, Proxmox, and Kubernetes clusters.
Use Atlassian tools (Jira, Confluence, Opsgenie) for task, change, and incident management.

Requirements

Hands-on expertise with Grafana, Loki, Tempo, and Prometheus.
Proficiency in at least one scripting or programming language.
Experience with configuration management tools such as Ansible or SaltStack.
Strong Linux skills and experience operating large-scale, highly available distributed systems.
Familiarity with Kubernetes, CI/CD, and Infrastructure as Code (IaC).
Ability to participate in on-call rotations and take leadership during incidents.

Nice to have

Exposure to AI orchestration tooling like Ollama or n8n.
Experience with S3 or open-source object stores such as Ceph or SeaweedFS.
Knowledge of virtualization stacks including Proxmox, vSphere/VCF, and CloudStack.
Background in SRE culture, specifically SLIs/SLOs and error budgeting.

Culture & Benefits

Work within an AI-driven environment focusing on autonomous workspaces and operational efficiency.
Culture guided by values of trust, inclusiveness, and maximizing customer value.
Commitment to a diverse and merit-based workforce with equal opportunities for all.
Exposure to cutting-edge AI tools for incident diagnosis and platform operations.
