Company hidden
3 days ago

Site Reliability Engineer (SRE) (Observability)

Job type
Full-time
English
B2
Country
Bulgaria

Job description

TL;DR

Site Reliability Engineer (Observability): Maintain the reliability, performance, and operational integrity of enterprise-grade infrastructure and observability pipelines, with an emphasis on Grafana, Loki, Prometheus, and AI-driven automation. The role focuses on building automation and AI workflows for incident analysis, optimizing SLIs/SLOs, and operating large-scale distributed systems.

Location: Sofia, Bulgaria

Company

hirify.global is the first AI-driven digital work platform, providing integrated solutions for Unified Endpoint Management, Virtual Apps, and Security to support flexible, secure work-from-anywhere experiences.

What you will do

  • Design, deploy, and maintain observability pipelines using Loki, Grafana, and Prometheus to expand logging, metrics, and tracing coverage.
  • Build and refine AI-driven automation workflows for incident analysis and auto-remediation.
  • Drive platform reliability through capacity planning, performance optimization, and root cause analysis based on SLIs/SLOs.
  • Participate in a global on-call rotation to manage incidents and lead post-mortem reviews.
  • Operate and improve internal clouds, including vCF, CloudStack, Proxmox, and Kubernetes clusters.
  • Use Atlassian tools (Jira, Confluence, Opsgenie) for task, change, and incident management.

Requirements

  • Hands-on expertise with Grafana, Loki, Tempo, and Prometheus.
  • Proficiency in at least one scripting or programming language.
  • Experience with configuration management tools such as Ansible or SaltStack.
  • Strong Linux skills and experience operating large-scale, highly available distributed systems.
  • Familiarity with Kubernetes, CI/CD, and Infrastructure as Code (IaC).
  • Ability to participate in on-call rotations and take leadership during incidents.

Nice to have

  • Exposure to AI orchestration tooling like Ollama or n8n.
  • Experience with S3 or open-source object stores such as Ceph or SeaweedFS.
  • Knowledge of virtualization stacks including Proxmox, vSphere/vCF, and CloudStack.
  • Background in SRE culture, specifically SLIs/SLOs and error budgeting.

Culture & Benefits

  • Work within an AI-driven environment focusing on autonomous workspaces and operational efficiency.
  • Culture guided by values of trust, inclusiveness, and maximizing customer value.
  • Commitment to a diverse and merit-based workforce with equal opportunities for all.
  • Exposure to cutting-edge AI tools for incident diagnosis and platform operations.
