Senior Site Reliability Engineer

Work format

hybrid

Employment type

fulltime

Grade

senior

English

Country

Spain

Relocation

Spain

Job description

Text:

TL;DR

Senior Site Reliability Engineer: Maintaining and evolving AWS and Kubernetes infrastructure for Python-based AI services, with an emphasis on platform reliability, developer experience, and infrastructure as code. Focus on migrating services to Kubernetes, improving CI/CD pipelines, and owning observability efforts.

Location: This is a hybrid role based in Barcelona, Spain. Relocation support is provided for you and your family.

Company

Manychat is building a leading Chat Marketing platform used by over 1.5 million customers worldwide, focusing on Instagram, Messenger, WhatsApp, and TikTok automations.

What you will do

Maintain and harden AWS infrastructure (EC2, ALB/NLB, WAF, IAM, CloudWatch)
Operate and evolve EKS clusters powering Python-based AI services
Migrate existing services to Kubernetes using Terraform and Helm
Codify infrastructure with Terraform and manage host-level automation via Ansible
Build and improve CI/CD pipelines with GitHub Actions
Own observability efforts: Prometheus, Grafana, alerting, and on-call readiness
Support OS-level patching, certs, WAF rules, and general infra hygiene
Partner with engineers to guide best practices and drive platform reliability
Create clean, maintainable infrastructure documentation and playbooks
Support rare off-hours incidents

Requirements

5+ years of experience managing Linux in production (Ubuntu, Amazon Linux)
Strong experience with Kubernetes (ideally EKS), Helm, and Terraform
Comfort with running and debugging Python workloads in containers
Solid understanding of networking, IAM, and cloud security best practices
Hands-on Nginx experience (Ingress and reverse proxy setups)
Excellent communication skills to explain complex infrastructure to developers clearly

Nice to have

Strong Ansible skills beyond the basics
PostgreSQL or Amazon RDS tuning and operations experience
Deep understanding of observability tools (Prometheus, Grafana, Loki, etc.)
Familiarity with PHP production environments
Experience with TDD, CI/CD best practices, and agile development
Any previous SRE-like exposure such as building resilience, automation, or incident tooling

Culture & Benefits

Hybrid onboarding to start work remotely, with relocation support for you and your family
Comprehensive health insurance for both you and your family
Professional development budget for conference tickets, online courses, and other relevant resources
Flexible benefits package that lets you tailor the perks that matter most to you
Hybrid work and generous leave options to prioritize work-life balance
In-office perks, including free meals and snacks
Company-funded sport activities, annual offsites, and team-building events