TL;DR
Senior Site Reliability Engineer: Maintaining and evolving AWS and Kubernetes infrastructure for Python-based AI services, with an emphasis on platform reliability, developer experience, and infrastructure as code. The focus is on migrating services to Kubernetes, improving CI/CD pipelines, and owning observability.
Location: This is a hybrid role based in Barcelona, Spain. Relocation support is provided for you and your family.
Company
Manychat is building a leading Chat Marketing platform used by over 1.5 million customers worldwide, focusing on Instagram, Messenger, WhatsApp, and TikTok automations.
What you will do
- Maintain and harden AWS infrastructure (EC2, ALB/NLB, WAF, IAM, CloudWatch)
- Operate and evolve EKS clusters powering Python-based AI services
- Migrate existing services to Kubernetes using Terraform and Helm
- Codify infrastructure with Terraform and manage host-level automation via Ansible
- Build and improve CI/CD pipelines with GitHub Actions
- Own observability efforts: Prometheus, Grafana, alerting, and on-call readiness
- Support OS-level patching, certs, WAF rules, and general infra hygiene
- Partner with engineers to guide best practices and drive platform reliability
- Create clean, maintainable infrastructure documentation and playbooks
- Support rare off-hours incidents
Requirements
- 5+ years of experience managing Linux in production (Ubuntu, Amazon Linux)
- Strong experience with Kubernetes (ideally EKS), Helm, and Terraform
- Comfort with running and debugging Python workloads in containers
- Solid understanding of networking, IAM, and cloud security best practices
- Hands-on Nginx experience (Ingress and reverse proxy setups)
- Excellent communication skills, with the ability to explain complex infrastructure clearly to developers
Nice to have
- Strong Ansible skills beyond the basics
- PostgreSQL or Amazon RDS tuning and operations experience
- Deep understanding of observability tools (Prometheus, Grafana, Loki, etc.)
- Familiarity with PHP production environments
- Experience with TDD, CI/CD best practices, and agile development
- Previous SRE-like experience, such as building resilience, automation, or incident tooling
Culture & Benefits
- Hybrid onboarding that lets you start working remotely, plus relocation support for you and your family
- Comprehensive health insurance for both you and your family
- Professional development budget for conference tickets, online courses, and other relevant resources
- Flexible benefits package that lets you choose the perks that matter most to you
- Hybrid work and generous leave options to prioritize work-life balance
- In-office perks, including free meals and snacks
- Company-funded sports activities, annual offsites, and team-building events