TL;DR

Site Reliability Engineer (Cybersecurity): Ensure the reliability, performance, and scalability of a serverless platform handling massive data ingestion and queries, with an emphasis on improving system observability, automating operational tasks, and optimizing resource utilization. The role focuses on building robust automation, leading incident response, and defining and maintaining stringent SLOs for a critical cybersecurity platform.

Location: Remote-friendly, with offices in Madrid and Barcelona, Spain.

Company

%hirify_global% is a global leader in cybersecurity, protecting modern organizations with the world's most advanced AI-native platform.

What you will do

Own the availability, latency, performance, and efficiency of NG-SIEM platform services handling massive data ingestion and millions of queries per hour.
Design and implement automation solutions for deployment, monitoring, incident response, and capacity planning.
Develop comprehensive observability solutions and proactively identify and resolve performance bottlenecks.
Lead incident response efforts, conduct blameless post-mortems, and drive continuous improvement.
Analyze system performance data and growth trends for capacity planning and efficient scaling.
Define, measure, and maintain Service Level Objectives (SLOs) and error budgets.
Participate in an on-call rotation providing 24/7 support for critical production systems.

Requirements

Experience in Site Reliability Engineering, DevOps, or similar roles supporting large-scale distributed systems.
Strong programming skills in Go for automation and tooling development.
Deep cloud expertise with hands-on experience in AWS or GCP.
Understanding of distributed system design patterns, consistency models, and fault tolerance.
Proficiency with IaC tools (Terraform) and configuration management (Ansible, Chef, Puppet).
Experience with Kubernetes, Docker, Podman, and container-based deployment patterns.
Hands-on experience with monitoring and observability tools (Prometheus, Grafana).
English: B2 required for remote collaboration across global teams.

Nice to have

3+ years owning systems handling over 1 trillion requests per day or more than 10 PB of data per day.
Multi-cloud experience (hybrid or multi-cloud environments).
Deep knowledge of distributed databases, data lakes, or SIEM platforms (e.g., ClickHouse, Redis, MySQL).
Exposure to cybersecurity, threat intelligence, or security operations.
Advanced understanding of network protocols, load balancing, and CDN technologies.

Culture & Benefits

Remote-friendly and flexible work culture.
Market leader in compensation and equity awards.
Comprehensive physical and mental wellness programs.
Competitive vacation and holiday time to recharge.
Paid parental and adoption leave.
Professional development opportunities for all employees.