TL;DR

Senior Site Reliability Engineer: Design, implement, and evolve large-scale, cloud-native infrastructure for a global SaaS platform, with an emphasis on reliability and scalability initiatives, infrastructure-as-code, and GitOps practices. The role focuses on proactively identifying systemic reliability issues, leading major incident response, and integrating reliability principles into the development lifecycle.

Location: Remote, restricted to Malaysia. %hirify_global% does not sponsor work visas or relocation.

Company

%hirify_global% is a leading database company with a global footprint, providing the backbone for applications used daily by companies including 75% of the Fortune 500.

What you will do

Design, implement, and evolve large-scale, cloud-native infrastructure supporting a global SaaS platform.
Lead reliability and scalability initiatives, driving automation and resilience through infrastructure-as-code and GitOps.
Proactively identify and remediate systemic reliability issues, ensuring high service availability and performance.
Collaborate with software and platform teams to integrate reliability principles, SLOs, and observability standards.
Act as a key technical leader during major incidents, coordinating response efforts and conducting root cause analysis.
Contribute to continuous improvement by defining infrastructure patterns and mentoring other engineers.

Requirements

At least 7 years of hands-on experience as an SRE, DevOps, or Infrastructure Engineer in production cloud environments.
Strong expertise with Kubernetes operations and ecosystem tooling in production-scale clusters.
Proven experience designing and maintaining multi-cloud infrastructure across Azure, AWS, or GCP.
Advanced proficiency with Terraform and Terragrunt for designing modular, reusable, and secure IaC components.
Solid understanding of GitOps principles and deployment automation using ArgoCD or similar tools.
Deep experience with Linux systems administration, performance tuning, and troubleshooting.
Proficiency in one or more programming/scripting languages (Python, Bash, Go preferred).
Strong understanding of observability concepts and experience with Prometheus, Grafana, and Thanos.
Experience participating in or leading on-call rotations, handling incident response, and conducting post-incident reviews.
English: B2 required.
Work authorization for Malaysia is required, as %hirify_global% does not sponsor work visas.

Nice to have

Hands-on experience with large-scale multi-cloud Kubernetes clusters or hybrid cloud setups.
Experience building or operating self-healing and auto-scaling systems.
Certification from one or more major cloud providers (Azure, AWS, or GCP) or in Kubernetes (CKA/CKAD).
Contributions to open-source projects, reliability frameworks, or IaC tooling.
Experience mentoring teams and promoting engineering excellence across an organization.

Culture & Benefits

Shape the technology industry by pushing boundaries in both technology and business models.
Collaborate with high-caliber colleagues around the world, with strong opportunities for learning and growth.
A very competitive compensation package and 25 days of paid annual leave (plus holidays).
A high degree of flexibility and freedom.
%hirify_global% is an equal opportunity employer dedicated to creating a welcoming and inclusive workplace for everyone.