TL;DR
Senior Site Reliability Engineer: Designing, implementing, and evolving large-scale, cloud-native infrastructure for a global SaaS platform, with an emphasis on reliability and scalability initiatives, infrastructure-as-code, and GitOps practices. The role focuses on proactively identifying systemic reliability issues, leading major incident response, and integrating reliability principles into the development lifecycle.
Location: Remote work is restricted to Malaysia only. hirify.global does not sponsor work visas or relocation.
Company
hirify.global is a leading database company making a significant impact globally, providing the backbone for applications used daily by companies, including 75% of the Fortune 500.
What you will do
- Design, implement, and evolve large-scale, cloud-native infrastructure supporting a global SaaS platform.
- Lead reliability and scalability initiatives, driving automation and resilience through infrastructure-as-code and GitOps.
- Proactively identify and remediate systemic reliability issues, ensuring high service availability and performance.
- Collaborate with software and platform teams to integrate reliability principles, SLOs, and observability standards.
- Act as a key technical leader during major incidents, coordinating response efforts and conducting root cause analysis.
- Contribute to continuous improvement by defining infrastructure patterns and mentoring other engineers.
Requirements
- At least 7 years of hands-on experience as an SRE, DevOps, or Infrastructure Engineer in production cloud environments.
- Strong expertise with Kubernetes operations and ecosystem tooling in production-scale clusters.
- Proven experience designing and maintaining multi-cloud infrastructure across Azure, AWS, or GCP.
- Advanced proficiency with Terraform and Terragrunt for designing modular, reusable, and secure IaC components.
- Solid understanding of GitOps principles and deployment automation using ArgoCD or similar tools.
- Deep experience with Linux systems administration, performance tuning, and troubleshooting.
- Proficiency in one or more programming/scripting languages (Python, Bash, Go preferred).
- Strong understanding of observability concepts and experience with Prometheus, Grafana, and Thanos.
- Experience participating in or leading on-call rotations, handling incident response, and conducting post-incident reviews.
- English: B2 required.
- Work authorization for Malaysia is required, as hirify.global does not sponsor work visas.
Nice to have
- Hands-on experience with large-scale multi-cloud Kubernetes clusters or hybrid cloud setups.
- Experience building or operating self-healing and auto-scaling systems.
- Certification in one or more major cloud providers (Azure, AWS, or GCP) or Kubernetes (CKA/CKAD).
- Contributions to open-source projects, reliability frameworks, or IaC tooling.
- Experience mentoring teams and promoting engineering excellence across an organization.
Culture & Benefits
- Make an impact on the world of technology by pushing the boundaries of both technology and business models.
- Collaborate with high-caliber colleagues around the world, offering unparalleled learning and growth opportunities.
- Receive a very competitive compensation package and 25 days of paid annual leave (plus holidays).
- Enjoy a massive degree of flexibility and freedom.
- hirify.global is an equal opportunity employer dedicated to creating a welcoming and inclusive workplace for everyone.