Senior Site Reliability Engineer (Kubernetes)
Job description
TL;DR
Senior Site Reliability Engineer (Kubernetes): Build, operate, and scale a global multi-region SaaS platform, with an emphasis on infrastructure automation, Kubernetes orchestration, and system reliability. The role focuses on designing resilient service mesh architectures, maintaining multi-region data layers, and optimizing CI/CD workflows for high-availability production environments.
Location: Must be based in Canada
Salary: CA$144,780 – CA$202,825
Company
The company is a leading developer of API and AI connectivity technology, providing a unified platform to secure, manage, and accelerate the flow of intelligence across APIs and AI models.
What you will do
- Operate and scale the global Konnect SaaS platform to ensure reliability, availability, and high performance across multiple cloud providers.
- Manage Kubernetes-based infrastructure using Terraform, Terragrunt, Helm, and ArgoCD to drive consistent service delivery.
- Design and optimize multi-region data layers and caching systems including PostgreSQL, Redis, ClickHouse, and Druid.
- Maintain and enhance CI/CD pipelines and GitOps workflows to automate infrastructure changes.
- Drive observability and incident response through monitoring tools like Datadog, Prometheus, Grafana, and Thanos.
- Participate in a 24/7 on-call rotation and lead continuous improvement through postmortems and playbooks.
Requirements
- Must be based in Canada
- Deep expertise in Kubernetes, specifically in debugging networking and cluster-level issues.
- Strong proficiency in Infrastructure as Code using Terraform or Terragrunt.
- Experience with CI/CD and GitOps tooling such as ArgoCD or Helm.
- Proficiency in at least one programming language: Go, Python, or Bash.
- Solid understanding of distributed systems, Linux/Unix, and networking (DNS, TLS/SSL).
- Professional experience in a 24/7/365 production support environment.
Nice to have
- Hands-on experience with Gateway or Mesh.
- Experience operating time-series and analytics databases like ClickHouse or Druid.
- Knowledge of cloud networking (AWS PrivateLink, VPC Peering, or similar).
- Understanding of disaster recovery and compliance-driven reliability practices.
Culture & Benefits
- Competitive salary and compensation packages based on market benchmarks.
- Opportunities to work on high-load, global infrastructure impacting thousands of customers.
- Collaborative environment working alongside security and development teams on mission-critical SaaS systems.
- Commitment to operational excellence through robust postmortem and playbook practices.