Staff Site Reliability Engineer
Job description
TL;DR
Staff Site Reliability Engineer (Cloud-native SRE): Own and evolve the containerized platform underpinning critical workloads, with an emphasis on Kubernetes clusters, service mesh, and Infrastructure as Code. Focus on designing resilient infrastructure, closing automation gaps, and applying SRE principles for high availability and operational excellence.
Location: Remote; candidates must be based in Brazil. Remote positions may require presence at a Visa office, scheduled with notice.
Company
Technology company providing a cloud-based processing platform for banking, card issuing, and payments. Joined Visa in 2024; 500+ employees across 10+ countries.
What you will do
- Own the end-to-end lifecycle of core platform components, including cloud primitives, Kubernetes clusters, networking, ingress, service mesh, and data plane.
- Design and build a highly reliable containerized platform, applying SRE and cloud-native best practices.
- Lead infrastructure bootstrap orchestration for deterministic platform bring-up and teardown.
- Drive an Infrastructure-as-Code and GitOps approach for reproducible, auditable, and automated platform components.
- Identify automation gaps, reduce manual effort, and promote SRE principles like fault isolation and capacity planning.
- Assess reliability risks, participate in on-call and incident response, and improve operability and MTTD/MTTR.
- Collaborate with engineering teams, contribute to architecture, and stay current with emerging SRE technologies.
Requirements
- Based in Brazil
- English proficiency at B1 level or above
- Strong hands-on experience with public cloud platforms, preferably AWS (Azure valued).
- Deep experience operating Kubernetes at scale (EKS or equivalent), including cluster lifecycle.
- Strong expertise with service mesh (Istio preferred, or App Mesh or Linkerd).
- Advanced knowledge of IaC with Terraform.
- Experience with observability tooling (logs, metrics, traces, Golden Signals) and debugging distributed systems.
- Experience with incident management, on-call, and reliability engineering practices.
Nice to have
- SRE or Platform Engineering background at senior/staff level.
- Experience with critical systems, large-scale automation, security/compliance in cloud.
- 5+ years of experience with a Bachelor’s degree, or 2+ years with an advanced degree.
Culture & Benefits
- Apply strong SRE principles for operational excellence and sustainability.
- Cross-functional collaboration and technical leadership opportunities.
- Focus on high-impact solutions, continuous improvement, and sharing insights.