Staff Site Reliability Engineer (AWS/Kubernetes)
Job description
TL;DR
Staff Site Reliability Engineer (AWS/Kubernetes): Define and elevate reliability standards across the healthcare platform, with an emphasis on systemic reliability risks and cross-cutting solutions. The role focuses on designing scalable SLO frameworks, leading complex incident response, and driving observability strategies across distributed systems.
Location: Must be based in Colombia (Virtual-first environment with hybrid options)
Company
is a technology-enabled care platform transforming healthcare delivery by providing convenient, affordable, and effective care on a global scale.
What you will do
- Define and evolve platform-wide reliability standards, patterns, and tooling.
- Design and implement cross-cutting mechanisms such as circuit breakers, retry policies, and load shedding.
- Establish scalable SLO frameworks and lead complex, multi-service incident response as an incident commander.
- Drive observability strategies using metrics, logs, traces, and alerting systems to reduce time to resolution.
- Collaborate with Platform Engineering to strengthen Kubernetes (EKS), networking, and data system reliability.
- Mentor senior engineers through design reviews and promote a culture of proactive operational excellence.
Requirements
- 8+ years of experience in SRE, infrastructure, or production engineering roles.
- Deep expertise in AWS environments and Kubernetes (preferably EKS).
- Hands-on experience with Infrastructure as Code tools such as Terraform or CDKTF.
- Advanced understanding of distributed systems, networking, and failure modes.
- Experience designing and managing observability stacks (Prometheus, Grafana, OpenTelemetry).
- Must be located in Colombia to access local benefits and medical coverage.
Nice to have
- Experience with service mesh technologies (e.g., Istio) and mTLS.
- Familiarity with GitOps workflows (e.g., ArgoCD, Flux).
- Experience working in compliance-driven environments subject to HIPAA, SOC 2, or FedRAMP.
- Exposure to chaos engineering practices and FinOps (cost-aware infrastructure design).
Culture & Benefits
- Virtual-first work environment with a hybrid allowance and work-life flexibility.
- Comprehensive health benefits including Medical Plan Coverage by Colmédica and Pan American.
- Generous leave policies: 18 weeks maternity leave and dedicated Mental Health Days.
- Summer Fridays and an annual bonus program.
- Professional growth opportunities via LinkedIn Learning and tuition reimbursement.