Senior Reliability Engineer (AWS/Kubernetes)

Work format

remote (Brazil only)

Employment type

full-time

Grade

senior

English

Country

US/Brazil

Vacancy from Hirify Global, a list of international tech companies

Job description

Text:

TL;DR

Senior Reliability Engineer (AWS/Kubernetes): Operating, observing, and improving the reliability of distributed systems running on AWS and Kubernetes, with an emphasis on observability, operational maturity, and automated responses to system behavior. Focus on defining SLIs/SLOs, enhancing autoscaling and self-healing mechanisms, and driving root cause analysis for production incidents.

São Paulo, 100% Remote

Company

A leading nearshore staff augmentation provider headquartered in New York, with 600+ tech professionals based in Latin America, partnering with U.S. companies on digital transformation projects.

What you will do

Design and improve observability strategies including metrics, logs, traces, alerts, and dashboards across services.
Analyze production system behavior to identify failure modes, bottlenecks, and reliability risks.
Maintain and evolve AWS CDK/CDK8s constructs focused on observability, autoscaling, and safeguards.
Operate core platform components like VPC, EKS clusters, RDS, OpenSearch, MSK, and Kubernetes addons.
Define SLIs/SLOs, alerting strategies, and automated responses including self-healing and runbooks.
Collaborate on incident investigations, root cause analysis, and CI/CD for IaC, applying SRE principles throughout.

Requirements

5+ years in Site Reliability Engineering, Platform Engineering, or Infrastructure roles with production systems experience.
Strong experience operating observability for complex systems: metrics, logs, traces, dashboards, and alerts.
Hands-on with AWS services (VPC, IAM, RDS, MSK, S3, CloudWatch) and Kubernetes (Helm, RBAC, ServiceAccounts).
Fluency in Python and IaC with AWS CDK, CDK8s or equivalent.
Experience with Prometheus, Grafana, alert tuning, and incident-driven monitoring improvements.
Experience improving existing systems for operational excellence and reliability using observability data.

Nice to have

Experience supporting Spark on Kubernetes, Argo, or Kafka-based batch pipelines.

Culture & Benefits

100% remote work with freedom to choose your location; all you need is a laptop and an internet connection.
Highly competitive USD compensation exceeding market rates.
Paid time off policies for well-being and recharge.
Autonomy in managing time, focus on results over hours.
Work on high-impact projects with top U.S. companies.
Well-being-focused culture with work-life balance, engagement activities, and a multicultural team across 25+ countries.
