Senior Reliability Engineer (AWS/Kubernetes)
Job description
TL;DR
Senior Reliability Engineer (AWS/Kubernetes): Operate, observe, and improve the reliability of distributed systems running on AWS and Kubernetes, with an emphasis on observability, operational maturity, and automated responses to system behavior. The role focuses on defining SLIs/SLOs, enhancing autoscaling and self-healing mechanisms, and driving root cause analysis for production incidents.
São Paulo, 100% Remote
Company
Leading nearshore staff augmentation provider headquartered in New York with 600+ tech professionals based in Latin America partnering with U.S. companies on digital transformation projects.
What you will do
- Design and improve observability strategies including metrics, logs, traces, alerts, and dashboards across services.
- Analyze production system behavior to identify failure modes, bottlenecks, and reliability risks.
- Maintain and evolve AWS CDK/CDK8s constructs focused on observability, autoscaling, and safeguards.
- Operate core platform components like VPC, EKS clusters, RDS, OpenSearch, MSK, and Kubernetes addons.
- Define SLIs/SLOs, alerting strategies, and automated responses including self-healing and runbooks.
- Collaborate on incident investigations, root cause analysis, and CI/CD for IaC, applying SRE principles throughout.
Requirements
- 5+ years in Site Reliability Engineering, Platform Engineering, or Infrastructure roles with production systems experience.
- Strong experience operating observability for complex systems: metrics, logs, traces, dashboards, and alerts.
- Hands-on with AWS services (VPC, IAM, RDS, MSK, S3, CloudWatch) and Kubernetes (Helm, RBAC, ServiceAccounts).
- Fluency in Python and IaC with AWS CDK, CDK8s or equivalent.
- Experience with Prometheus and Grafana, alert tuning, and incident-driven monitoring improvements.
- Experience improving existing systems for operational excellence and reliability using observability data.
Nice to have
- Experience supporting Spark on Kubernetes, Argo, or Kafka-based batch pipelines.
Culture & Benefits
- 100% remote work with freedom to choose your location; all you need is a laptop and an internet connection.
- Highly competitive USD compensation exceeding market rates.
- Paid time off policies for well-being and recharge.
- Autonomy in managing your time, with a focus on results over hours.
- Work on high-impact projects with top U.S. companies.
- Well-being focused culture, work-life balance, engagement activities, multicultural team across 25+ countries.