Senior Reliability Engineer (AWS/Python)
Job description
TL;DR
Senior Reliability Engineer (AWS/Python): Operate, observe, and improve the reliability of distributed AWS and Kubernetes systems, with an emphasis on observability, operational maturity, and automated responses to production behavior. The role focuses on designing observability strategies, defining SLIs, SLOs, and alerting, and enhancing autoscaling, self-healing, and remediation mechanisms.
Location: LatAm, 100% Remote
Company
Leading nearshore staff augmentation provider headquartered in New York with 600+ professionals partnering with U.S. companies on digital transformation.
What you will do
- Design, implement, and improve observability strategies including metrics, logs, traces, alerts, and dashboards.
- Analyze production system behavior to identify failure modes, bottlenecks, and reliability risks.
- Evolve AWS CDK and CDK8s constructs focused on observability, autoscaling, and safeguards.
- Maintain core platform components such as VPC, EKS, RDS, OpenSearch, and MSK so that they expose operational signals.
- Operate Kubernetes addons, including ingress, cert-manager, autoscalers, and monitoring stacks.
- Define SLIs, SLOs, and alerting strategies, and improve automated responses and incident recovery.
- Collaborate on production incidents, root cause analysis, and long-term reliability improvements.
- Own CI/CD for IaC and observability components, applying SRE principles.
Requirements
- 5+ years in Site Reliability Engineering, Platform Engineering, or Infrastructure, including support of production systems.
- Strong hands-on experience operating observability for complex systems: metrics, logs, traces, dashboards, and alerts.
- Hands-on experience with AWS (VPC, IAM, RDS, MSK, S3, CloudWatch) and Kubernetes (Helm, RBAC, ServiceAccounts).
- Fluency in Python and IaC with AWS CDK, CDK8s or equivalent.
- Experience with Prometheus and Grafana, including alert tuning, noise reduction, and incident-driven improvements.
- Experience improving existing systems for operational excellence and reliability using observability data.
Nice to have
- Experience with Spark on Kubernetes, Argo, or Kafka-based batch pipelines.
Culture & Benefits
- 100% remote work with freedom to choose your location (laptop and internet required).
- Highly competitive USD compensation exceeding market rates.
- Paid time off to support well-being and recharge.
- Autonomy to manage time focusing on results.
- Work on high-impact projects with top U.S. companies.
- A culture that prioritizes work-life balance, engagement activities, and collaboration with seasoned experts.
- Diverse multicultural team across 25+ countries.