Senior Reliability Engineer (AWS/Python)

Work format

remote

Employment type

fulltime

Grade

senior

English

Job description

TL;DR

Senior Reliability Engineer (AWS/Python): Operate, observe, and improve the reliability of distributed AWS and Kubernetes systems, with an emphasis on observability, operational maturity, and automated responses to production behavior. The role focuses on designing observability strategies, defining SLIs, SLOs, and alerting, and enhancing autoscaling, self-healing, and remediation mechanisms.

Location: LatAm, 100% Remote

Company

Leading nearshore staff augmentation provider headquartered in New York with 600+ professionals partnering with U.S. companies on digital transformation.

What you will do

Design, implement, and improve observability strategies including metrics, logs, traces, alerts, and dashboards.
Analyze production system behavior to identify failure modes, bottlenecks, and reliability risks.
Evolve AWS CDK and CDK8s constructs focused on observability, autoscaling, and safeguards.
Maintain core platform components such as VPC, EKS, RDS, OpenSearch, and MSK, and ensure they expose operational signals.
Operate Kubernetes addons, including ingress, cert-manager, autoscalers, and monitoring stacks.
Define SLIs, SLOs, and alerting strategies, and improve automated responses and incident recovery.
Collaborate on production incidents, root cause analysis, and long-term reliability improvements.
Own CI/CD for IaC and observability components, applying SRE principles.

Requirements

5+ years in Site Reliability Engineering, Platform Engineering, or Infrastructure, supporting production systems.
Strong operational experience with observability: metrics, logs, traces, dashboards, and alerts for complex systems.
Hands-on AWS (VPC, IAM, RDS, MSK, S3, CloudWatch) and Kubernetes (Helm, RBAC, ServiceAccounts).
Fluency in Python and IaC with AWS CDK, CDK8s, or equivalent.
Experience with Prometheus and Grafana, including alert tuning, noise reduction, and incident-driven improvements.
Experience improving existing systems for operational excellence and reliability using observability data.

Nice to have

Experience with Spark on Kubernetes, Argo, or Kafka-based batch pipelines.

Culture & Benefits

100% remote work with freedom to choose your location (laptop and internet required).
Highly competitive USD compensation exceeding market rates.
Paid time off to support well-being and recharge.
Autonomy to manage your own time, with a focus on results.
Work on high-impact projects with top U.S. companies.
Culture prioritizing work-life balance, engagement activities, and collaboration with seasoned experts.
Diverse multicultural team across 25+ countries.
