Company hidden
Posted 2 days ago

Senior Reliability Engineer (AWS/Python)

Work format: remote
Employment type: full-time
Grade: senior
English: B2
Vacancy from Hirify.Global, a list of international tech companies

Job description

TL;DR

Senior Reliability Engineer (AWS/Python): Operate, observe, and improve the reliability of distributed AWS and Kubernetes systems, with an emphasis on observability, operational maturity, and automated responses to production behavior. The role focuses on designing observability strategies, defining SLIs, SLOs, and alerting, and enhancing autoscaling, self-healing, and remediation mechanisms.

Location: LatAm, 100% Remote

Company

A leading nearshore staff augmentation provider headquartered in New York, with 600+ professionals who partner with U.S. companies on digital transformation.

What you will do

  • Design, implement, and improve observability strategies including metrics, logs, traces, alerts, and dashboards.
  • Analyze production system behavior to identify failure modes, bottlenecks, and reliability risks.
  • Evolve AWS CDK and CDK8s constructs focused on observability, autoscaling, and safeguards.
  • Maintain core platform components (VPC, EKS, RDS, OpenSearch, MSK) that expose operational signals.
  • Operate Kubernetes add-ons, including ingress, cert-manager, autoscalers, and monitoring stacks.
  • Define SLIs, SLOs, and alerting strategies, and improve automated responses and incident recovery.
  • Collaborate on production incidents, root cause analysis, and long-term reliability improvements.
  • Own CI/CD for IaC and observability components, applying SRE principles.

Requirements

  • 5+ years in Site Reliability Engineering, Platform Engineering, or Infrastructure roles supporting production systems.
  • Strong experience operating observability for complex systems: metrics, logs, traces, dashboards, and alerts.
  • Hands-on AWS (VPC, IAM, RDS, MSK, S3, CloudWatch) and Kubernetes (Helm, RBAC, ServiceAccounts).
  • Fluency in Python and in IaC with AWS CDK, CDK8s, or an equivalent.
  • Experience with Prometheus and Grafana, including alert tuning, noise reduction, and incident-driven improvements.
  • Experience improving existing systems for operational excellence and reliability using observability data.

Nice to have

  • Experience with Spark on Kubernetes, Argo, or Kafka-based batch pipelines.

Culture & Benefits

  • 100% remote work with freedom to choose your location (laptop and internet required).
  • Highly competitive USD compensation exceeding market rates.
  • Paid time off to support well-being and recharge.
  • Autonomy to manage your own time, with a focus on results.
  • Work on high-impact projects with top U.S. companies.
  • A culture that prioritizes work-life balance, engagement activities, and collaboration with seasoned experts.
  • Diverse multicultural team across 25+ countries.
