Company hidden
2 days ago

Senior Reliability Engineer (AWS/Kubernetes)

Work format
remote (Mexico only)
Employment type
full-time
Grade
senior
English
B2
Country
Mexico
Vacancy from Hirify.Global, a list of international tech companies

Job description

TL;DR

Senior Reliability Engineer (AWS/Kubernetes): Operate, observe, and improve the reliability of distributed systems, with an emphasis on observability, operational maturity, and automated responses to system behavior. The role focuses on designing observability strategies, defining SLIs/SLOs, enhancing autoscaling and self-healing mechanisms, and performing root cause analysis for production incidents.

Location: Mexico City, 100% remote

Company

Leading nearshore staff augmentation provider headquartered in New York, with 600+ tech professionals based in Latin America who partner with U.S. companies on digital transformation projects.

What you will do

  • Design and improve observability strategies including metrics, logs, traces, alerts, and dashboards across services.
  • Analyze system behavior in production to identify failure modes, bottlenecks, and risks.
  • Maintain AWS CDK/CDK8s constructs focused on observability, autoscaling, and safeguards.
  • Operate core platform components such as VPC, EKS, RDS, OpenSearch, MSK, and Kubernetes add-ons.
  • Define SLIs, SLOs, alerting strategies, and automated responses including self-healing and runbooks.
  • Collaborate on incident investigations, root cause analysis, and CI/CD for IaC, applying SRE principles throughout.

Requirements

  • 5+ years in SRE, Platform Engineering, or Infrastructure roles with hands-on production systems support
  • Strong observability experience: metrics, logs, traces, dashboards, alerts for complex systems
  • Hands-on experience with AWS (VPC, IAM, RDS, MSK, S3, CloudWatch) and Kubernetes (Helm, RBAC, ServiceAccounts)
  • Fluency in Python and IaC with AWS CDK, CDK8s, or equivalent
  • Experience with Prometheus, Grafana, alert tuning, and incident-driven monitoring improvements
  • Experience improving existing systems for operational excellence and reliability

Nice to have

  • Experience supporting Spark on Kubernetes, Argo, or Kafka-based batch pipelines

Culture & Benefits

  • 100% remote work with autonomy focused on results
  • Highly competitive USD pay
  • Paid time off for well-being
  • Work with top U.S. companies on high-impact projects
  • Diverse global network across 25+ countries, with an emphasis on work-life balance and engagement activities
