Senior Reliability Engineer - Technology

Work format

remote (Global)

Employment type

full-time

Grade

senior

English

Country

US/Colombia

Listing from Hirify Global, a list of international tech companies

Job description

Text:

TL;DR

Senior Reliability Engineer (AWS/Kubernetes): Operating, observing, and improving the reliability of distributed systems, with an emphasis on observability, automated responses, and production behavior analysis. Focus on designing SLIs/SLOs, enhancing autoscaling and self-healing, and driving incident root cause analysis for high resilience.

Location: 100% Remote (Bogota office mentioned; team primarily based in Latin America)

Company

Leading nearshore staff augmentation provider headquartered in New York, partnering with U.S. companies and delivering tech solutions with 600+ professionals across Latin America.

What you will do

Design and improve observability strategies including metrics, logs, traces, alerts, and dashboards across services.
Analyze production system behavior, failure modes, bottlenecks, and reliability risks.
Maintain AWS CDK/CDK8s constructs for observability, autoscaling, and safeguards; operate VPC, EKS, RDS, OpenSearch, MSK.
Enhance Kubernetes add-ons such as ingress, cert-manager, autoscalers, and monitoring stacks.
Define SLIs, SLOs, and alerting; improve automated responses, self-healing, and runbooks.
Collaborate on incident investigations, root cause analysis (RCA), and CI/CD for IaC; apply SRE principles such as error budgets.

Requirements

5+ years in SRE, Platform Engineering, or Infrastructure with production systems experience.
Strong observability operations skills: metrics, logs, traces, dashboards, and alerts for complex systems.
Hands-on experience with AWS (VPC, IAM, RDS, MSK, S3, CloudWatch) and Kubernetes (Helm, RBAC, ServiceAccounts).
Fluency in Python; IaC with AWS CDK, CDK8s, or equivalent.
Experience with Prometheus, Grafana, alert tuning, and incident monitoring improvements.
Experience improving existing systems for operational excellence using observability data.

Nice to have

Supporting Spark on Kubernetes, Argo, or Kafka batch pipelines.
Designing reusable infrastructure/observability patterns or platform tooling.

Culture & Benefits

100% remote work with autonomy and a focus on results.
Competitive pay in USD and paid time off for well-being.
Work with top U.S. companies on high-impact projects.
Culture emphasizing work-life balance, engagement activities, and a multicultural team across 25+ countries.
Collaborate with senior experts in a dynamic, diverse network.
