Company hidden
2 days ago

Senior Reliability Engineer - Technology

Work format
remote (Global)
Employment type
full-time
Grade
senior
English
B2
Country
US/Colombia

Job description

TL;DR

Senior Reliability Engineer (AWS/Kubernetes): Operate, observe, and improve the reliability of distributed systems, with an emphasis on observability, automated responses, and production behavior analysis. The role focuses on designing SLIs/SLOs, enhancing autoscaling and self-healing, and driving incident root cause analysis for high resilience.

Location: 100% Remote (Bogota office mentioned; team primarily based in Latin America)

Company

Leading nearshore staff augmentation provider headquartered in New York, partnering with U.S. companies and delivering tech solutions with 600+ professionals across Latin America.

What you will do

  • Design and improve observability strategies including metrics, logs, traces, alerts, and dashboards across services.
  • Analyze production system behavior, failure modes, bottlenecks, and reliability risks.
  • Maintain AWS CDK/CDK8s constructs for observability, autoscaling, and safeguards; operate VPC, EKS, RDS, OpenSearch, MSK.
  • Enhance Kubernetes add-ons such as ingress, cert-manager, autoscalers, and monitoring stacks.
  • Define SLIs, SLOs, alerting; improve automated responses, self-healing, and runbooks.
  • Collaborate on incident investigations, RCA, and CI/CD for IaC; apply SRE principles such as error budgets.

Requirements

  • 5+ years in SRE, Platform Engineering, or Infrastructure with production systems experience.
  • Strong observability ops: metrics, logs, traces, dashboards, alerts for complex systems.
  • Hands-on experience with AWS (VPC, IAM, RDS, MSK, S3, CloudWatch) and Kubernetes (Helm, RBAC, ServiceAccounts).
  • Fluency in Python; IaC with AWS CDK, CDK8s or equivalent.
  • Experience with Prometheus, Grafana, alert tuning, and incident monitoring improvements.
  • Experience improving existing systems for operational excellence using observability data.

Nice to have

  • Supporting Spark on Kubernetes, Argo, or Kafka batch pipelines.
  • Designing reusable infrastructure/observability patterns or platform tooling.

Culture & Benefits

  • 100% remote work with autonomy focused on results.
  • Competitive USD pay, paid time off for well-being.
  • Work with top U.S. companies on high-impact projects.
  • Culture emphasizing work-life balance, engagement activities, multicultural team across 25+ countries.
  • Collaborate with senior experts in a dynamic, diverse network.
