Senior DevOps / Platform Reliability Engineer (AI)

Work format

remote (Global)

Employment type

full-time

Grade

senior

English

Country

Vacancy from Hirify Global, a list of international tech companies

Job description

TL;DR

Senior DevOps / Platform Reliability Engineer (AWS/Kubernetes): Building and scaling the infrastructure and observability backbone for an agentic CX automation platform, with an emphasis on AI-native DevOps and multi-tenant enterprise isolation. The role focuses on designing auto-remediation agents, implementing LLM-based observability, and ensuring SOC 2/HIPAA compliance.

Location: Flexible remote work from anywhere

Company

hirify.global is an intelligent process automation platform reimagining customer experience operations for top support leaders.

What you will do

Own and evolve CI/CD pipelines using GitHub Actions and OIDC-based authentication for microservices and agentic workloads.
Operate and scale the Kubernetes platform (EKS + Argo CD), ensuring multi-tenant isolation for enterprise customers.
Automate infrastructure provisioning using Terraform and CloudFormation.
Build observability as a product using Prometheus, Grafana, and OpenTelemetry, including telemetry for LLM and agentic systems.
Develop AI-native DevOps capabilities, including auto-remediation agents for production toil and LLM-driven incident triage.
Strengthen security and compliance posture for SOC 2 and HIPAA using least-privilege IAM and automated evidence collection.

Requirements

5+ years of experience in DevOps, SRE, or Platform Engineering operating production systems on AWS.
Deep expertise with production EKS environments, Argo CD, and Infrastructure as Code (Terraform/CloudFormation).
Strong AWS networking experience (VPC, Route 53, ALB/NLB) and data tier management (Aurora MySQL, Redis, MSK/Kafka).
Proven experience with Prometheus, Grafana, and OpenTelemetry.
Experience managing Cloudflare services (WAF, Bot Management, Zero Trust) and Lambda workloads.
Proficiency in Python, Bash, and Linux systems.

Nice to have

Experience operating LLM or ML workloads in production (Bedrock, pgvector, LiteLLM).
Experience building MCP servers or deploying agent frameworks such as LangGraph or CrewAI.

Culture & Benefits

100% of employee health premiums and 75-80% of dependent premiums covered.
401(k) plans to support retirement planning.
Unlimited PTO and paid parental leave.
$200/month co-working reimbursement and home office stipend ($500 setup + $100/month for utilities).
Engineering culture that biases toward automation over toil and values blameless incident reviews.
