Company hidden
3 hours ago

Senior DevOps / Platform Reliability Engineer (AI)

Work format
remote (Global)
Employment type
full-time
Grade
senior
English
B2
Country
US
Vacancy from Hirify Global, a list of international tech companies

Job description

TL;DR

Senior DevOps / Platform Reliability Engineer (AWS/Kubernetes): Build and scale the infrastructure and observability backbone for an agentic CX automation platform, with an emphasis on AI-native DevOps and multi-tenant enterprise isolation. The role focuses on designing auto-remediation agents, implementing LLM-based observability, and ensuring SOC 2/HIPAA compliance.

Location: Flexible remote work from anywhere

Company

hirify.global is an intelligent process automation platform reimagining customer experience operations for top support leaders.

What you will do

  • Own and evolve CI/CD pipelines using GitHub Actions and OIDC-based authentication for microservices and agentic workloads.
  • Operate and scale the Kubernetes platform (EKS + Argo CD), ensuring multi-tenant isolation for enterprise customers.
  • Automate infrastructure provisioning using Terraform and CloudFormation.
  • Build observability as a product using Prometheus, Grafana, and OpenTelemetry, including telemetry for LLM and agentic systems.
  • Develop AI-native DevOps capabilities, including auto-remediation agents for production toil and LLM-driven incident triage.
  • Strengthen security and compliance posture for SOC 2 and HIPAA using least-privilege IAM and automated evidence collection.

Requirements

  • 5+ years of experience in DevOps, SRE, or Platform Engineering operating production systems on AWS.
  • Deep expertise with production EKS environments, Argo CD, and Infrastructure as Code (Terraform/CloudFormation).
  • Strong AWS networking experience (VPC, Route 53, ALB/NLB) and data tier management (Aurora MySQL, Redis, MSK/Kafka).
  • Proven experience with Prometheus, Grafana, and OpenTelemetry.
  • Experience managing Cloudflare services (WAF, Bot Management, Zero Trust) and Lambda workloads.
  • Proficiency in Python, Bash, and Linux systems.

Nice to have

  • Experience operating LLM or ML workloads in production (Bedrock, pgvector, LiteLLM).
  • Experience building MCP servers or deploying agent frameworks such as LangGraph or CrewAI.

Culture & Benefits

  • 100% of employee health premiums and 75-80% of dependent premiums covered.
  • 401(k) plans to support retirement planning.
  • Unlimited PTO and paid parental leave.
  • $200/month co-working reimbursement and home office stipend ($500 setup + $100/month for utilities).
  • Engineering culture that biases toward automation over toil and values blameless incident reviews.
