Senior DevOps / Platform Reliability Engineer (AI)
Мэтч & Сопровод
Для мэтча с этой вакансией нужен Plus
Описание вакансии
TL;DR
Senior DevOps / Platform Reliability Engineer (AWS/Kubernetes): Building and scaling the infrastructure and observability backbone for an agentic CX automation platform with an accent on AI-native DevOps and multi-tenant enterprise isolation. Focus on designing auto-remediation agents, implementing LLM-based observability, and ensuring SOC 2/HIPAA compliance.
Location: Flexible remote work from anywhere
Company
is an intelligent process automation platform reimagining customer experience operations for top support leaders.
What you will do
- Own and evolve CI/CD pipelines using GitHub Actions and OIDC-based authentication for microservices and agentic workloads.
- Operate and scale the Kubernetes platform (EKS + Argo CD), ensuring multi-tenant isolation for enterprise customers.
- Automate infrastructure provisioning using Terraform and CloudFormation.
- Build observability as a product using Prometheus, Grafana, and OpenTelemetry, including telemetry for LLM and agentic systems.
- Develop AI-native DevOps capabilities, including auto-remediation agents for production toil and LLM-driven incident triage.
- Strengthen security and compliance posture for SOC 2 and HIPAA using least-privilege IAM and automated evidence collection.
Requirements
- 5+ years of experience in DevOps, SRE, or Platform Engineering operating production systems on AWS.
- Deep expertise with production EKS environments, Argo CD, and Infrastructure as Code (Terraform/CloudFormation).
- Strong AWS networking experience (VPC, Route 53, ALB/NLB) and data tier management (Aurora MySQL, Redis, MSK/Kafka).
- Proven experience with Prometheus, Grafana, and OpenTelemetry.
- Experience managing Cloudflare services (WAF, Bot Management, Zero Trust) and Lambda workloads.
- Proficiency in Python, Bash, and Linux systems.
Nice to have
- Experience operating LLM or ML workloads in production (Bedrock, pgvector, LiteLLM).
- Experience building MCP servers or deploying agent frameworks such as LangGraph or CrewAI.
Culture & Benefits
- 100% of employee health premiums covered and 75-80% of dependent premiums.
- 401(k) plans to support retirement planning.
- Unlimited PTO and paid parental leave.
- $200/month co-working reimbursement and home office stipend ($500 setup + $100/month for utilities).
- Engineering culture that biases toward automation over toil and values blameless incident reviews.
Будьте осторожны: если работодатель просит войти в их систему, используя iCloud/Google, прислать код/пароль, запустить код/ПО, не делайте этого - это мошенники. Обязательно жмите "Пожаловаться" или пишите в поддержку. Подробнее в гайде →