Site Reliability Engineer (AI/ML, Kubernetes, Terraform)

$160,000–$220,000

Work format

Remote (USA only)

Employment type

Full-time

Grade

Senior

English

Country

Vacancy from Hirify Global, a list of international tech companies

Job description

TL;DR

Site Reliability Engineer (AI/ML, Kubernetes, Terraform): Build and operate a hybrid infrastructure foundation for advanced AI/ML research and product development, spanning AWS and bare-metal data centers, with an emphasis on Kubernetes, Terraform, and high-demand GPU workloads managed by schedulers such as Slurm. The role focuses on architecting robust, self-service environments, ensuring reproducibility, and optimizing platform performance, cost, and reliability.

Location: Remote (USA)

Salary: $160,000–$220,000

Company

hirify.global is a leading platform providing real-time APIs for speech-to-text (STT) and text-to-speech (TTS), powering voice AI solutions for over 1,300 organizations.

What you will do

Architect and maintain the core computing platform using Kubernetes on AWS and on-premises.
Develop and manage infrastructure using Infrastructure-as-Code (IaC) principles with Terraform.
Design, build, and optimize AI/ML job scheduling and orchestration systems, integrating Slurm with Kubernetes clusters.
Provision, manage, and maintain on-premises bare-metal server infrastructure for high-performance GPU computing.
Implement and manage the platform's networking (CNI, service mesh) and storage (CSI, S3) solutions.
Develop a comprehensive observability stack and create automation for operational tasks and incident response.

Requirements

5+ years of experience in Platform Engineering, DevOps, or Site Reliability Engineering (SRE).
Proven, hands-on experience building and managing production infrastructure with Terraform.
Expert-level knowledge of Kubernetes architecture and operations in a large-scale environment.
Experience with high-performance computing (HPC) job schedulers, specifically Slurm, for managing GPU-intensive AI workloads.
Experience managing bare-metal infrastructure, including server provisioning and lifecycle management.
Strong scripting and automation skills (e.g., Python, Go, Bash).

Nice to have

Experience with CI/CD systems (e.g., GitLab CI, Jenkins, ArgoCD) and building developer tooling.
Familiarity with FinOps principles and cloud cost optimization strategies.
Knowledge of Kubernetes networking (e.g., Calico, Cilium) and storage (e.g., Ceph, Rook) solutions.
Experience in a multi-region or hybrid cloud environment.

Culture & Benefits

Medical, dental, and vision benefits, annual wellness stipend, and mental health support.
Unlimited PTO, generous paid parental leave, flexible schedule, and 12 paid US company holidays.
401(k) plan with company match and tax savings programs.
Quarterly personal productivity stipend and one-time stipend for home office upgrades.
Learning and education stipend and participation in talks and conferences.
AI-first mindset with active use and experimentation with advanced AI tools in everyday work.
