Site Reliability Engineer (AI & ML Infrastructure)

$150,000–$220,000

Work format

remote (USA only)

Employment type

full-time

Grade

senior

English

Country

This vacancy is from Hirify Global, a list of international tech companies

Job description

TL;DR

Site Reliability Engineer (AI & ML Infrastructure): Build and operate a hybrid infrastructure platform spanning AWS and bare metal data centers for AI/ML research and product development, with an emphasis on Kubernetes, Terraform, and GPU workload orchestration. The role focuses on architecting scalable, automated, high-performance environments, integrating Slurm, and managing complex hybrid cloud infrastructure.

Location: Remote within the United States only

Salary: $150,000–$220,000

Company

hirify.global is a leading AI-driven voice platform powering real-time speech-to-text and voice agent solutions, serving over 1,300 organizations globally with advanced voice-native foundation models.

What you will do

Architect and maintain Kubernetes-based computing platforms on AWS and on-premise bare metal infrastructure.
Develop and manage infrastructure using Terraform following Infrastructure-as-Code principles.
Design and optimize AI/ML job scheduling with Slurm integrated into Kubernetes clusters for GPU resource management.
Provision and manage high-performance GPU bare metal servers and associated networking and storage solutions.
Build observability stacks for monitoring, logging, and automation of operational tasks and incident response.
Collaborate with AI researchers and ML engineers to build tools and workflows that accelerate development cycles.

Requirements

5+ years of experience in Platform Engineering, DevOps, or Site Reliability Engineering.
Expertise in Kubernetes architecture and operations at scale.
Proven hands-on experience with Terraform and Infrastructure-as-Code.
Experience with HPC job schedulers like Slurm for GPU-intensive AI workloads.
Experience managing bare metal infrastructure including provisioning and lifecycle management.
Strong scripting and automation skills in Python, Go, or Bash.

Nice to have

Experience with CI/CD systems such as GitLab CI, Jenkins, or ArgoCD.
Familiarity with FinOps and cloud cost optimization strategies.
Knowledge of Kubernetes networking and storage solutions like Calico, Cilium, Ceph, or Rook.
Experience in multi-region or hybrid cloud environments.

Culture & Benefits

Comprehensive medical, dental, vision, and mental health benefits.
Unlimited PTO, paid parental leave, flexible schedules, and US company holidays.
401(k) plan with company match and tax savings programs.
Learning and education stipends, participation in talks, conferences, and AI enablement workshops.
Supportive, collaborative culture with a focus on an AI-first mindset and continuous learning.
