Company hidden
1 day ago

Site Reliability Engineer (AI & ML Infrastructure)

$150,000–$220,000
Work format
remote (USA only)
Employment type
full-time
Grade
senior
English
B2
Country
US

Job description

TL;DR

Site Reliability Engineer (AI & ML Infrastructure): Build and operate a hybrid infrastructure platform spanning AWS and bare metal data centers for AI/ML research and product development, with an emphasis on Kubernetes, Terraform, and GPU workload orchestration. Focus on architecting scalable, automated, high-performance environments that integrate Slurm, and on managing complex hybrid cloud infrastructure.

Location: Remote within the United States only

Salary: $150,000–$220,000

Company

hirify.global is a leading AI-driven voice platform powering real-time speech-to-text and voice agent solutions, serving over 1,300 organizations globally with advanced voice-native foundation models.

What you will do

  • Architect and maintain Kubernetes-based computing platforms on AWS and on-premises bare metal infrastructure.
  • Develop and manage infrastructure using Terraform following Infrastructure-as-Code principles.
  • Design and optimize AI/ML job scheduling with Slurm integrated into Kubernetes clusters for GPU resource management.
  • Provision and manage high-performance GPU bare metal servers and associated networking and storage solutions.
  • Build observability stacks for monitoring, logging, and automation of operational tasks and incident response.
  • Collaborate with AI researchers and ML engineers to build tools and workflows that accelerate development cycles.

Requirements

  • 5+ years of experience in Platform Engineering, DevOps, or Site Reliability Engineering.
  • Expertise in Kubernetes architecture and operations at scale.
  • Proven hands-on experience with Terraform and Infrastructure-as-Code.
  • Experience with HPC job schedulers such as Slurm for GPU-intensive AI workloads.
  • Experience managing bare metal infrastructure including provisioning and lifecycle management.
  • Strong scripting and automation skills in Python, Go, or Bash.

Nice to have

  • Experience with CI/CD systems such as GitLab CI, Jenkins, or ArgoCD.
  • Familiarity with FinOps and cloud cost optimization strategies.
  • Knowledge of Kubernetes networking and storage solutions such as Calico, Cilium, Ceph, or Rook.
  • Experience in multi-region or hybrid cloud environments.

Culture & Benefits

  • Comprehensive medical, dental, vision, and mental health benefits.
  • Unlimited PTO, paid parental leave, flexible schedules, and US company holidays.
  • 401(k) plan with company match and tax savings programs.
  • Learning and education stipends, participation in talks, conferences, and AI enablement workshops.
  • Supportive, collaborative culture with a focus on an AI-first mindset and continuous learning.
