Company hidden
Updated 1 day ago

Site Reliability Engineer (AI/ML, Kubernetes, Terraform)

$160,000 - $220,000
Work format
remote (USA only)
Employment type
full-time
Grade
senior
English
B2
Country
US
Listing from Hirify Global, a list of international tech companies

Job description

TL;DR

Site Reliability Engineer (AI/ML, Kubernetes, Terraform): Building and operating a hybrid infrastructure foundation for advanced AI/ML research and product development, spanning AWS and bare-metal data centers, with an emphasis on Kubernetes, Terraform, and high-demand GPU workloads managed by schedulers such as Slurm. The role focuses on architecting robust, self-service environments, ensuring reproducibility, and optimizing platform performance, cost, and reliability.

Location: Remote (USA)

Salary: $160,000–$220,000

Company

hirify.global is a leading platform providing real-time APIs for speech-to-text (STT) and text-to-speech (TTS), powering voice AI solutions for over 1,300 organizations.

What you will do

  • Architect and maintain the core computing platform using Kubernetes on AWS and on-premises.
  • Develop and manage infrastructure using Infrastructure-as-Code (IaC) principles with Terraform.
  • Design, build, and optimize AI/ML job scheduling and orchestration systems, integrating Slurm with Kubernetes clusters.
  • Provision, manage, and maintain on-premises bare-metal server infrastructure for high-performance GPU computing.
  • Implement and manage the platform's networking (CNI, service mesh) and storage (CSI, S3) solutions.
  • Develop a comprehensive observability stack and create automation for operational tasks and incident response.

Requirements

  • 5+ years of experience in Platform Engineering, DevOps, or Site Reliability Engineering (SRE).
  • Proven, hands-on experience building and managing production infrastructure with Terraform.
  • Expert-level knowledge of Kubernetes architecture and operations in a large-scale environment.
  • Experience with high-performance compute (HPC) job schedulers, specifically Slurm, for managing GPU-intensive AI workloads.
  • Experience managing bare metal infrastructure, including server provisioning and lifecycle management.
  • Strong scripting and automation skills (e.g., Python, Go, Bash).

Nice to have

  • Experience with CI/CD systems (e.g., GitLab CI, Jenkins, ArgoCD) and building developer tooling.
  • Familiarity with FinOps principles and cloud cost optimization strategies.
  • Knowledge of Kubernetes networking (e.g., Calico, Cilium) and storage (e.g., Ceph, Rook) solutions.
  • Experience in a multi-region or hybrid cloud environment.

Culture & Benefits

  • Medical, dental, and vision benefits, annual wellness stipend, and mental health support.
  • Unlimited PTO, generous paid parental leave, flexible schedule, and 12 paid US company holidays.
  • 401(k) plan with company match and tax savings programs.
  • Quarterly personal productivity stipend and one-time stipend for home office upgrades.
  • Learning / Education stipend and participation in talks and conferences.
  • AI-first mindset with active use and experimentation with advanced AI tools in everyday work.

Job text reproduced without changes
