TL;DR
Site Reliability Engineer (AI & ML Infrastructure): Build and operate a hybrid infrastructure platform spanning AWS and bare metal data centers for AI/ML research and product development with an accent on Kubernetes, Terraform, and GPU workload orchestration. Focus on architecting scalable, automated, and high-performance environments integrating Slurm and managing complex hybrid cloud infrastructure.
Location: Remote within the United States only
Salary: $150,000–$220,000
Company
hirify.global is a leading AI-driven voice platform powering real-time speech-to-text and voice agent solutions, serving over 1,300 organizations globally with advanced voice-native foundation models.
What you will do
- Architect and maintain Kubernetes-based computing platforms on AWS and on-premise bare metal infrastructure.
- Develop and manage infrastructure using Terraform following Infrastructure-as-Code principles.
- Design and optimize AI/ML job scheduling with Slurm integrated into Kubernetes clusters for GPU resource management.
- Provision and manage high-performance GPU bare metal servers and associated networking and storage solutions.
- Build observability stacks for monitoring, logging, and automation of operational tasks and incident response.
- Collaborate with AI researchers and ML engineers to build tools and workflows that accelerate development cycles.
Requirements
- Must have 5+ years experience in Platform Engineering, DevOps, or Site Reliability Engineering.
- Expertise in Kubernetes architecture and operations at scale.
- Proven hands-on experience with Terraform and Infrastructure-as-Code.
- Experience with HPC job schedulers like Slurm for GPU-intensive AI workloads.
- Experience managing bare metal infrastructure including provisioning and lifecycle management.
- Strong scripting and automation skills in Python, Go, or Bash.
Nice to have
- Experience with CI/CD systems such as GitLab CI, Jenkins, or ArgoCD.
- Familiarity with FinOps and cloud cost optimization strategies.
- Knowledge of Kubernetes networking and storage solutions like Calico, Cilium, Ceph, or Rook.
- Experience in multi-region or hybrid cloud environments.
Culture & Benefits
- Comprehensive medical, dental, vision, and mental health benefits.
- Unlimited PTO, paid parental leave, flexible schedules, and US company holidays.
- 401(k) plan with company match and tax savings programs.
- Learning and education stipends, participation in talks, conferences, and AI enablement workshops.
- Supportive, collaborative culture with a focus on AI-first mindset and continuous learning.
Будьте осторожны: если работодатель просит войти в их систему, используя iCloud/Google, прислать код/пароль, запустить код/ПО, не делайте этого - это мошенники. Обязательно жмите "Пожаловаться" или пишите в поддержку. Подробнее в гайде →