TL;DR
Site Reliability Engineer (AI/ML, Kubernetes, Terraform): Build and operate a hybrid infrastructure foundation for advanced AI/ML research and product development, spanning AWS and bare-metal data centers, with an emphasis on Kubernetes, Terraform, and high-demand GPU workloads run through schedulers such as Slurm. The role focuses on architecting robust, self-service environments, ensuring reproducibility, and optimizing platform performance, cost, and reliability.
Location: Remote (USA)
Salary: $160,000–$220,000
Company
hirify.global is a leading platform providing real-time APIs for speech-to-text (STT) and text-to-speech (TTS), powering voice AI solutions for over 1,300 organizations.
What you will do
- Architect and maintain the core computing platform using Kubernetes on AWS and on-premises.
- Develop and manage infrastructure using Infrastructure-as-Code (IaC) principles with Terraform.
- Design, build, and optimize AI/ML job scheduling and orchestration systems, integrating Slurm with Kubernetes clusters.
- Provision, manage, and maintain on-premises bare-metal server infrastructure for high-performance GPU computing.
- Implement and manage the platform's networking (CNI, service mesh) and storage (CSI, S3) solutions.
- Develop a comprehensive observability stack and create automation for operational tasks and incident response.
Requirements
- 5+ years of experience in Platform Engineering, DevOps, or Site Reliability Engineering (SRE).
- Proven, hands-on experience building and managing production infrastructure with Terraform.
- Expert-level knowledge of Kubernetes architecture and operations in a large-scale environment.
- Experience with high-performance computing (HPC) job schedulers, specifically Slurm, for managing GPU-intensive AI workloads.
- Experience managing bare-metal infrastructure, including server provisioning and lifecycle management.
- Strong scripting and automation skills (e.g., Python, Go, Bash).
Nice to have
- Experience with CI/CD systems (e.g., GitLab CI, Jenkins, ArgoCD) and building developer tooling.
- Familiarity with FinOps principles and cloud cost optimization strategies.
- Knowledge of Kubernetes networking (e.g., Calico, Cilium) and storage (e.g., Ceph, Rook) solutions.
- Experience in a multi-region or hybrid cloud environment.
Culture & Benefits
- Medical, dental, and vision benefits, annual wellness stipend, and mental health support.
- Unlimited PTO, generous paid parental leave, flexible schedule, and 12 paid US company holidays.
- 401(k) plan with company match and tax savings programs.
- Quarterly personal productivity stipend and one-time stipend for home office upgrades.
- Learning / Education stipend and participation in talks and conferences.
- AI-first mindset, with active use of and experimentation with advanced AI tools in everyday work.