Updated 6 days ago
Senior Software Engineer (AI)
Job description
TL;DR
Senior Software Engineer (AI): Build and automate large-scale GPU cluster provisioning and operations across bare metal, Kubernetes, and Slurm environments, with an emphasis on platform scalability and reliability. The role focuses on developing Kubernetes Operators, gRPC/REST APIs in Go, and end-to-end infrastructure automation pipelines.
Location: On-site in Las Vegas, Nevada
Company
The company is a cloud platform provider delivering seamless, secure, and resilient AI compute at scale.
What you will do
- Build and maintain fully automated pipelines for provisioning bare metal GPU clusters from zero to production.
- Automate Slurm and Kubernetes cluster lifecycle, including bootstrapping, upgrades, and decommissioning at scale.
- Develop infrastructure for GPU node configuration, including drivers and firmware.
- Own cluster validation pipelines, automating health checks and GPU burn-in tests.
- Build day-2 operations automation, including node remediation and rolling upgrades.
- Own the full observability stack for automation services and cluster health systems.
Requirements
- 5+ years in infrastructure or platform engineering.
- 3+ years of experience writing production Go.
- Deep understanding of Kubernetes internals (Informers, Controller-runtime, client-go, CRDs, Operators, and Admission webhooks).
- Experience building production-scale gRPC and REST APIs in Go.
- Familiarity with bare metal infrastructure concepts (PXE, IPMI, BMC).
- Authorization to work in the United States is required.
Nice to have
- Knowledge of GPU workload infrastructure and RoCE networking automation.
- Experience with GitOps tools like ArgoCD.
- Experience with CI/CD tools such as GitHub Actions and Argo Workflows.
- Experience with Ansible and Terraform.
Culture & Benefits
- Stock options and competitive equity.
- 100% paid Medical, Dental, and Vision insurance for employees.
- Company contributions to Health Savings Account (HSA).
- 401(k) plan, plus comprehensive disability and life insurance.
- Flexible PTO and paid holidays.
- Parental leave and Employee Assistance Program.