TL;DR
Senior Deployment Engineer (AI): Leading the hands-on bringup of high-performance GPU clusters in data center environments, with an emphasis on hardware integration, high-speed fabric tuning, and performance validation. The role focuses on executing end-to-end node and rack deployments, troubleshooting complex distributed hardware issues, and building repeatable, scalable infrastructure processes.
Location: Must be based in the United States (onsite work and travel required)
Company
A startup building next-generation AI infrastructure and scalable GPU clusters for frontier AI workloads.
What you will do
- Execute end-to-end bringup of GPU nodes and racks from installation to production readiness.
- Validate BIOS, BMC, firmware configurations, and overall GPU cluster health.
- Configure and validate high-speed network fabrics including InfiniBand and RoCE.
- Perform cluster-wide burn-in, stress testing, and performance validation using NCCL and RDMA tests.
- Develop automation playbooks to transform ad-hoc deployments into repeatable, scalable systems.
- Collaborate with networking and hardware vendors to troubleshoot and resolve deployment issues.
Requirements
- Must have 5–8+ years of experience in infrastructure engineering or data center operations.
- Hands-on experience deploying GPU servers such as HGX or DGX platforms.
- Proficiency with high-speed networking fabrics including InfiniBand, RoCE, and Ethernet.
- Strong Linux systems knowledge and troubleshooting skills for distributed performance issues.
- Must be comfortable working onsite in data center environments.
- Must be authorized to work in the United States.
Nice to have
- Experience in AI/ML infrastructure or HPC environments.
- Familiarity with CUDA, NCCL, and RDMA protocols.
- Automation proficiency using Python, Ansible, Terraform, or Bash.
- Experience managing high-density power and cooling in data center environments.
Culture & Benefits
- Opportunity to work on foundational AI infrastructure at a fast-growing startup.
- High-impact role with significant ownership over infrastructure build-out.
- Focus on urgency, bias toward action, and engineering excellence.
- Direct collaboration with infrastructure and hardware teams.