Principal Deployment Engineer (AI Infrastructure)
Job description
TL;DR
Principal Deployment Engineer (AI Infrastructure): Lead hands-on bringup of GPU clusters in data center environments, with an emphasis on hardware integration, high-speed networking, and performance validation. The role focuses on building repeatable deployment processes, troubleshooting complex distributed systems, and ensuring production readiness for large-scale AI workloads.
Location: Must be based in the United States (Travel Required)
Company
The company is a startup building next-generation AI infrastructure, delivering performant and scalable GPU clusters for frontier AI training and inference.
What you will do
- Execute end-to-end bringup of GPU nodes and racks from installation to production readiness.
- Validate BIOS, BMC, firmware configurations, and GPU health.
- Configure and validate high-speed network fabrics including InfiniBand and RoCE.
- Perform cluster-wide burn-in, stress testing, and performance validation using NCCL and RDMA.
- Contribute to automation for provisioning and improve deployment playbooks.
- Coordinate with hardware vendors and cross-functional teams to resolve bringup issues.
Requirements
- Must be based in the United States and comfortable with travel.
- 7–8+ years in infrastructure engineering, hardware deployment, or data center operations.
- Hands-on experience deploying GPU servers such as HGX or DGX platforms.
- Strong knowledge of high-speed networking fabrics and Linux systems.
- Experience troubleshooting distributed systems performance issues.
Nice to have
- Experience in AI/ML infrastructure or HPC environments.
- Familiarity with NCCL, CUDA, and RDMA.
- Automation skills using Python, Ansible, Terraform, or Bash.
- Experience in high-density power and cooling environments.
Culture & Benefits
- Opportunity to build foundational AI infrastructure from zero to scale.
- Fast-paced startup environment with a bias toward action and ownership.
- Direct impact on the foundational technology powering frontier AI workloads.