HPC Solutions Architect (AI)
Описание вакансии
TL;DR
HPC Solutions Architect (AI): Design and tune next-generation GPU clusters for AI training, simulations, and data-heavy workloads, with an emphasis on hardware topologies, networking fabrics, and performance optimization. Focus on architecting multi-rack environments, automating GPU lifecycle management, and defining reference architectures for scalable HPC platforms.
Location: Remote from the US. Legal authorization to work in the U.S. full-time without visa sponsorship is required.
Compensation: $225,000–$315,000 OTE
Company
Publicly traded AI-centric cloud provider combining GPU clusters, high-speed networks, and cloud-native tooling for enterprises, startups, and research teams.
What you will do
- Architect and implement HPC clusters for AI, simulation, and distributed training using Kubernetes and schedulers like Slurm, considering node types, GPU topology, queues, and failure modes.
- Integrate NVIDIA Hopper/Blackwell GPUs with NVLink/NVSwitch and InfiniBand/RoCE to match workload communication patterns.
- Automate GPU and network lifecycle with GPU Operator and Network Operator to keep drivers, CUDA versions, and firmware consistent across fleets.
- Design cloud-native HPC environments that deliver low latency, high bandwidth, and predictable scheduling, while optimizing utilization and performance.
- Define and document reference architectures for AI model training, data pipelines, and MLOps, including observability and CI/CD.
- Collaborate with NVIDIA and partners on new hardware/software evaluation; benchmark, debug bottlenecks, and lead customer design sessions.
Requirements
- Bachelor’s or Master’s in Computer Science, Engineering, or a related field (PhD a plus).
- 3+ years building and running HPC or large GPU clusters (on-prem, cloud, or hybrid), with ownership of outcomes.
- Strong Linux, Kubernetes, container runtimes (containerd, CRI-O, Docker), CI/CD experience.
- HPC networking/RDMA: InfiniBand, RoCE, NVLink/NVSwitch; topology and fabric design.
- Storage/I/O for data-intensive workloads: Ceph, Lustre, NFS at scale, GPUDirect Storage.
- Terraform, Ansible, Helm, GitOps; scripting in Python/Bash.
- Clear communication for design reviews with engineers and stakeholders.
Nice to have
- NVIDIA ecosystem: GPU Operator, MIG, DCGM, NCCL, Nsight, CUDA management.
- MLflow, Kubeflow, NeMo; distributed training: PyTorch DDP, DeepSpeed, Megatron.
- Slurm, LSF, PBS on real clusters; multi-tenant GPU environments.
- Observability: Prometheus, DCGM Exporter, Grafana.
- Open-source contributions in HPC, CUDA, Kubernetes.
Culture & Benefits
- Engineering-driven culture: low bureaucracy, high ownership, focus on hard infrastructure problems.
- 100% employer-paid medical, dental, vision for family; 4% 401(k) match with immediate vesting; disability/life insurance.
- 20 weeks paid parental leave for primary caregivers, 12 weeks for secondary caregivers.
- Remote-first within the US, with a home office stipend (mobile + internet).
- Access to top hardware: H200, B200, GB200 GPUs, NVLink/NVSwitch, InfiniBand/RoCE clusters.
Hiring process
- HR screen.
- Hiring manager interview.
- Technical assignment/challenge.
- Leadership meeting, references, background check, offer.