HPC Solutions Architect (AI)
Описание вакансии
TL;DR
HPC Solutions Architect (AI): Design and tune next-generation GPU clusters for AI training, simulations, and data-heavy workloads, with an emphasis on hardware topologies, networking fabrics, and performance optimization. Focus on architecting multi-rack environments, automating GPU lifecycle management, and defining reference architectures for scalable HPC platforms.
Location: Remote from the US. Legal authorization to work in the U.S. full-time without visa sponsorship is required.
Compensation: $225,000–$315,000 OTE
Company
Publicly traded AI-centric cloud provider combining GPU clusters, high-speed networks, and cloud-native tooling for enterprises, startups, and research teams.
What you will do
- Architect and implement HPC clusters for AI, simulation, and distributed training using Kubernetes and schedulers like Slurm, considering node types, GPU topology, queues, and failure modes.
- Integrate NVIDIA Hopper/Blackwell GPUs with NVLink/NVSwitch and InfiniBand/RoCE to match workload communication patterns.
- Automate GPU and network lifecycle with GPU Operator and Network Operator to keep drivers, CUDA versions, and firmware consistent across fleets.
- Design cloud-native HPC environments that deliver low latency, high bandwidth, and predictable scheduling, while optimizing utilization and performance.
- Define and document reference architectures for AI model training, data pipelines, and MLOps, including observability and CI/CD.
- Collaborate with NVIDIA and partners on new hardware/software evaluation; benchmark, debug bottlenecks, and lead customer design sessions.
Requirements
- Bachelor’s or Master’s in Computer Science, Engineering, or a related field (PhD a plus).
- 3+ years building and running HPC or large GPU clusters (on-prem, cloud, or hybrid), with ownership of outcomes.
- Strong Linux, Kubernetes, container runtimes (containerd, CRI-O, Docker), CI/CD experience.
- HPC networking/RDMA: InfiniBand, RoCE, NVLink/NVSwitch; topology and fabric design.
- Storage/I/O for data-intensive workloads: Ceph, Lustre, NFS at scale, GPUDirect Storage.
- Terraform, Ansible, Helm, GitOps; scripting in Python/Bash.
- Clear communication for design reviews with engineers and stakeholders.
Nice to have
- NVIDIA ecosystem: GPU Operator, MIG, DCGM, NCCL, Nsight, CUDA management.
- MLflow, Kubeflow, NeMo; distributed training: PyTorch DDP, DeepSpeed, Megatron.
- Slurm, LSF, PBS on real clusters; multi-tenant GPU environments.
- Observability: Prometheus, DCGM Exporter, Grafana.
- Open-source contributions in HPC, CUDA, Kubernetes.
Culture & Benefits
- Engineering-driven culture: low bureaucracy, high ownership, focus on hard infrastructure problems.
- 100% employer-paid medical, dental, vision for family; 4% 401(k) match with immediate vesting; disability/life insurance.
- 20 weeks paid parental leave for primary caregivers, 12 weeks for secondary caregivers.
- Remote-first within the US, with a home office stipend (mobile + internet).
- Access to top hardware: H200, B200, GB200 GPUs, NVLink/NVSwitch, InfiniBand/RoCE clusters.
Hiring process
- HR screen.
- Hiring manager interview.
- Technical assignment/challenge.
- Leadership meeting, references, background check, offer.