Назад
Company hidden
4 часа назад

HPC Solutions Architect (AI)

225 000 - 315 000$
Формат работы
remote (только USA)
Тип работы
fulltime
Грейд
senior
Английский
b2
Страна
US
Вакансия из списка Hirify.GlobalВакансия из Hirify Global, списка международных tech-компаний
Для мэтча и отклика нужен Plus

Мэтч & Сопровод

Для мэтча с этой вакансией нужен Plus

Описание вакансии

Текст:
/

TL;DR

HPC Solutions Architect (AI): Design and tune next-generation GPU clusters for AI training, simulations, and data-heavy workloads with an accent on hardware topologies, networking fabrics, and performance optimization. Focus on architecting multi-rack environments, automating GPU lifecycle management, and defining reference architectures for scalable HPC platforms.

Location: Remote from the US. Legal authorization to work in the U.S. on a full-time basis without visa sponsorship required.

Compensation: $225,000–$315,000 OTE

Company

Publicly traded AI-centric cloud provider combining GPU clusters, high-speed networks, and cloud-native tooling for enterprises, startups, and research teams.

What you will do

  • Architect and implement HPC clusters for AI, simulation, and distributed training using Kubernetes and schedulers like Slurm, considering node types, GPU topology, queues, and failure modes.
  • Integrate NVIDIA Hopper/Blackwell GPUs with NVLink/NVSwitch and InfiniBand/RoCE to match workload communication patterns.
  • Automate GPU and network lifecycle with GPU Operator and Network Operator for consistent drivers, CUDA, firmware across fleets.
  • Design cloud-native HPC environments delivering low latency, high bandwidth, and predictable scheduling, optimizing utilization and performance.
  • Define and document reference architectures for AI model training, data pipelines, MLOps, including observability and CI/CD.
  • Collaborate with NVIDIA and partners on new hardware/software evaluation; benchmark, debug bottlenecks, and lead customer design sessions.

Requirements

  • Bachelor’s or Master’s in Computer Science, Engineering, or related (PhD a plus).
  • 3+ years building/running HPC or large GPU clusters (on-prem, cloud, hybrid), owning outcomes.
  • Strong Linux, Kubernetes, container runtimes (containerd, CRI-O, Docker), CI/CD experience.
  • HPC networking/RDMA: InfiniBand, RoCE, NVLink/NVSwitch; topology and fabric design.
  • Storage/I/O for big workloads: Ceph, Lustre, NFS at scale, GPUDirect Storage.
  • Terraform, Ansible, Helm, GitOps; scripting in Python/Bash.
  • Clear communication for design reviews with engineers and stakeholders.

Nice to have

  • NVIDIA ecosystem: GPU Operator, MIG, DCGM, NCCL, Nsight, CUDA management.
  • MLflow, Kubeflow, NeMo; distributed training: PyTorch DDP, DeepSpeed, Megatron.
  • Slurm, LSF, PBS on real clusters; multi-tenant GPU environments.
  • Observability: Prometheus, DCGM Exporter, Grafana.
  • Open-source contributions in HPC, CUDA, Kubernetes.

Culture & Benefits

  • Engineering-driven culture: low bureaucracy, high ownership, focus on hard infrastructure problems.
  • 100% employer-paid medical, dental, vision for family; 4% 401(k) match with immediate vesting; disability/life insurance.
  • 20 weeks paid parental leave for primary, 12 weeks for secondary caregivers.
  • Remote-first within US with home office stipend (mobile + internet).
  • Access to top hardware: H200, B200, GB200 GPUs, NVLink/NVSwitch, InfiniBand/RoCE clusters.

Hiring process

  • HR screen.
  • Hiring manager interview.
  • Technical assignment/challenge.
  • Leadership meeting, references, background check, offer.

Будьте осторожны: если работодатель просит войти в их систему, используя iCloud/Google, прислать код/пароль, запустить код/ПО, не делайте этого - это мошенники. Обязательно жмите "Пожаловаться" или пишите в поддержку. Подробнее в гайде →