HPC Platform Engineer (AI)

Work format

onsite

Employment type

fulltime

Grade

senior

English

Vacancy from Hirify Global, a list of international tech companies

Job description

TL;DR

HPC Platform Engineer (AI): Own the on-prem GPU and HPC platform lifecycle, with an emphasis on provisioning, GPU orchestration, and cross-domain integration. The focus is on building a dependable, high-performance service by integrating scheduling, networking, storage, and compute to support real-world AI training and inference workloads.

Company

hirify.global is a technology company focused on building and maintaining high-performance computing and GPU infrastructure.

What you will do

Manage bare-metal provisioning, OS imaging, and firmware/driver lifecycles.
Orchestrate GPU workloads using Slurm or Kubernetes with the NVIDIA GPU Operator.
Integrate scheduling, networking, storage, and compute layers into a coherent platform.
Ensure platform availability and predictability for compute-intensive AI workloads.
Drive operational excellence through automation, capacity planning, and incident response.
Collaborate with network and storage engineers to optimize fabric design and I/O patterns.

Requirements

5+ years of experience operating production Linux infrastructure at scale in HPC or GPU environments.
Strong proficiency in Linux fundamentals, including kernel/driver troubleshooting and performance debugging.
Hands-on experience with bare-metal automation (PXE/iPXE, MAAS, Redfish), configuration management (Ansible), and infrastructure as code (Terraform).
Deep knowledge of GPU operations, including CUDA, NVIDIA Container Toolkit, and DCGM telemetry.
Experience with HPC schedulers such as Slurm, or with Kubernetes-based GPU orchestration.
Fluent English required for documentation and cross-team coordination.

Nice to have

Experience with multi-tenant GPU-as-a-Service environments.
Familiarity with hybrid Slurm and Kubernetes workflows.
Low-level diagnostic skills (NUMA, PCIe topology, IRQ affinity).
Contributions to open-source HPC or Kubernetes tooling.

Culture & Benefits

Opportunity to work on high-performance, large-scale GPU infrastructure.
Focus on operational excellence and reducing technical toil through automation.
Collaborative environment working closely with specialized network and storage engineering teams.
Emphasis on disciplined incident response and measurable platform improvements.