Company hidden
5 days ago

HPC Platform Engineer (AI)

Work format
on-site
Employment type
full-time
Grade
senior
English
C1
Vacancy from Hirify Global, a list of international tech companies

Job description

TL;DR

HPC Platform Engineer (AI): Own the lifecycle of the on-prem GPU and HPC platform, with an emphasis on provisioning, GPU orchestration, and cross-domain integration. The focus is on building a dependable, high-performance service by integrating scheduling, networking, storage, and compute to support real-world AI training and inference workloads.

Company

hirify.global is a technology company focused on building and maintaining high-performance computing and GPU infrastructure.

What you will do

  • Manage bare-metal provisioning, OS imaging, and firmware/driver lifecycles.
  • Orchestrate GPU workloads using either Kubernetes (with the NVIDIA GPU Operator) or Slurm.
  • Integrate scheduling, networking, storage, and compute layers into a coherent platform.
  • Ensure platform availability and predictability for compute-intensive AI workloads.
  • Drive operational excellence through automation, capacity planning, and incident response.
  • Collaborate with network and storage engineers to optimize fabric design and I/O patterns.

Requirements

  • 5+ years of experience operating production Linux infrastructure at scale in HPC or GPU environments.
  • Strong proficiency in Linux fundamentals, including kernel/driver troubleshooting and performance debugging.
  • Hands-on experience with bare-metal automation (PXE/iPXE, MAAS, Redfish) and with configuration management and infrastructure-as-code tooling (Ansible, Terraform).
  • Deep knowledge of GPU operations, including CUDA, NVIDIA Container Toolkit, and DCGM telemetry.
  • Experience with HPC schedulers such as Slurm, or with Kubernetes-based GPU orchestration.
  • Fluent English required for documentation and cross-team coordination.

Nice to have

  • Experience with multi-tenant GPU-as-a-Service environments.
  • Familiarity with hybrid Slurm and Kubernetes workflows.
  • Low-level diagnostics skills (NUMA, PCIe topology, IRQ affinity).
  • Contributions to open-source HPC or Kubernetes tooling.

Culture & Benefits

  • Opportunity to work on high-performance, large-scale GPU infrastructure.
  • Focus on operational excellence and reducing technical toil through automation.
  • Collaborative environment working closely with specialized network and storage engineering teams.
  • Emphasis on disciplined incident response and measurable platform improvements.
