5 days ago
HPC Platform Engineer (AI)
Job description
TL;DR
HPC Platform Engineer (AI): Own the on-prem GPU and HPC platform lifecycle, with an emphasis on provisioning, GPU orchestration, and cross-domain integration. The focus is on building a dependable, high-performance service by integrating scheduling, networking, storage, and compute to support real-world AI training and inference workloads.
Company
A technology company focused on building and maintaining high-performance computing and GPU infrastructure.
What you will do
- Manage bare-metal provisioning, OS imaging, and firmware/driver lifecycles.
- Orchestrate GPU workloads using Slurm or Kubernetes with the NVIDIA GPU Operator.
- Integrate scheduling, networking, storage, and compute layers into a coherent platform.
- Ensure platform availability and predictability for compute-intensive AI workloads.
- Drive operational excellence through automation, capacity planning, and incident response.
- Collaborate with network and storage engineers to optimize fabric design and I/O patterns.
Requirements
- 5+ years of experience operating production Linux infrastructure at scale in HPC or GPU environments.
- Strong proficiency in Linux fundamentals, including kernel/driver troubleshooting and performance debugging.
- Hands-on experience with bare-metal automation (PXE/iPXE, MAAS, Redfish), configuration management (Ansible), and infrastructure as code (Terraform).
- Deep knowledge of GPU operations, including CUDA, NVIDIA Container Toolkit, and DCGM telemetry.
- Experience with an HPC scheduler such as Slurm, or with Kubernetes-based GPU orchestration.
- Fluent English required for documentation and cross-team coordination.
Nice to have
- Experience with multi-tenant GPU-as-a-Service environments.
- Familiarity with hybrid Slurm and Kubernetes workflows.
- Low-level diagnostics skills (NUMA, PCIe topology, IRQ affinity).
- Contributions to open-source HPC or Kubernetes tooling.
Culture & Benefits
- Opportunity to work on high-performance, large-scale GPU infrastructure.
- Focus on operational excellence and reducing technical toil through automation.
- Collaborative environment working closely with specialized network and storage engineering teams.
- Emphasis on disciplined incident response and measurable platform improvements.