HPC Network Engineer (AI)

Job type

Full-time

Grade

middle/senior

English

Vacancy from Hirify Global, a list of international tech companies

Job description

TL;DR

HPC Network Engineer (AI): Design and optimize high-performance network fabrics for HPC and AI workloads, with an emphasis on low latency, RDMA, and congestion control. The role focuses on building scalable InfiniBand/RoCE architectures, automating fabric provisioning, and solving complex latency and microburst problems.

Company

hirify.global provides specialized high-performance computing and network operational services for demanding AI and compute workloads.

What you will do

Design and evolve InfiniBand and Ethernet/RoCE architectures, including spine-leaf and rail-optimized fabrics.
Configure and manage switches, HCAs, and fabric services to optimize latency and throughput.
Lead performance validation through benchmarking, packet analysis, and root-cause analysis of congestion.
Implement automation for provisioning and configuration drift detection using Python and Ansible.
Build observability stacks for fabric health monitoring tied to SLOs.
Collaborate with platform and workload teams to align network design with GPU cluster and storage needs.

Requirements

3–7+ years of network engineering experience in HPC, low-latency trading, or large-scale compute environments.
Strong expertise in InfiniBand, RDMA (verbs), and RoCE (v1/v2) implementation.
Proficiency in Linux networking fundamentals, including routing, VLANs, and MTU tuning.
Ability to automate operations using Python and/or Bash and Ansible.
Experience with NVIDIA/Mellanox utilities and firmware lifecycle management.
English: C1 (Fluent) required for technical documentation and cross-team coordination.

Nice to have

Experience with GPU clusters and NCCL/RDMA path awareness.
Knowledge of parallel storage networking (e.g., Lustre/GPFS).
Experience with Prometheus, Grafana, or ELK for fabric telemetry.

Culture & Benefits

Hands-on ownership of high-performance network fabric design and operational reliability.
Data-driven approach to improvements using telemetry and packet captures.
Collaborative environment working closely with HPC/cluster engineers and application teams.
