Company hidden
17 hours ago

HPC Network Engineer (AI)

Job type
Full-time
Grade
Middle/Senior
English
C1

Job description

TL;DR

HPC Network Engineer (AI): design and optimize high-performance network fabrics for HPC and AI workloads, with an emphasis on low latency, RDMA, and congestion control. The role focuses on building scalable InfiniBand/RoCE architectures, automating fabric provisioning, and solving complex latency and microburst problems.

Company

hirify.global provides specialized high-performance computing and network operational services for demanding AI and compute workloads.

What you will do

  • Design and evolve InfiniBand and Ethernet/RoCE architectures, including spine-leaf and rail-optimized fabrics.
  • Configure and manage switches, HCAs, and fabric services to optimize latency and throughput.
  • Lead performance validation through benchmarking, packet analysis, and root-cause analysis of congestion.
  • Implement automation for provisioning and configuration drift detection using Python and Ansible.
  • Build observability stacks for fabric health monitoring tied to SLOs.
  • Collaborate with platform and workload teams to align network design with GPU cluster and storage needs.

Requirements

  • 3–7+ years of network engineering experience in HPC, low-latency trading, or large-scale compute environments.
  • Strong expertise in InfiniBand, RDMA (verbs), and RoCE (v1/v2) implementation.
  • Proficiency in Linux networking fundamentals, including routing, VLANs, and MTU tuning.
  • Ability to automate operations using Python and/or Bash, along with Ansible.
  • Experience with NVIDIA/Mellanox utilities and firmware lifecycle management.
  • English: C1 (Fluent) required for technical documentation and cross-team coordination.

Nice to have

  • Experience with GPU clusters and NCCL/RDMA path awareness.
  • Knowledge of parallel storage networking (e.g., Lustre/GPFS).
  • Experience with Prometheus, Grafana, or ELK for fabric telemetry.

Culture & Benefits

  • Hands-on ownership of high-performance network fabric design and operational reliability.
  • Data-driven approach to improvements using telemetry and packet captures.
  • Collaborative environment working closely with HPC/cluster engineers and application teams.
