HPC Network Engineer (InfiniBand)

Формат работы

remote

Тип работы

fulltime

Грейд

middle/senior

Английский

Вакансия из Hirify Global, списка международных tech-компаний
Для мэтча и отклика нужен Plus

Мэтч & Сопровод

Для мэтча с этой вакансией нужен Plus

Описание вакансии

Текст:

TL;DR

HPC Network Engineer (InfiniBand/RoCE): Designing and optimizing high-performance network fabrics for HPC and AI workloads with an accent on low-latency, lossless behavior, and congestion control. Focus on RDMA performance engineering, fabric observability, and automating network provisioning for large-scale compute environments.

Location: Remote

Company

hirify.global provides specialized operations and design for high-performance computing environments.

What you will do

Design and evolve InfiniBand and Ethernet/RoCE architectures, including spine-leaf and rail-optimized fabrics.
Configure and operate switches, HCAs, and fabric services to meet strict workload requirements for latency and throughput.
Lead performance validation using benchmarking, packet/flow analysis, and microburst detection.
Drive reliability through redundancy planning, firmware lifecycle management, and post-incident root-cause analysis.
Build automation for provisioning and configuration drift detection using Python and Ansible.
Implement observability stacks with telemetry, logs, and alerting tied to fabric health SLOs.

Requirements

3–7+ years of network engineering experience in HPC, low-latency trading, or large-scale compute environments.
Strong Linux networking fundamentals, including routing, bridging, VLANs, and MTU/jumbo frames.
Production experience designing and operating InfiniBand fabrics and subnet management.
Practical RDMA and RoCE (v1/v2) knowledge, including DCB/PFC/ECN configuration.
Proficiency with Python and/or Bash, and experience with Ansible for operational automation.
Fluent English for technical documentation, change plans, and cross-team coordination.

Nice to have

Experience with GPU clusters and performance-sensitive workloads (MPI tuning, NCCL/RDMA).
Familiarity with parallel storage networking patterns such as Lustre or GPFS.
Exposure to telemetry stacks like Prometheus, Grafana, ELK, or OpenSearch.

Culture & Benefits

Fully remote work arrangement.
Opportunity to work with cutting-edge AI and HPC infrastructure.
Environment focused on operational excellence and proactive system design.
Close collaboration with platform and workload engineering teams.

Будьте осторожны: если работодатель просит войти в их систему, используя iCloud/Google, прислать код/пароль, запустить код/ПО, не делайте этого - это мошенники. Обязательно жмите "Пожаловаться" или пишите в поддержку. Подробнее в гайде →