Назад
Company hidden
4 дня назад

HPC Network Engineer (InfiniBand)

Формат работы
remote
Тип работы
fulltime
Грейд
middle/senior
Английский
c1
Вакансия из списка Hirify.GlobalВакансия из Hirify Global, списка международных tech-компаний
Для мэтча и отклика нужен Plus

Мэтч & Сопровод

Для мэтча с этой вакансией нужен Plus

Описание вакансии

Текст:
/

TL;DR

HPC Network Engineer (InfiniBand/RoCE): Designing and optimizing high-performance network fabrics for HPC and AI workloads with an accent on low-latency, lossless behavior, and congestion control. Focus on RDMA performance engineering, fabric observability, and automating network provisioning for large-scale compute environments.

Location: Remote

Company

hirify.global provides specialized operations and design for high-performance computing environments.

What you will do

  • Design and evolve InfiniBand and Ethernet/RoCE architectures, including spine-leaf and rail-optimized fabrics.
  • Configure and operate switches, HCAs, and fabric services to meet strict workload requirements for latency and throughput.
  • Lead performance validation using benchmarking, packet/flow analysis, and microburst detection.
  • Drive reliability through redundancy planning, firmware lifecycle management, and post-incident root-cause analysis.
  • Build automation for provisioning and configuration drift detection using Python and Ansible.
  • Implement observability stacks with telemetry, logs, and alerting tied to fabric health SLOs.

Requirements

  • 3–7+ years of network engineering experience in HPC, low-latency trading, or large-scale compute environments.
  • Strong Linux networking fundamentals, including routing, bridging, VLANs, and MTU/jumbo frames.
  • Production experience designing and operating InfiniBand fabrics and subnet management.
  • Practical RDMA and RoCE (v1/v2) knowledge, including DCB/PFC/ECN configuration.
  • Proficiency with Python and/or Bash, and experience with Ansible for operational automation.
  • Fluent English for technical documentation, change plans, and cross-team coordination.

Nice to have

  • Experience with GPU clusters and performance-sensitive workloads (MPI tuning, NCCL/RDMA).
  • Familiarity with parallel storage networking patterns such as Lustre or GPFS.
  • Exposure to telemetry stacks like Prometheus, Grafana, ELK, or OpenSearch.

Culture & Benefits

  • Fully remote work arrangement.
  • Opportunity to work with cutting-edge AI and HPC infrastructure.
  • Environment focused on operational excellence and proactive system design.
  • Close collaboration with platform and workload engineering teams.

Будьте осторожны: если работодатель просит войти в их систему, используя iCloud/Google, прислать код/пароль, запустить код/ПО, не делайте этого - это мошенники. Обязательно жмите "Пожаловаться" или пишите в поддержку. Подробнее в гайде →