Posted 17 hours ago
HPC Network Engineer (AI)
Job description
TL;DR
HPC Network Engineer (AI): Design and optimize high-performance network fabrics for HPC and AI workloads, with an emphasis on low latency, RDMA, and congestion control. The role focuses on building scalable InfiniBand/RoCE architectures, automating fabric provisioning, and solving complex latency and microburst problems.
Company
The company provides specialized high-performance computing and network operations services for demanding AI and compute workloads.
What you will do
- Design and evolve InfiniBand and Ethernet/RoCE architectures, including spine-leaf and rail-optimized fabrics.
- Configure and manage switches, HCAs, and fabric services to optimize latency and throughput.
- Lead performance validation through benchmarking, packet analysis, and root-cause analysis of congestion.
- Implement automation for provisioning and configuration drift detection using Python and Ansible.
- Build observability stacks for fabric health monitoring tied to SLOs.
- Collaborate with platform and workload teams to align network design with GPU cluster and storage needs.
Requirements
- 3–7+ years of network engineering experience in HPC, low-latency trading, or large-scale compute environments.
- Strong expertise in InfiniBand, RDMA (verbs), and RoCE (v1/v2) implementation.
- Proficiency in Linux networking fundamentals, including routing, VLANs, and MTU tuning.
- Ability to automate operations using Python and/or Bash, together with Ansible.
- Experience with NVIDIA/Mellanox utilities and firmware lifecycle management.
- English: C1 (Fluent) required for technical documentation and cross-team coordination.
Nice to have
- Experience with GPU clusters and NCCL/RDMA path awareness.
- Knowledge of parallel storage networking (e.g., Lustre/GPFS).
- Experience with Prometheus, Grafana, or ELK for fabric telemetry.
Culture & Benefits
- Hands-on ownership of high-performance network fabric design and operational reliability.
- Data-driven approach to improvements using telemetry and packet captures.
- Collaborative environment working closely with HPC/cluster engineers and application teams.