4 days ago
HPC Network Engineer (InfiniBand)
Job description
TL;DR
HPC Network Engineer (InfiniBand/RoCE): Designing and optimizing high-performance network fabrics for HPC and AI workloads, with an emphasis on low latency, lossless behavior, and congestion control. Focus on RDMA performance engineering, fabric observability, and automating network provisioning for large-scale compute environments.
Location: Remote
Company
The company provides specialized operations and design services for high-performance computing environments.
What you will do
- Design and evolve InfiniBand and Ethernet/RoCE architectures, including spine-leaf and rail-optimized fabrics.
- Configure and operate switches, HCAs, and fabric services to meet strict workload requirements for latency and throughput.
- Lead performance validation using benchmarking, packet/flow analysis, and microburst detection.
- Drive reliability through redundancy planning, firmware lifecycle management, and post-incident root-cause analysis.
- Build automation for provisioning and configuration drift detection using Python and Ansible.
- Implement observability stacks with telemetry, logs, and alerting tied to fabric health SLOs.
Requirements
- 3–7+ years of network engineering experience in HPC, low-latency trading, or large-scale compute environments.
- Strong Linux networking fundamentals, including routing, bridging, VLANs, and MTU/jumbo frames.
- Production experience designing and operating InfiniBand fabrics and subnet management.
- Practical RDMA and RoCE (v1/v2) knowledge, including DCB/PFC/ECN configuration.
- Proficiency with Python and/or Bash, and experience with Ansible for operational automation.
- Fluent English for technical documentation, change plans, and cross-team coordination.
Nice to have
- Experience with GPU clusters and performance-sensitive workloads (MPI tuning, NCCL/RDMA).
- Familiarity with parallel storage networking patterns such as Lustre or GPFS.
- Exposure to telemetry stacks like Prometheus, Grafana, ELK, or OpenSearch.
Culture & Benefits
- Fully remote work arrangement.
- Opportunity to work with cutting-edge AI and HPC infrastructure.
- Environment focused on operational excellence and proactive system design.
- Close collaboration with platform and workload engineering teams.