Senior Principal Network Engineer (AI Infrastructure)
Job description
TL;DR
Senior Principal Network Engineer (AI Infrastructure): Designing and evolving large-scale InfiniBand and RoCE fabric architectures to support high-performance GPU clusters, with an emphasis on reliability, scalability, and long-term evolution. Focus on resolving complex network incidents, improving fabric performance predictability, and defining hardware configuration standards for AI interconnects.
Location: Must be based in the US
Company
The company is a GPU cloud provider engineered for AI, offering high-performance infrastructure for AI start-ups and large enterprise customers.
What you will do
- Own the technical direction and operational strategy for AI interconnect networks.
- Design and evolve large-scale InfiniBand and RoCE fabric architectures to support growth.
- Act as the senior escalation point for complex network incidents and drive systemic fixes.
- Drive cross-team initiatives to improve fabric reliability and performance predictability.
- Define standards for hardware configuration, routing, congestion control, and firmware management.
- Mentor senior and mid-level network engineers to raise operational rigor.
Requirements
- 12+ years of experience in network engineering with a focus on HPC, AI, or hyperscale data centers.
- Expert-level operational and architectural experience with InfiniBand and/or large-scale RoCE fabrics.
- Deep understanding of RDMA internals, congestion management, and fabric-level failure modes.
- Strong expertise in modern data center routing and control planes (BGP, OSPF, ECMP).
- Ability to debug cross-layer issues spanning hardware, firmware, kernel, and application libraries.
- Must be based in the US.
Nice to have
- Extensive experience with NVIDIA/Mellanox networking platforms in production AI/HPC environments.
- Familiarity with distributed training frameworks and GPU communication patterns.
- Experience designing network observability systems for high-throughput environments.
Culture & Benefits
- Highly competitive package including base salary and equity.
- Performance reviews conducted every 12 months.
- Workplace autonomy and a human-first approach to flexibility.
- Remote-first team environment with a culture of innovation and ownership.
- Dynamic progression plan tailored to individual ambitions.