Senior Principal Network Engineer (AI Infrastructure)

Work format

remote (USA only)

Employment type

full-time

Grade

senior

English

Country

Vacancy from Hirify Global, a list of international tech companies

Job description

Text:

TL;DR

Senior Principal Network Engineer (AI Infrastructure): Designing and evolving large-scale InfiniBand and RoCE fabric architectures to support high-performance GPU clusters, with an emphasis on reliability, scalability, and long-term evolution. Focus on solving complex network incidents, improving fabric performance predictability, and defining hardware configuration standards for AI interconnects.

Location: Must be based in the US

Company

hirify.global is a GPU cloud provider engineered for AI, offering high-performance infrastructure for AI start-ups and large enterprise customers.

What you will do

Own the technical direction and operational strategy for AI interconnect networks.
Design and evolve large-scale InfiniBand and RoCE fabric architectures to support growth.
Act as the senior escalation point for complex network incidents and drive systemic fixes.
Drive cross-team initiatives to improve fabric reliability and performance predictability.
Define standards for hardware configuration, routing, congestion control, and firmware management.
Mentor senior and mid-level network engineers to raise operational rigor.

Requirements

12+ years of experience in network engineering with a focus on HPC, AI, or hyperscale data centers.
Expert-level operational and architectural experience with InfiniBand and/or large-scale RoCE fabrics.
Deep understanding of RDMA internals, congestion management, and fabric-level failure modes.
Strong expertise in modern data center routing and control planes (BGP, OSPF, ECMP).
Ability to debug cross-layer issues spanning hardware, firmware, kernel, and application libraries.
Must be based in the US.

Nice to have

Extensive experience with NVIDIA/Mellanox networking platforms in production AI/HPC environments.
Familiarity with distributed training frameworks and GPU communication patterns.
Experience designing network observability systems for high-throughput environments.

Culture & Benefits

Highly competitive package including base salary and equity.
Performance reviews conducted every 12 months.
Flexible workplace autonomy and a human-first approach to flexibility.
Remote-first team environment with a culture of innovation and ownership.
Dynamic progression plan tailored to individual ambitions.
