Principal Network Architect (AI Infrastructure)
Job description
TL;DR
Principal Network Architect (AI Infrastructure): Designing and managing high-performance RDMA, InfiniBand, and RoCE fabrics for a global GPU cloud, with an emphasis on reliability, scalability, and operational excellence. Focus on driving automation frameworks, solving complex cross-layer networking issues, and defining long-term interconnect strategies.
Location: Remote (Global). Geography is no barrier to impact or connection.
Company
is a GPU cloud provider engineered for AI, offering cost-effective, high-performance infrastructure for AI startups and large enterprise customers.
What you will do
- Lead the technical direction and operational lifecycle of high-performance RDMA network fabrics.
- Define long-term architecture, reliability strategies, and operational standards for AI interconnect networks.
- Design, build, and evolve large-scale InfiniBand and RoCE fabrics across globally distributed GPU clusters.
- Develop and scale automation frameworks for network provisioning, validation, and low-touch operations.
- Drive deep debugging and resolution of complex cross-layer issues involving hardware, firmware, and kernel.
- Coordinate complex technical initiatives across Network, SRE, Compute, and Platform teams.
Requirements
- 10+ years of experience in network engineering within hyperscale, AI, or HPC environments.
- Deep expertise in RDMA, InfiniBand, and/or large-scale RoCE fabrics.
- Expert-level knowledge of data center routing protocols and techniques such as BGP, OSPF, and ECMP.
- Strong programming skills in Python, Go, or similar for network automation.
- Proven ability to lead complex technical programs and act as a senior escalation point for production issues.
- Systems-level thinking to balance performance, reliability, scalability, and cost.
Nice to have
- Experience with NVIDIA / Mellanox networking platforms.
- Familiarity with distributed AI training frameworks and GPU communication patterns.
- Experience building large-scale network observability systems.
- Background in influencing infrastructure strategy in high-growth environments.
Culture & Benefits
- Competitive compensation package including base salary and equity.
- Performance and salary reviews conducted every 12 months.
- Dynamic career progression plan tailored to individual ambitions.
- Remote-first culture with Human-First Flexibility and high autonomy.
- Collaborative and innovative environment within a fast-growing tech startup.