Posted 1 day ago
Operations Engineer (HPC Networking)
Job description
TL;DR
Operations Engineer (HPC Networking): Maintain and scale InfiniBand and Ethernet fabrics for generative AI infrastructure, with an emphasis on fabric health, performance tuning, and hardware-level debugging. The role focuses on resolving congestion, link flaps, and NCCL stalls to keep high-performance GPU clusters stable.
Location: Remote
Company
The company is a generative media ecosystem providing the infrastructure and tools needed to scale AI-native products from idea to production.
What you will do
- Monitor and maintain the health of InfiniBand and Ethernet fabrics, including switches, HCAs, and transceivers.
- Investigate and resolve complex fabric issues such as connectivity failures, congestion, and performance regressions.
- Support the bring-up of new fabrics in collaboration with DC operations and customer teams.
- Execute maintenance and upgrades on switches and control plane components.
- Collaborate with cluster operations to resolve incidents that span both compute and network domains.
- Develop tooling and runbooks to improve incident response and operational efficiency.
Requirements
- Experience operating InfiniBand fabrics in production, including subnet manager, routing, and partitioning.
- Ability to debug the full stack, including cables, transceivers, switch firmware, HCAs, and NCCL.
- Proficiency in scripting using Bash, Python, or Go to automate repetitive operational work.
- Experience with fabric bring-up from cable pulling through to validation.
- Extreme attention to detail regarding cable plant hygiene and system reliability.
Nice to have
- Experience with RoCE (RDMA over Converged Ethernet) or Spectrum-X.
- Knowledge of large-scale GPU cluster networking.