Company hidden
1 day ago

Operations Engineer (HPC Networking)

Work format
remote
Employment type
full-time
English
B2
Listed on Hirify.Global, a list of international tech companies

Job description

TL;DR

Operations Engineer (HPC Networking): Maintaining and scaling InfiniBand and Ethernet fabrics for generative AI infrastructure, with an emphasis on fabric health, performance tuning, and hardware-level debugging. The role focuses on resolving congestion, link flaps, and NCCL stalls to keep high-performance GPU clusters stable.

Location: Remote

Company

hirify.global is a generative media ecosystem providing the infrastructure and tools needed to scale AI-native products from idea to production.

What you will do

  • Monitor and maintain the health of InfiniBand and Ethernet fabrics, including switches, HCAs, and transceivers.
  • Investigate and resolve complex fabric issues such as connectivity, congestion, and performance regressions.
  • Support the bring-up of new fabrics in collaboration with DC operations and customer teams.
  • Execute maintenance and upgrades on switches and control plane components.
  • Collaborate with cluster operations to resolve incidents that span the compute and network domains.
  • Develop tooling and runbooks to improve incident response and operational efficiency.

Requirements

  • Experience operating InfiniBand fabrics in production, including subnet manager, routing, and partitioning.
  • Ability to debug the full stack, including cables, transceivers, switch firmware, HCAs, and NCCL.
  • Proficiency in scripting using Bash, Python, or Go to automate repetitive operational work.
  • Experience with fabric bring-up from cable pulling through to validation.
  • Extreme attention to detail regarding cable plant hygiene and system reliability.

Nice to have

  • Experience with Ethernet RoCE or Spectrum-X.
  • Knowledge of large-scale GPU cluster networking.
