Company hidden
19 hours ago

Senior Software Engineer (AI Middleware)

Work format
remote (USA only)
Employment type
full-time
Grade
senior
English
B2
Country
US
Vacancy from Hirify Global, a list of international tech companies

Job description

TL;DR

Senior Software Engineer (AI Middleware): Design, develop, and optimize AI communication middleware for high-performance networking in AI/HPC datacenters, with an emphasis on enabling collective communication libraries such as NCCL/RCCL over custom interconnects. The role focuses on profiling distributed AI workloads, tuning frameworks such as PyTorch Distributed and DeepSpeed, and contributing upstream to open-source projects.

Location: Remote for employees residing within the United States.

Company

Delivering high-performance scale-out networking solutions for AI and HPC datacenters, integrating hardware, software, and system technologies for GPU/CPU clusters.

What you will do

  • Design and implement performance-critical features for CCL enablement on the company's fabrics.
  • Optimize distributed training across multi-node, multi-GPU setups, including GPU-direct transfers and synchronization.
  • Profile AI workloads to identify bottlenecks in software/hardware stacks.
  • Tune AI frameworks like PyTorch Distributed, TensorFlow/XLA, JAX, DeepSpeed, and Megatron-LM.
  • Develop benchmarks aligned with real model performance and contribute upstream to AI projects.
  • Collaborate with kernel/driver, switch, performance, and systems teams on design reviews and escalations.

Requirements

  • Must reside within the United States (remote position).
  • 8+ years of experience in high-performance systems programming in C/C++ on Linux.
  • Strong experience with GPU communication stacks including CUDA/ROCm and NCCL/RCCL.
  • Ability to optimize distributed training using profiling and tracing.
  • Understanding of collective communication and topology awareness.
  • Experience delivering production-quality code and open-source contributions.

Nice to have

  • Experience with AI frameworks like PyTorch Distributed, DeepSpeed, Megatron-LM.
  • Familiarity with libfabric/OFI, UCX, RDMA, RoCEv2, and Ultra Ethernet.
  • Building cluster-scale performance test infrastructure.

Culture & Benefits

  • Competitive compensation with equity, cash incentives, medical/dental/vision, disability/life insurance.
  • 401(k) with company match, Open Time Off (OTO), sick time, bonding/pregnancy leave.
  • Flexible work environment with onsite, hybrid, and fully remote roles in a global team.
  • Opportunity to collaborate with leaders in the semiconductor industry.
