TL;DR
Software Engineer (GPU Networking): Build and optimize high-performance GPU networking and distributed systems for AI inference, with an emphasis on integrating RDMA capabilities and co-optimizing communication alongside computation. You will architect the software fabric that unifies thousands of GPUs, enable serverless-grade startup speeds for LLMs, and deep-dive into bleeding-edge hardware performance.
Location: Onsite in San Francisco, US
Salary: $150,000–$250,000 annually, with equity
Company
hirify.global is a fast-growing product company that powers mission-critical AI inference for leading AI companies.
What you will do
- Integrate RDMA/RoCE/InfiniBand capabilities directly into the inference stack to achieve order-of-magnitude improvements in bandwidth and latency.
- Implement and tune networking layers for efficient Disaggregated KV Cache Offload and Wide Expert Parallelism (WideEP) for MoE models.
- Enable sub-10-second startup for trillion-parameter models by working deeply with checkpointing and storage mechanisms.
- Characterize and validate networking performance on bleeding-edge GPU clusters (H100/H200, B200/B300, GB200/GB300 NVL72).
- Design tools to visualize packet flow, congestion, and effective bandwidth across GPU interconnects for diagnosing distributed system behaviors.
- Work with communication libraries (NCCL, NVSHMEM) and potentially write custom communication kernels to overlap compute and data transfer.
Requirements
- Deep experience with high-performance networking protocols (InfiniBand, RoCE v2).
- Proficiency in C++ or Python, with the ability to bridge high-level logic and hardware.
- Deep understanding of the memory hierarchy in modern NVIDIA architectures (H100/Blackwell) and optimization skills.
- Ability to deep-dive into TensorRT-LLM source code, write custom C++/Python bindings, or debug NVLink topology issues.
- Proven ability to build custom solutions when off-the-shelf tools are insufficient for performance needs.
- Work onsite in San Francisco, US.
Nice to have
- Deep knowledge of NCCL, NVSHMEM, and UCX.
- Experience with GPUDirect Storage (GDS) or high-performance filesystems like Weka or 3FS.
- Familiarity with TensorRT-LLM, vLLM, or SGLang.
- Experience running low-level benchmarks to qualify new hardware clusters.
Culture & Benefits
- Competitive compensation, including meaningful equity.
- 100% coverage of medical, dental, and vision insurance for employees and their dependents.
- Generous PTO policy, including a company-wide Winter Break.
- Paid parental leave.
- Company-facilitated 401(k).
- Exposure to a variety of ML startups, offering unparalleled learning and networking opportunities.
- Opportunity to work with bleeding-edge hardware, such as NVIDIA's Blackwell (B200/B300) and Rubin architectures.