Назад
Company hidden
6 часов назад

Senior Hpc Cluster Engineer

Формат работы
remote (только USA)
Тип работы
fulltime
Грейд
senior
Английский
b2
Страна
US
Вакансия из списка Hirify.GlobalВакансия из Hirify RU Global, списка компаний с восточно-европейскими корнями
Для мэтча и отклика нужен Plus

Мэтч & Сопровод

Для мэтча с этой вакансией нужен Plus

Описание вакансии

Текст:
/

TL;DR

Senior HPC Cluster Engineer: Enhancing and optimizing the core components of a Cloud platform, with an accent on GPU computing, InfiniBand networks, and the KVM/QEMU stack. Focus on analyzing, troubleshooting, and improving infrastructure to support new hardware, fine-tuning system performance, and automating fault detection and resolution in a complex system.

Location: Remote (United States)

Company

hirify.global is leading a new era in cloud computing to serve the global AI economy.

What you will do

  • Tune the performance of GPU clusters and InfiniBand networks to ensure optimal operation in HPC and GPU-based environments.
  • Analyze and troubleshoot the root cause of issues related to GPUs and InfiniBand networks, and proposing corrective actions.
  • Integrate new hardware into the existing infrastructure, including support for new GPU hardware through software stacks like Kubernetes, QEMU, and KVM.
  • Enhance automation systems for proactive monitoring, detecting, and resolving issues in GPU and InfiniBand environments.
  • Configure and manage GPU devices and InfiniBand fabrics, ensuring efficient and reliable operation.

Requirements

  • 5+ years of professional experience in system-level software development (focused on performance optimization, low-level programming).
  • 3+ years of hands-on experience with Linux systems (administration, troubleshooting, and performance tuning).
  • In-depth understanding of server architecture, including PCIe devices, NICs, Linux OS/Kernel, and high-performance computing (HPC) systems.
  • Strong proficiency in one or more performance-oriented programming languages (C/C++, Go, Python).

Nice to have

  • Experience with GPU end-to-end testing in a cluster environment using InfiniBand networking.
  • Proven track record of analyzing and optimizing the performance of HPC workloads (e.g., simulations, data analysis, AI/ML workloads).
  • Familiarity with RDMA, RoCE, and InfiniBand protocols for high-performance communication.
  • Understanding of QEMU/KVM virtualization and managing virtualized environments.
  • Experience with deep learning frameworks such as PyTorch and TensorFlow, and their integration with HPC systems.
  • Familiarity with collective communication libraries like MPI and NCCL for distributed computing.

Culture & Benefits

  • Competitive salary and comprehensive benefits package.
  • Opportunities for professional growth within hirify.global.
  • Flexible working arrangements.
  • A dynamic and collaborative work environment that values initiative and innovation.

Будьте осторожны: если работодатель просит войти в их систему, используя iCloud/Google, прислать код/пароль, запустить код/ПО, не делайте этого - это мошенники. Обязательно жмите "Пожаловаться" или пишите в поддержку. Подробнее в гайде →