6 часов назад
Senior Hpc Cluster Engineer
Мэтч & Сопровод
Для мэтча с этой вакансией нужен Plus
Описание вакансии
Текст:
TL;DR
Senior HPC Cluster Engineer: Enhancing and optimizing the core components of a Cloud platform, with an accent on GPU computing, InfiniBand networks, and the KVM/QEMU stack. Focus on analyzing, troubleshooting, and improving infrastructure to support new hardware, fine-tuning system performance, and automating fault detection and resolution in a complex system.
Location: Remote (United States)
Company
is leading a new era in cloud computing to serve the global AI economy.
What you will do
- Tune the performance of GPU clusters and InfiniBand networks to ensure optimal operation in HPC and GPU-based environments.
- Analyze and troubleshoot the root cause of issues related to GPUs and InfiniBand networks, and proposing corrective actions.
- Integrate new hardware into the existing infrastructure, including support for new GPU hardware through software stacks like Kubernetes, QEMU, and KVM.
- Enhance automation systems for proactive monitoring, detecting, and resolving issues in GPU and InfiniBand environments.
- Configure and manage GPU devices and InfiniBand fabrics, ensuring efficient and reliable operation.
Requirements
- 5+ years of professional experience in system-level software development (focused on performance optimization, low-level programming).
- 3+ years of hands-on experience with Linux systems (administration, troubleshooting, and performance tuning).
- In-depth understanding of server architecture, including PCIe devices, NICs, Linux OS/Kernel, and high-performance computing (HPC) systems.
- Strong proficiency in one or more performance-oriented programming languages (C/C++, Go, Python).
Nice to have
- Experience with GPU end-to-end testing in a cluster environment using InfiniBand networking.
- Proven track record of analyzing and optimizing the performance of HPC workloads (e.g., simulations, data analysis, AI/ML workloads).
- Familiarity with RDMA, RoCE, and InfiniBand protocols for high-performance communication.
- Understanding of QEMU/KVM virtualization and managing virtualized environments.
- Experience with deep learning frameworks such as PyTorch and TensorFlow, and their integration with HPC systems.
- Familiarity with collective communication libraries like MPI and NCCL for distributed computing.
Culture & Benefits
- Competitive salary and comprehensive benefits package.
- Opportunities for professional growth within .
- Flexible working arrangements.
- A dynamic and collaborative work environment that values initiative and innovation.
Будьте осторожны: если работодатель просит войти в их систему, используя iCloud/Google, прислать код/пароль, запустить код/ПО, не делайте этого - это мошенники. Обязательно жмите "Пожаловаться" или пишите в поддержку. Подробнее в гайде →