TL;DR
Member of Technical Staff, Pre-Training Infrastructure (AI): Contribute to a fast-moving codebase that enables training at unprecedented scale, with an emphasis on building and optimizing the software stack for massive GPU clusters and high-throughput storage systems. The role focuses on profiling, benchmarking, debugging, and fine-grained optimization, and demands both engineering rigor and creativity.
Location: You must work from a designated hirify.global office at least four days a week if you live within 50 miles (U.S.) or 25 miles (non-U.S.) of that location.
Salary: USD $139,900 – $274,800 per year.
Company
hirify.global’s mission is to empower every person and every organization on the planet to achieve more.
What you will do
- Design, implement, test, and optimize distributed training infrastructure in Python and C++ for large-scale GPU clusters.
- Profile, benchmark, and debug performance bottlenecks across compute, memory, networking, and storage subsystems.
- Optimize collective communication libraries (e.g., NCCL) for emerging NVLink and InfiniBand topologies.
- Collaborate with hardware teams to optimize for next-generation accelerators (NVIDIA, AMD, and beyond).
- Gather data and insights to develop the pretraining compute roadmap.
- Actively contribute to the development of AI models powering our innovative products.
Requirements
- Bachelor’s Degree in Computer Science or related technical field AND 6+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python OR equivalent experience.
- Experience in distributed computing and large-scale systems.
- Experience with GPU programming (CUDA, NCCL) and frameworks such as PyTorch.
- Proven ability to profile, benchmark, and optimize performance-critical systems.
- Experience in leading technical projects and supporting architectural decisions with data.
Nice to have
- Master’s Degree in Computer Science or related technical field AND 8+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python; OR Bachelor’s Degree in Computer Science or related technical field AND 12+ years technical engineering experience with coding in languages including, but not limited to, C, C++, C#, Java, JavaScript, or Python; OR equivalent experience.
- Experience building infrastructure for large-scale machine learning or generative AI workloads.
- Experience in networking (InfiniBand, NVLink), storage systems, or distributed training parallelism strategies.
- Track record of contributing to high-performance computing or large-scale AI infrastructure projects.
Culture & Benefits
- Come together with a growth mindset, innovate to empower others, and collaborate to realize our shared goals.
- Build on our values of respect, integrity, and accountability to create a culture of inclusion where everyone can thrive at work and beyond.