TL;DR
Member of Technical Staff, Pre-Training Infrastructure (AI): Develop and optimize distributed training infrastructure for the large-scale GPU clusters powering AI models such as Copilot, with a focus on performance optimization, distributed training parallelism, and large-scale system reliability. The role centers on designing and debugging high-throughput storage, networking, and compute subsystems that support frontier-scale AI research and supercomputing.
Location: New York, United States, onsite; employees living within 50 miles are expected to work from the office at least four days a week
Salary: $220,800–$331,200 per year (New York City metropolitan area range)
Company
hirify.global is dedicated to advancing consumer AI products and research, including Copilot, Bing, Edge, and generative AI.
What you will do
- Design, implement, test, and optimize distributed training infrastructure in Python and C++ for large GPU clusters.
- Profile, benchmark, and debug performance bottlenecks across compute, memory, networking, and storage subsystems.
- Optimize collective communication libraries for emerging hardware interconnects such as NVLink and InfiniBand.
- Collaborate with hardware teams to optimize for next-generation accelerators (NVIDIA, AMD, etc.).
- Develop the pretraining compute roadmap based on data and insights.
- Contribute to AI model development powering innovative products.
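As context for the collective-communication work above: libraries like NCCL implement all-reduce over a ring of GPUs as a reduce-scatter phase followed by an all-gather phase. The sketch below simulates that algorithm in plain Python for illustration only; the function name and single-process simulation are this sketch's own, not NCCL's API.

```python
def ring_all_reduce(node_values):
    """Simulate a ring all-reduce over n 'nodes', each holding a vector
    whose length is divisible by n. Afterwards every node holds the
    element-wise sum of all vectors."""
    n = len(node_values)
    chunk = len(node_values[0]) // n
    bufs = [list(v) for v in node_values]

    # Phase 1: reduce-scatter. At step s, node i sends chunk (i - s) mod n
    # to node i + 1, which adds it into its own copy. After n - 1 steps,
    # node i holds the fully reduced chunk (i + 1) mod n.
    for step in range(n - 1):
        msgs = []
        for i in range(n):
            c = (i - step) % n
            lo, hi = c * chunk, (c + 1) * chunk
            msgs.append(((i + 1) % n, lo, bufs[i][lo:hi]))
        for dst, lo, payload in msgs:
            for k, v in enumerate(payload):
                bufs[dst][lo + k] += v

    # Phase 2: all-gather. At step s, node i forwards the reduced chunk
    # (i + 1 - s) mod n to node i + 1, which overwrites its copy.
    for step in range(n - 1):
        msgs = []
        for i in range(n):
            c = (i + 1 - step) % n
            lo, hi = c * chunk, (c + 1) * chunk
            msgs.append(((i + 1) % n, lo, bufs[i][lo:hi]))
        for dst, lo, payload in msgs:
            bufs[dst][lo:lo + len(payload)] = payload

    return bufs
```

Each node sends and receives only 2(n - 1)/n of the vector in total, which is why the ring pattern scales to large clusters: bandwidth per node stays roughly constant as n grows.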
Requirements
- Location: Must be based in or near New York; onsite work is expected.
- Bachelor’s degree in Computer Science or related field with 6+ years of technical engineering experience or equivalent.
- Experience in distributed computing, GPU programming (CUDA, NCCL), and frameworks like PyTorch.
- Proven ability to profile, benchmark, and optimize performance-critical systems.
- Experience leading technical projects and supporting architectural decisions.
- Experience building infrastructure for large-scale machine learning or generative AI workloads.
Nice to have
- Master’s degree with 8+ years of technical engineering experience, or equivalent experience.
- Experience with networking (InfiniBand, NVLink), storage systems, and distributed training parallelisms.
- Track record in high-performance computing or large-scale AI infrastructure projects.
Culture & Benefits
- Work in a fast-paced, design-driven product development cycle.
- Collaborate with a team dedicated to advancing AI and personal computing.
- Embody values of respect, integrity, accountability, and inclusion.
- Opportunity to work on cutting-edge AI infrastructure and research.