TL;DR
Member of Technical Staff, Pre-Training Infrastructure (AI): Build and optimize distributed training infrastructure for frontier-scale AI models, with an emphasis on massive GPU clusters and high-throughput storage systems. The role focuses on scaling research recipes, implementing distributed training parallelism, and ensuring the reliability and performance of supercomputing fleets.
Location: Redmond, United States. Employees in the U.S. who live within 50 miles of a designated Microsoft office are expected to work from that office at least four days a week.
Salary: USD $139,900 – $331,200 per year
Company
hirify.global is dedicated to advancing Copilot and other consumer AI products and research.
What you will do
- Design, implement, test, and optimize distributed training infrastructure in Python and C++ for large-scale GPU clusters.
- Profile, benchmark, and debug performance bottlenecks across compute, memory, networking, and storage subsystems.
- Optimize collective communication libraries (e.g., NCCL) for emerging NVLink and InfiniBand topologies.
- Collaborate with hardware teams to optimize for next-generation accelerators.
- Gather data and insights to develop the pretraining compute roadmap.
- Actively contribute to the development of AI models powering innovative products.
Requirements
- Bachelor’s Degree in Computer Science or a related technical field AND 6+ years of technical engineering experience coding in C, C++, C#, Java, JavaScript, or Python, OR equivalent experience.
- Experience in distributed computing and large-scale systems.
- Experience with GPU programming (CUDA, NCCL) and frameworks such as PyTorch.
- Proven ability to profile, benchmark, and optimize performance-critical systems.
- Experience building infrastructure for large-scale machine learning or generative AI workloads.
Nice to have
- Master’s Degree in Computer Science or a related technical field AND 8+ years of experience, OR Bachelor’s Degree AND 12+ years of experience.
- Experience in leading technical projects and supporting architectural decisions with data.
- Experience in networking (InfiniBand, NVLink), storage systems, or distributed training parallelism.
- Track record of contributing to high-performance computing or large-scale AI infrastructure projects.
Culture & Benefits
- Be part of the team shaping the future of personal computing with Copilot, Bing, Edge, and generative AI research.
- Work with a growth mindset, innovate to empower others, and collaborate to realize shared goals.
- Build on values of respect, integrity, and accountability to create a culture of inclusion.