Software Engineer, TT-Distributed (AI)
Job description
TL;DR
Software Engineer, TT-Distributed (AI): Develop and optimize distributed software systems that power high-performance AI and HPC clusters, with an emphasis on inter-node communication and scalable architectures. The role focuses on designing distributed APIs for data-parallel and tensor-parallel workloads and on scaling programming models across multiple compute nodes.
Location: Hybrid; must be based in Santa Clara (CA), Austin (TX), or Toronto (ON)
Salary: $100k - $500k (including base and variable compensation)
Company
The company is leading the industry in cutting-edge AI technology, revolutionizing performance expectations, ease of use, and cost efficiency.
What you will do
- Architect and optimize distributed software systems that coordinate computation across clusters of AI accelerators and CPUs.
- Design and build distributed APIs enabling data-parallel and tensor-parallel AI workloads.
- Scale programming models across multiple hosts and compute nodes using MPI-based technologies and frameworks.
- Develop robust systems using IPC, inter-node sockets, and distributed communication primitives.
- Build and maintain testing, debugging, profiling, and monitoring tools for large-scale distributed workloads.
Requirements
- Strong proficiency in C or C++ with solid foundations in systems programming and operating systems.
- Deep understanding of distributed systems principles, including IPC, socket programming, and cluster resource coordination.
- Ability to reason about scalability, fault tolerance, and performance in multi-node environments.
- First-principles thinking and motivation to become a technical expert in large-scale AI infrastructure.
- Must be eligible to access U.S. export-controlled technology (compliance with EAR).
Culture & Benefits
- Highly competitive compensation package and comprehensive benefits.
- Opportunity to work with next-generation RISC-V CPU and AI platform technology.
- Collaborative environment focused on solving hard technical problems and challenging conventional designs.
- Direct exposure to the architecture of large-scale distributed inference and training systems.