Training Performance Engineer (AI)
Job description
TL;DR
Training Performance Engineer (AI): Optimizing large-scale foundation model training on Blackwell clusters with an emphasis on kernel-level performance, throughput, and cluster fabric efficiency. Focus on solving complex challenges in low-precision training, MoE parallelism, and custom attention-variant kernels to improve MFU (model FLOPs utilization) and uptime.
Location: Must be based in the Netherlands or Switzerland, with an expectation of at least 50% time in the office.
Company
The company is a well-funded startup building a next-generation agentic clinical AI assistant designed to support complex diagnostic workflows and clinical decision-making.
What you will do
- Instrument and analyze training runs to identify and close utilization gaps.
- Benchmark NCCL collectives over InfiniBand and NVLink to optimize fabric performance.
- Drive low-precision training initiatives and validate performance gains.
- Tune MoE parallelism strategies (TP/PP/CP/EP/DP) to minimize communication costs.
- Implement and integrate custom attention-variant kernels into the training stack.
Requirements
- Must be based in the Netherlands or Switzerland.
- Deep experience with GPU systems, including kernel-level CUDA or Triton.
- Proficiency with CUTLASS, Flash Attention, PyTorch, and Nsight profiling.
- Production experience with NCCL on high-bandwidth interconnects like InfiniBand.
- Strong understanding of parallelism strategies under memory and MFU constraints.
Nice to have
- Experience with low-precision training (FP8, dynamic loss scaling).
- Knowledge of sparse, hybrid, or MLA attention at the kernel level.
- Proven track record of shipping large-scale MoE training in production.
- Experience with Megatron or NeMo frameworks.
Culture & Benefits
- Competitive salary and pension plan.
- 25 days of annual vacation.
- EUR 1000 annual learning and development budget.
- Regular offsites and team events.
- Flexible work environment with commuting subsidy.
Hiring process
- Screening call to align on motivation and fit.
- Technical take-home assessment.
- Technical assessment debrief and collaboration discussion.
- Final onsite interview to discuss long-term alignment.