Member Of Technical Staff, Training Engineer (AI)
Job description
TL;DR
Member Of Technical Staff, Training Engineer (AI): Develop and lead end-to-end pre-training of large-scale foundation models for an autonomous AI Physicist, with an emphasis on distributed training and high-throughput data pipelines. The role focuses on optimizing GPU cluster performance, implementing MoE architectures, and ensuring training stability at scale.
Location: Remote (Global)
Company
A global non-profit organization building an autonomous AI Physicist to understand the fundamental laws of the universe.
What you will do
- Design and execute large-scale pre-training experiments for both dense and MoE architectures.
- Build and harden high-throughput data pipelines for dataset curation, filtering, deduplication, and multimodal ingest.
- Operate distributed training infrastructure using FSDP/ZeRO, tensor/pipeline/expert parallelism, and high-speed interconnects.
- Optimize performance using FlashAttention-3, FP8, and custom CUDA/Triton kernels to maximize tokens/sec.
- Develop observability systems, including evaluation dashboards, to monitor throughput, gradient statistics, and loss spikes.
- Collaborate with safety and alignment teams on SFT, RLAIF, and DPO stages.
Requirements
- 7-12+ years of total experience, including 2+ years training large Transformers (10B-100B+ parameters).
- Expert-level proficiency in PyTorch and a strong understanding of CUDA/Triton fundamentals.
- Deep experience with distributed frameworks (PyTorch FSDP or DeepSpeed ZeRO) and multi-dimensional parallelism.
- Proven track record of managing multi-week training runs, recovering from failures, and delivering stable checkpoints.
- Strong applied mathematics background for training stability (optimization, numerics, and learning rate scaling).
- Bachelor's or Master's degree in Computer Science, Engineering, or a related field.
Nice to have
- Experience with MoE pre-training, including router design and load-balancing.
- Accelerator-aware optimization expertise for Hopper/Blackwell architectures.
- Knowledge of modern evaluation and safety techniques, including contamination detection.
- Experience with inference efficiency strategies such as KV caching and quantization.
Culture & Benefits
- Mission-driven environment tackling one of the greatest scientific challenges in history.
- Entrepreneurial, startup-style culture within a global non-profit structure.
- Opportunity to contribute to groundbreaking research in physics and artificial intelligence.
- Collaborative atmosphere working across research, infrastructure, product, and safety teams.
Hiring process
- Submission of a resume, cover letter, and references.
- The cover letter must explicitly include the exact role title.