Member Of Technical Staff, Training Engineer (AI)

Work format

remote (Global)

Employment type

full-time

Grade

senior

English

Country

US/Canada

Vacancy from Hirify Global, a list of international tech companies

Job description

TL;DR

Member Of Technical Staff, Training Engineer (AI): Developing and leading end-to-end pre-training of large-scale foundation models for an autonomous AI Physicist, with an emphasis on distributed training and high-throughput data pipelines. The role focuses on optimizing GPU cluster performance, implementing MoE architectures, and ensuring training stability at scale.

Location: Remote (Global)

Company

A global non-profit organization building an autonomous AI Physicist to understand the fundamental laws of the universe.

What you will do

Design and execute large-scale pre-training experiments for both dense and MoE architectures.
Build and harden high-throughput data pipelines for dataset curation, filtering, deduplication, and multimodal ingest.
Operate distributed training infrastructure using FSDP/ZeRO, tensor/pipeline/expert parallelism, and high-speed interconnects.
Optimize performance using FlashAttention-3, FP8, and custom CUDA/Triton kernels to maximize tokens/sec.
Develop observability systems that track throughput, gradient statistics, and loss spikes, and maintain evaluation dashboards.
Collaborate with safety and alignment teams on SFT, RLAIF, and DPO stages.

Requirements

7-12+ years of total experience, including 2+ years training large Transformers (10B-100B+ parameters).
Expert-level proficiency in PyTorch and a strong understanding of CUDA/Triton fundamentals.
Deep experience with distributed frameworks (PyTorch FSDP or DeepSpeed ZeRO) and multi-dimensional parallelism.
Proven track record of managing multi-week training runs, recovering from failures, and delivering stable checkpoints.
Strong applied mathematics background for training stability (optimization, numerics, and learning rate scaling).
Bachelor's or Master's degree in Computer Science, Engineering, or a related field.

Nice to have

Experience with MoE pre-training, including router design and load-balancing.
Accelerator-aware optimization expertise for Hopper/Blackwell architectures.
Knowledge of modern evaluation and safety techniques, including contamination detection.
Experience with inference efficiency strategies such as KV-cache and quantization.

Culture & Benefits

Mission-driven environment tackling one of the greatest scientific challenges in history.
Entrepreneurial, startup-style culture within a global non-profit structure.
Opportunity to contribute to groundbreaking research in physics and artificial intelligence.
Collaborative atmosphere working across research, infrastructure, product, and safety teams.

Hiring process

Submission of a resume, cover letter, and references.
The cover letter must explicitly include the exact role title.
