Company hidden
2 days ago

Member Of Technical Staff, Training Engineer (AI)

Work format
remote (Global)
Employment type
full-time
Grade
senior
English
B2
Country
US/Canada
Listed in Hirify Global, a list of international tech companies

Job description

TL;DR

Member Of Technical Staff, Training Engineer (AI): Developing and leading end-to-end pre-training of large-scale foundation models for an autonomous AI Physicist, with an emphasis on distributed training and high-throughput data pipelines. Focus on optimizing GPU cluster performance, implementing MoE architectures, and ensuring training stability at scale.

Location: Remote (Global)

Company

A global non-profit organization building an autonomous AI Physicist to understand the fundamental laws of the universe.

What you will do

  • Design and execute large-scale pre-training experiments for both dense and MoE architectures.
  • Build and harden high-throughput data pipelines for dataset curation, filtering, deduplication, and multimodal ingest.
  • Operate distributed training infrastructure using FSDP/ZeRO, tensor/pipeline/expert parallelism, and high-speed interconnects.
  • Optimize performance using FlashAttention-3, FP8, and custom CUDA/Triton kernels to maximize tokens/sec.
  • Develop observability systems that track throughput, gradient statistics, and loss spikes, and maintain evaluation dashboards.
  • Collaborate with safety and alignment teams on SFT, RLAIF, and DPO stages.

Requirements

  • 7-12+ years of total experience, including 2+ years training large Transformers (10B-100B+ parameters).
  • Expert-level proficiency in PyTorch and a strong understanding of CUDA/Triton fundamentals.
  • Deep experience with distributed frameworks (PyTorch FSDP or DeepSpeed ZeRO) and multi-dimensional parallelism.
  • Proven track record of managing multi-week training runs, recovering from failures, and delivering stable checkpoints.
  • Strong applied mathematics background for training stability (optimization, numerics, and learning rate scaling).
  • Bachelor's or Master's degree in Computer Science, Engineering, or a related field.

Nice to have

  • Experience with MoE pre-training, including router design and load-balancing.
  • Accelerator-aware optimization expertise for Hopper/Blackwell architectures.
  • Knowledge of modern evaluation and safety techniques, including contamination detection.
  • Experience with inference efficiency strategies such as KV-cache and quantization.

Culture & Benefits

  • Mission-driven environment tackling one of the greatest scientific challenges in history.
  • Entrepreneurial, startup-style culture within a global non-profit structure.
  • Opportunity to contribute to groundbreaking research in physics and artificial intelligence.
  • Collaborative atmosphere working across research, infrastructure, product, and safety teams.

Hiring process

  • Submission of a resume, cover letter, and references.
  • The cover letter must explicitly include the exact role title.
