Member of Engineering (Scalability, AI)
Job description
TL;DR
Member of Engineering (Scalability) (AI/LLMs): Building distributed training and inference infrastructure for Large Language Models, with an emphasis on software reliability, fault tolerance, and hardware fault detection. Focus on cross-platform checkpointing, NCCL recovery, minimizing GPU idle time during faults, and developing tools for training recovery.
Location: Remote (EMEA/East Coast). Monthly in-person collaboration in Paris (Mon-Wed, optional).
Company
The company aims to reach AGI by accelerating software development with agentic AI systems and frontier models deployed into enterprise development environments.
What you will do
- Identify, study, and troubleshoot hardware problems during large-scale training.
- Minimize GPU idle time during faults operationally and strategically.
- Design and develop tools and add-ons to accelerate training recovery.
- Improve performance and reliability of checkpointing.
- Write high-quality Python (PyTorch), Cython, C/C++, and CUDA code.
Requirements
- Understanding of LLMs, Transformers, and deep learning fundamentals.
- Strong engineering background with Linux API/kernel experience.
- Programming: Python (NumPy, PyTorch/JAX), C/C++, NCCL, strong algorithms.
- Distributed systems: reliability, observability, fault-tolerance, K8s.
- Fast learner: ready for a steep learning curve, comfortable with modern tools, strong critical thinking.
Culture & Benefits
- Fully remote with flexible hours.
- 37 days/year vacation & holidays.
- Health insurance allowance for you & dependents.
- Company equipment, well-being/learning/home office allowances.
- Frequent team get-togethers, diverse inclusive culture.
Hiring process
- Intro call with Founding Engineer.
- Technical interview(s) with Founding Engineer.
- Team fit call with People team.
- Final interview with Founding Engineer.