Member of Technical Staff (Infrastructure) (AI)
Job description
TL;DR
Member of Technical Staff (Infrastructure) (AI): Collaborate with researchers to design and scale GPU infrastructure for training and inference workloads on shared clusters, with an emphasis on scheduling, resource allocation, and utilization optimization. Focus on building automated tooling, observability, and reliability for thousands of GPUs while managing provider relationships and adapting to evolving model architectures.
Location: Fully remote, with a fully distributed, async-first culture. Occasional company meetings a few times per year in London, UK, or North America (LA, Toronto).
Company
AI company building world models for media and entertainment.
What you will do
- Collaborate with researchers and engineers to translate workload requirements into infrastructure decisions.
- Design and improve scheduling and resource allocation for inference and training on shared GPU clusters.
- Build, operate, and scale GPU infrastructure across clusters of thousands of GPUs.
- Own GPU utilization and cost as key metrics.
- Build automated tooling and observability to reduce friction for the AI team.
- Participate in on-call rotation and drive reliability improvements.
- Serve as the primary contact for GPU providers, managing those relationships and the team's needs.
Requirements
- Deep systems foundation: Linux-native, kernel-level debugging, deep understanding of networking and storage stacks.
- Cluster engineering: Experience operating and scaling GPU infrastructure (hundreds to thousands of GPUs), Kubernetes, Slurm, distributed storage.
- Distributed systems fundamentals: Designing, building, and operating at scale.
- Production discipline: Track record of reliable infrastructure with monitoring, incident response, and automation.
- ML familiarity: Enough understanding of training and inference workloads to collaborate effectively with researchers.
Nice to have
- Resource-constrained thinking: Experience in HPC, trading, or large-scale ML platforms.
Culture & Benefits
- Competitive salary and equity.
- Private health coverage; pension contribution (UK, Canada, US).
- Hardware setup of your choice; stipends for phone, internet, and meals.
- Fully distributed, async-first culture with high ownership and dedication.
- Occasional late nights and weekends for mission-critical work.