Senior Machine Learning Infrastructure Engineer
Job description
TL;DR
Senior Machine Learning Infrastructure Engineer (AI Simulation): Extend and operate the infrastructure powering research model training, fine-tuning, and serving pipelines, with an emphasis on distributed training for neural operators, data I/O optimization, and model deployment. Focus on designing scalable systems on NVIDIA DGX platforms, solving bottlenecks in large-scale mesh datasets, and building reliable serving infrastructure with uncertainty quantification.
Hybrid role based in the Shoreditch office, London, United Kingdom, with remote flexibility.
Company
Deep-tech company with roots in numerical physics and Formula One, building an AI-driven simulation software stack for engineering and manufacturing in Aerospace & Defense, Materials, Energy, Semiconductors, and Automotive.
What you will do
- Design and operate distributed training infrastructure for neural operator architectures on NVIDIA DGX B200, optimizing for throughput, fault tolerance, and cost efficiency.
- Build experiment tracking and observability systems for training runs, hyperparameter sweeps, and model performance.
- Solve data loading bottlenecks for large-scale mesh datasets and optimize I/O pipelines from cloud storage.
- Develop serving infrastructure for pre-trained Large Physics Models, supporting zero-shot inference and uncertainty quantification.
- Implement model packaging pipelines for reliable customer deployment with fine-tuning capabilities and reproducibility.
- Improve developer experience with fast iteration cycles, reliable CI/CD, and debugging tools, collaborating on shared infrastructure standards.
Requirements
- 5+ years building and operating ML infrastructure at scale, with deep expertise in distributed training (NCCL, FSDP/DDP/pipeline parallelism).
- Strong systems fundamentals: Linux, networking (NVLink, InfiniBand), storage I/O, profiling, and performance optimization.
- Production experience with Kubernetes and SLURM for GPU cluster orchestration.
- Proficiency in Python and ML frameworks (PyTorch preferred).
- Experience with cloud GPU infrastructure (e.g., CoreWeave).
- Excellent collaboration and communication skills, especially in research settings, and the ability to scope projects and solve problems quickly.
Nice to have
- Experience with geometric deep learning, neural operators on meshes/point clouds/graphs.
- Background in HPC for simulation engineering (CFD/FEA workflows).
- Experience building model serving with latency/throughput requirements.
- Familiarity with experiment tracking (Weights & Biases, MLflow) and observability (Prometheus, Grafana).
- Experience packaging models for customer environments (containers, registries, versioning).
Culture & Benefits
- Equity options and 10% employer pension contribution.
- Free office lunches, flexible working, and a hybrid setup with remote flexibility, including work-from-anywhere perks.
- Enhanced parental leave and private healthcare.
- Personal development with access to learning and training.
- Commitment to diversity, equal opportunity, and sponsoring women from disadvantaged backgrounds in STEM.