Senior Machine Learning Infrastructure Engineer

Work format

hybrid

Employment type

full-time

Grade

senior

Job description

TL;DR

Senior Machine Learning Infrastructure Engineer (AI Simulation): Extend and operate the infrastructure that powers research model training, fine-tuning, and serving pipelines, with an emphasis on distributed training for neural operators, data I/O optimization, and model deployment. The role focuses on designing scalable systems on NVIDIA DGX platforms, resolving data loading bottlenecks for large-scale mesh datasets, and building reliable serving infrastructure with uncertainty quantification.

Hybrid setup based at the Shoreditch office in London, United Kingdom, with remote flexibility.

Company

Deep-tech company with roots in numerical physics and Formula One, building an AI-driven simulation software stack for engineering and manufacturing in Aerospace & Defense, Materials, Energy, Semiconductors, and Automotive.

What you will do

Design and operate distributed training infrastructure for neural operator architectures on NVIDIA DGX B200, optimizing for throughput, fault tolerance, and cost efficiency.
Build experiment tracking and observability systems for training runs, hyperparameter sweeps, and model performance.
Solve data loading bottlenecks for large-scale mesh datasets and optimize I/O pipelines from cloud storage.
Develop serving infrastructure for pre-trained Large Physics Models, supporting zero-shot inference and uncertainty quantification.
Implement model packaging pipelines for reliable customer deployment with fine-tuning capabilities and reproducibility.
Improve developer experience with fast iteration cycles, reliable CI/CD, and debugging tools, collaborating on shared infrastructure standards.

Requirements

5+ years building and operating ML infrastructure at scale, with deep expertise in distributed training (NCCL, FSDP/DDP/pipeline parallelism).
Strong systems fundamentals: Linux, networking (NVLink, InfiniBand), storage I/O, profiling, and performance optimization.
Production experience with Kubernetes and SLURM for GPU cluster orchestration.
Proficiency in Python and ML frameworks (PyTorch preferred).
Experience with cloud GPU infrastructure (e.g., CoreWeave).
Excellent collaboration and communication skills, especially in research settings, and the ability to scope projects and solve problems quickly.

Nice to have

Experience with geometric deep learning and neural operators on meshes, point clouds, or graphs.
Background in HPC for simulation engineering (CFD/FEA workflows).
Experience building model serving with latency/throughput requirements.
Familiarity with experiment tracking (Weights & Biases, MLflow) and observability (Prometheus, Grafana).
Experience packaging models for customer environments (containers, registries, versioning).

Culture & Benefits

Equity options and 10% employer pension contribution.
Free office lunches, flexible working, and a hybrid setup with remote flexibility, including work-from-anywhere perks.
Enhanced parental leave and private healthcare.
Personal development with access to learning and training.
Commitment to diversity, equal opportunity, and sponsoring women from disadvantaged backgrounds in STEM.
