Company hidden
2 days ago

Senior Machine Learning Infrastructure Engineer

Work format
hybrid
Employment type
full-time
Grade
senior
English
B2
Country
UK
Vacancy from Hirify Global, a list of international tech companies

Job description

TL;DR

Senior Machine Learning Infrastructure Engineer (AI Simulation): Extend and operate the infrastructure powering research model training, fine-tuning, and serving pipelines, with an emphasis on distributed training for neural operators, data I/O optimization, and model deployment. The role focuses on designing scalable systems on NVIDIA DGX platforms, solving bottlenecks in large-scale mesh datasets, and building reliable serving infrastructure with uncertainty quantification.

Hybrid setup based at the Shoreditch office in London, United Kingdom, with remote flexibility.

Company

Deep-tech company with roots in numerical physics and Formula One, building an AI-driven simulation software stack for engineering and manufacturing in Aerospace & Defense, Materials, Energy, Semiconductors, and Automotive.

What you will do

  • Design and operate distributed training infrastructure for neural operator architectures on NVIDIA DGX B200, optimizing for throughput, fault tolerance, and cost efficiency.
  • Build experiment tracking and observability systems for training runs, hyperparameter sweeps, and model performance.
  • Solve data loading bottlenecks for large-scale mesh datasets and optimize I/O pipelines from cloud storage.
  • Develop serving infrastructure for pre-trained Large Physics Models, supporting zero-shot inference and uncertainty quantification.
  • Implement model packaging pipelines for reliable customer deployment with fine-tuning capabilities and reproducibility.
  • Improve developer experience with fast iteration cycles, reliable CI/CD, and debugging tools, collaborating on shared infrastructure standards.

Requirements

  • 5+ years building and operating ML infrastructure at scale, with deep expertise in distributed training (NCCL, FSDP/DDP/pipeline parallelism).
  • Strong systems fundamentals: Linux, networking (NVLink, InfiniBand), storage I/O, profiling, and performance optimization.
  • Production experience with Kubernetes and SLURM for GPU cluster orchestration.
  • Proficiency in Python and ML frameworks (PyTorch preferred).
  • Experience with cloud GPU infrastructure (e.g., CoreWeave).
  • Excellent collaboration and communication skills, especially in research settings, and the ability to scope projects and solve problems quickly.

Nice to have

  • Experience with geometric deep learning and neural operators on meshes, point clouds, or graphs.
  • Background in HPC for simulation engineering (CFD/FEA workflows).
  • Experience building model serving with latency/throughput requirements.
  • Familiarity with experiment tracking (Weights & Biases, MLflow) and observability (Prometheus, Grafana).
  • Experience packaging models for customer environments (containers, registries, versioning).

Culture & Benefits

  • Equity options and 10% employer pension contribution.
  • Free office lunches, flexible working, and a hybrid setup with remote flexibility, including work-from-anywhere perks.
  • Enhanced parental leave and private healthcare.
  • Personal development with access to learning and training.
  • Commitment to diversity, equal opportunity, and sponsoring women from disadvantaged backgrounds in STEM.
