ML Platform Engineer (AI)
Job description
TL;DR
ML Platform Engineer (AI/MLOps): Architecting and scaling infrastructure for foundation-model training and serving, with an emphasis on compute orchestration and model lifecycle management. Focus on building GPU-backed serving and distributed compute layers using Ray and Kubernetes, and on ensuring the reliable transition of models from research to production.
Location: Must be based in the Netherlands or Switzerland (hybrid: at least 50% office time)
Company
Building a next-generation agentic clinical AI assistant to help clinicians reason across patient data and diagnostics.
What you will do
- Design and evolve infrastructure for fast, reliable, and observable ML development using IaC, CI/CD, and Kubernetes.
- Scale GPU workloads across on-prem and cloud clusters using Kubernetes and Ray.
- Own and evolve the AI Factory, specifically the Dagster-based orchestrator and its Ray integration.
- Build and maintain the model lifecycle layer, including experiment tracking, registry, versioning, and GPU-backed serving.
- Collaborate with research and product engineering to translate platform requirements into shared infrastructure.
- Implement engineering rigor through lineage, reproducibility, and comprehensive documentation.
Requirements
- 2-5 years of experience in production ML platform engineering or MLOps.
- Proficiency with Kubernetes, Helm, Terraform, Docker, and CI/CD tooling (ArgoCD, GitHub Actions).
- Experience scheduling GPU workloads on Kubernetes or Ray.
- Hands-on experience with Linux and NVIDIA GPU environments, including multi-node training stacks and InfiniBand.
- Familiarity with the full ML workflow: training runs, experiment tracking (MLflow), and model serving.
- Strong software engineering skills in Python.
Nice to have
- Experience supporting large-scale foundation-model training/inference (vLLM, Triton, TorchServe).
- Knowledge of lower-level GPU communication and I/O (RDMA, GPUDirect, NCCL).
- Experience with Kubernetes-native scheduling for accelerators (Volcano, KAI Scheduler, YuniKorn).
- Experience with high-performance parallel filesystems (Hammerspace, Ceph, WEKA).
- Exposure to MoE architectures or large-scale distributed training.
Culture & Benefits
- Competitive salary, pension plan, and 25 vacation days per year.
- EUR 1000 annual learning and development budget.
- High degree of autonomy and ownership over goals and critical decisions.
- Collaborative, international team environment with an emphasis on ambition.
- Annual commuting subsidy and flexible work arrangements.
Hiring process
- Screening call to align on motivation and initial fit.
- Time-limited coding assessment followed by a live debrief session.
- Deep-dive technical interview focusing on problem-solving and role-specific scenarios.
- Optional onsite meeting and final executive conversation for cultural alignment.