Member of Technical Staff - ML Infrastructure Engineer (AI)
Job description
TL;DR
Member of Technical Staff - ML Infrastructure Engineer (AI/MLOps): Designing and maintaining the cloud-based infrastructure for frontier AI research, with an emphasis on training and inference clusters, network-based storage, and IaC. Focus on optimizing GPU resource allocation, reducing training bottlenecks, and building scalable CI/CD pipelines for generative models.
Location: Must be based in, or able to commute to, Freiburg (Germany) or San Francisco (USA) for hybrid work (2 days/week), or work remotely with a mandatory monthly in-person week.
Salary: $180,000–$300,000 USD
Company
is a research lab creating foundational generative models, including FLUX, used by millions of creators and developers worldwide.
What you will do
- Design, deploy, and maintain cloud-based ML training clusters (Slurm) and inference clusters (Kubernetes).
- Manage network-based cloud file systems and S3 storage optimized for large-scale ML workloads.
- Develop and maintain Infrastructure as Code (IaC) using Terraform and Ansible to prevent configuration drift.
- Implement and optimize CI/CD pipelines for ML workflows to accelerate the path from experiment to production.
- Design custom autoscaling solutions for ML workloads and ensure security best practices across the stack.
- Build developer-friendly tools and practices to make ML operations efficient for researchers.
Requirements
- Strong proficiency in cloud platforms (AWS, Azure, or GCP) focusing on AI/ML services.
- Extensive production experience with Kubernetes and Slurm cluster management.
- Expertise in Infrastructure as Code tools (Terraform, Ansible, etc.).
- Proven track record managing and optimizing network-based cloud file systems and object storage for ML.
- Experience with CI/CD tools such as CircleCI, GitHub Actions, or ArgoCD in ML contexts.
- Willingness to join the team on-site in Freiburg or San Francisco at least 2 days a week, or to work remotely with a monthly in-person week.
Nice to have
- Experience building custom autoscaling solutions for ML workloads.
- Knowledge of cost optimization strategies for cloud-based GPU infrastructure.
- Familiarity with MLOps practices, HPC environments, and data versioning.
- Knowledge of network optimization techniques for distributed ML training.
Culture & Benefits
- Work in a frontier research lab focused on research excellence and open science.
- Low-ego culture where the best idea wins and credit is shared.
- Bold approach to shipping and taking ambitious technical bets.
- Company covers reasonable travel costs for mandatory in-person connection weeks.
- High-impact environment within a small team (~50 people) pushing the edge of generative AI.