Member of Technical Staff - ML Infrastructure Engineer (AI)
Job description
TL;DR
Member of Technical Staff - ML Infrastructure Engineer (AI/MLOps): Designing and maintaining the cloud-based infrastructure for frontier AI research, with an emphasis on training and inference clusters, network-based storage, and IaC. Focus on optimizing GPU resource allocation, reducing training bottlenecks, and building scalable CI/CD pipelines for generative models.
Location: Must be based in, or able to commute to, Freiburg (Germany) or San Francisco (USA) for hybrid work (2 days/week), or work remotely with a mandatory monthly in-person week.
Salary: $180,000–$300,000 USD
Company
is a research lab creating foundational generative models, including FLUX, used by millions of creators and developers worldwide.
What you will do
- Design, deploy, and maintain cloud-based ML training clusters (Slurm) and inference clusters (Kubernetes).
- Manage network-based cloud file systems and S3 storage optimized for large-scale ML workloads.
- Develop and maintain Infrastructure as Code (IaC) using Terraform and Ansible to prevent configuration drift.
- Implement and optimize CI/CD pipelines for ML workflows to accelerate the path from experiment to production.
- Design custom autoscaling solutions for ML workloads and ensure security best practices across the stack.
- Build developer-friendly tools and practices to make ML operations efficient for researchers.
Requirements
- Strong proficiency in cloud platforms (AWS, Azure, or GCP) focusing on AI/ML services.
- Extensive production experience with Kubernetes and Slurm cluster management.
- Expertise in Infrastructure as Code tools (Terraform, Ansible, etc.).
- Proven track record managing and optimizing network-based cloud file systems and object storage for ML.
- Experience with CI/CD tools such as CircleCI, GitHub Actions, or ArgoCD in ML contexts.
- Willingness to join the team on-site in Freiburg or San Francisco at least 2 days a week, or to work remotely with a monthly in-person week.
Nice to have
- Experience building custom autoscaling solutions for ML workloads.
- Knowledge of cost optimization strategies for cloud-based GPU infrastructure.
- Familiarity with MLOps practices, HPC environments, and data versioning.
- Knowledge of network optimization techniques for distributed ML training.
Culture & Benefits
- Work in a frontier research lab focused on research excellence and open science.
- Low-ego culture where the best idea wins and credit is shared.
- Bold approach to shipping and taking ambitious technical bets.
- Company covers reasonable travel costs for mandatory in-person connection weeks.
- High-impact environment within a small team (~50 people) pushing the edge of generative AI.