Company hidden
6 days ago

Member of Technical Staff - ML Infrastructure Engineer (AI)

180 000 - 300 000$
Work format
Remote (Europe/United States only) / hybrid
Employment type
Full-time
Seniority
Senior
English
B2
Country
US/Germany
A listing from Hirify Global, a list of international tech companies

Job description

TL;DR

Member of Technical Staff - ML Infrastructure Engineer (AI/MLOps): designing and maintaining the cloud-based infrastructure for frontier AI research, with an emphasis on training and inference clusters, network-based storage, and IaC. Focus on optimizing GPU resource allocation, reducing training bottlenecks, and building scalable CI/CD pipelines for generative models.

Location: Must be based in or be able to commute to Freiburg (Germany) or San Francisco (USA) for hybrid work (2 days/week) or remote work with a mandatory monthly in-person week.

Salary: $180,000–$300,000 USD

Company

hirify.global is a research lab creating foundational generative models, including FLUX, used by millions of creators and developers worldwide.

What you will do

  • Design, deploy, and maintain cloud-based ML training clusters (Slurm) and inference clusters (Kubernetes).
  • Manage network-based cloud file systems and S3 storage optimized for large-scale ML workloads.
  • Develop and maintain Infrastructure as Code (IaC) using Terraform and Ansible to prevent configuration drift.
  • Implement and optimize CI/CD pipelines for ML workflows to accelerate the path from experiment to production.
  • Design custom autoscaling solutions for ML workloads and ensure security best practices across the stack.
  • Build developer-friendly tools and practices to make ML operations efficient for researchers.

Requirements

  • Strong proficiency in cloud platforms (AWS, Azure, or GCP), with a focus on AI/ML services.
  • Extensive production experience with Kubernetes and Slurm cluster management.
  • Expertise in Infrastructure as Code tools (Terraform, Ansible, etc.).
  • Proven track record managing and optimizing network-based cloud file systems and object storage for ML.
  • Experience with CI/CD tools such as CircleCI, GitHub Actions, or ArgoCD in ML contexts.
  • Ability to join the team in Freiburg or SF at least 2 days a week, or to work remotely with a monthly in-person week.

Nice to have

  • Experience building custom autoscaling solutions for ML workloads.
  • Knowledge of cost optimization strategies for cloud-based GPU infrastructure.
  • Familiarity with MLOps practices, HPC environments, and data versioning.
  • Knowledge of network optimization techniques for distributed ML training.

Culture & Benefits

  • Work in a frontier research lab focused on research excellence and open science.
  • Low-ego culture where the best idea wins and credit is shared.
  • Bold approach to shipping and taking ambitious technical bets.
  • Company covers reasonable travel costs for mandatory in-person connection weeks.
  • High-impact environment within a small team (~50 people) pushing the edge of generative AI.
