Company hidden
2 days ago

Member of Technical Staff, DevOps/Infrastructure Engineering (AI)

Work format
remote (Global)
Job type
full-time
Grade
middle/senior
English
B2
Vacancy from Hirify Global, a list of international tech companies

Job description

TL;DR

Member of Technical Staff, DevOps / Infrastructure Engineering (AI): Architect hybrid cloud and HPC infrastructure for large-scale model training workflows, with an emphasis on automation, CI/CD pipelines, and GPU cluster management. The role focuses on designing reliable systems for multi-week pre-training experiments, optimizing resource allocation, and implementing observability for high-performance research environments.

Location: Anywhere - Remote

Company

Non-profit organization building an autonomous AI Physicist to explore theoretical frameworks and generate novel insights in fundamental physics research.

What you will do

  • Design and automate large-scale pre-training experiments across dense and MoE architectures in hybrid cloud and HPC environments.
  • Build and own CI/CD pipelines for training workflows, evaluation jobs, and internal tools with rollback, observability, and safety features.
  • Manage GPU clusters, InfiniBand networks, Slurm/Kubernetes schedulers, and container orchestration for efficient workloads.
  • Implement monitoring, logging, and alerting with Prometheus, Grafana, and ELK/EFK, and establish SLOs for infrastructure reliability.
  • Handle security through secrets management, IAM, least-privilege principles, and a zero-trust posture.
  • Collaborate with researchers and engineers to translate needs into self-service infrastructure patterns and best practices.

Requirements

  • Bachelor's or Master's in Computer Science, Engineering, or related field.
  • 3-10+ years in DevOps, Infrastructure, or SRE with hands-on Unix/Linux, kernel tuning, networking, and storage experience.
  • Expertise in Infrastructure-as-Code (Terraform, Pulumi, CloudFormation), CI/CD (GitHub Actions, GitLab CI, Jenkins), AWS, Kubernetes, Slurm.
  • Experience with monitoring stacks (Prometheus, Grafana, ELK/EFK, OpenTelemetry) and scaling GPU/HPC workloads.
  • Strong programming in Python, Go, or Rust plus Bash; cross-functional collaboration skills.
  • Mission-driven mindset suited to a fast-growing environment tackling scientific challenges; passion for physics.

Nice to have

  • Experience working with HPC vendors (Buzz, Lambda, NVIDIA DGX, CoreWeave).
  • Self-service infrastructure or internal developer platforms.
  • Deep GPU cluster management, InfiniBand, cost optimization.
  • Build tools (CMake, Bazel, Meson); AI/ML research support.

Culture & Benefits

  • Global nonprofit with a Canadian foundation and a US 501(c)(3), operating in a startup-style environment.
  • Entrepreneurial, mission-driven culture focused on automation, DevOps philosophy, and operational excellence.
  • Emphasis on collaboration, mentoring, documentation, and enabling researchers to focus on breakthroughs.
  • Opportunities to build state-of-the-art platforms powering autonomous AI research.

Hiring process

  • Submit a resume, a cover letter that names the role title and details your qualifications and vision, and references.
