Member of Technical Staff, DevOps/Infrastructure Engineering (AI)

Work format

remote (Global)

Employment type

full-time

Grade

middle/senior

Job description

TL;DR

Member of Technical Staff, DevOps / Infrastructure Engineering (AI): Architect hybrid cloud and HPC infrastructure for large-scale model training workflows, with an emphasis on automation, CI/CD pipelines, and GPU cluster management. The role focuses on designing reliable systems for multi-week pre-training experiments, optimizing resource allocation, and implementing observability for high-performance research environments.

Location: Anywhere - Remote

Company

Non-profit organization building an autonomous AI Physicist to explore theoretical frameworks and generate novel insights in fundamental physics research.

What you will do

Design and automate large-scale pre-training experiments across dense and MoE architectures in hybrid cloud and HPC environments.
Build and own CI/CD pipelines for training workflows, evaluation jobs, and internal tools with rollback, observability, and safety features.
Manage GPU clusters, InfiniBand networks, Slurm/Kubernetes schedulers, and container orchestration so that workloads run efficiently.
Implement monitoring, logging, and alerting with Prometheus, Grafana, and ELK/EFK, and establish SLOs for infrastructure reliability.
Handle security through secrets management, IAM, least-privilege principles, and a zero-trust posture.
Collaborate with researchers and engineers to translate needs into self-service infrastructure patterns and best practices.

Requirements

Bachelor's or Master's in Computer Science, Engineering, or related field.
3-10+ years in DevOps, Infrastructure, or SRE roles, with hands-on experience in Unix/Linux, kernel tuning, networking, and storage.
Expertise in Infrastructure-as-Code (Terraform, Pulumi, CloudFormation), CI/CD (GitHub Actions, GitLab CI, Jenkins), AWS, Kubernetes, Slurm.
Experience with monitoring stacks (Prometheus, Grafana, ELK/EFK, OpenTelemetry) and scaling GPU/HPC workloads.
Strong programming skills in Python, Go, or Rust, plus Bash; cross-functional collaboration skills.
Mission-driven mindset suited to a fast-growing environment tackling scientific challenges; passion for physics.

Nice to have

Experience working with HPC vendors (Buzz, Lambda, NVIDIA DGX, CoreWeave).
Experience with self-service infrastructure or internal developer platforms.
Deep experience with GPU cluster management, InfiniBand, and cost optimization.
Familiarity with build tools (CMake, Bazel, Meson); experience supporting AI/ML research.

Culture & Benefits

Global nonprofit with a Canadian foundation and a US 501(c)(3), operating in a startup-style environment.
Entrepreneurial, mission-driven culture focused on automation, DevOps philosophy, and operational excellence.
Emphasis on collaboration, mentoring, documentation, and enabling researchers to focus on breakthroughs.
Opportunities to build state-of-the-art platforms powering autonomous AI research.

Hiring process

Submit a resume, a cover letter that states the role title and details your qualifications and vision, and references.
