Member of Technical Staff, DevOps/Infrastructure Engineering (AI)
Job description
TL;DR
Member of Technical Staff, DevOps / Infrastructure Engineering (AI): Architecting hybrid cloud and HPC infrastructure for large-scale model training workflows, with an emphasis on automation, CI/CD pipelines, and GPU cluster management. Focus on designing reliable systems for multi-week pre-training experiments, optimizing resource allocation, and implementing observability for high-performance research environments.
Location: Anywhere - Remote
Company
Non-profit organization building an autonomous AI Physicist to explore theoretical frameworks and generate novel insights in fundamental physics research.
What you will do
- Design and automate large-scale pre-training experiments across dense and MoE architectures in hybrid cloud and HPC environments.
- Build and own CI/CD pipelines for training workflows, evaluation jobs, and internal tools with rollback, observability, and safety features.
- Manage GPU clusters, InfiniBand networks, Slurm/Kubernetes schedulers, and container orchestration for efficient workloads.
- Implement monitoring, logging, and alerting with Prometheus, Grafana, and ELK/EFK, and establish SLOs for infrastructure reliability.
- Handle security through secrets management, IAM, least-privilege principles, and a zero-trust posture.
- Collaborate with researchers and engineers to translate needs into self-service infrastructure patterns and best practices.
Requirements
- Bachelor's or Master's in Computer Science, Engineering, or related field.
- 3-10+ years in DevOps, Infrastructure, or SRE with hands-on Unix/Linux, kernel tuning, networking, and storage experience.
- Expertise in Infrastructure-as-Code (Terraform, Pulumi, CloudFormation), CI/CD (GitHub Actions, GitLab CI, Jenkins), AWS, Kubernetes, Slurm.
- Experience with monitoring stacks (Prometheus, Grafana, ELK/EFK, OpenTelemetry) and scaling GPU/HPC workloads.
- Strong programming in Python, Go, or Rust plus Bash; cross-functional collaboration skills.
- Mission-driven mindset suited to a fast-growing environment tackling scientific challenges; passion for physics.
Nice to have
- Work with HPC vendors (Buzz, Lambda, NVIDIA DGX, CoreWeave).
- Self-service infrastructure or internal developer platforms.
- Deep GPU cluster management, InfiniBand, cost optimization.
- Build tools (CMake, Bazel, Meson); AI/ML research support.
Culture & Benefits
- Global nonprofit with a Canadian foundation and a US 501(c)(3), operating in a startup-style environment.
- Entrepreneurial, mission-driven culture focused on automation, DevOps philosophy, and operational excellence.
- Emphasis on collaboration, mentoring, documentation, and enabling researchers to focus on breakthroughs.
- Opportunities to build state-of-the-art platforms powering autonomous AI research.
Hiring process
- Submit a resume, a cover letter that states the role title and details your qualifications and vision, and references.