Manager, HPC Storage Engineer (AI)
Job description
TL;DR
Manager, HPC Storage Engineer (AI): Build and operate global distributed storage platforms for AI training and inference, with an emphasis on high-performance shared filesystems and low-latency data paths. The role focuses on designing SAN/NFS architectures, optimizing NVMe/RDMA performance, and scaling storage infrastructure for GPU clusters.
Location: Remote, USA
Salary: $150,000 - $240,000 USD
Company
The company is pioneering the future of AI and machine learning, offering cutting-edge cloud infrastructure for full-stack AI applications.
What you will do
- Define and evolve the global distributed storage architecture supporting training, inference, and dataset access at scale.
- Manage and grow a team of storage and systems engineers, setting clear technical direction and operational standards.
- Design and operate large-scale SAN and NFS deployments, specifically leveraging VAST Data and parallel filesystems like Lustre.
- Drive end-to-end performance optimization from NAND/NVMe media through controllers, networking, and client access patterns.
- Evaluate and deploy cutting-edge capabilities such as NFS over RDMA and GPU Direct Storage (GDS).
- Partner with Datacenter Networking, GPU Platform, and SRE teams to ensure storage systems meet AI workload requirements.
Requirements
- 3+ years of experience managing storage, systems, or infrastructure engineering teams in production.
- 8+ years of experience designing and operating multi-petabyte-scale distributed storage systems (SAN/NFS).
- Hands-on experience deploying and operating VAST Data in production environments is required.
- Experience with parallel filesystems such as Lustre, GPFS, or BeeGFS.
- Deep understanding of NAND, NVMe, PCIe, and Linux internals (I/O scheduling, memory management, and performance tuning).
- Must be based in the USA.
Nice to have
- Experience supporting AI training pipelines, large-scale model checkpointing, and dataset streaming.
- Familiarity with RDMA fabrics and collaboration with datacenter networking teams.
- Experience designing storage for multi-tenant isolation and secure data access.
- Background in hyperscale, HPC, or AI-focused infrastructure environments.
Culture & Benefits
- Meaningful equity and stock options for all employees.
- 100% coverage for medical, dental, and vision plans.
- Flexible PTO to ensure work-life balance.
- Remote-first culture with a collaborative team that works together over Slack.