HPC Engineer (AI)
Job description
TL;DR
HPC Engineer (AI): Designing and operating high-performance computing infrastructure for AI and scientific research, with an emphasis on GPU/CPU clusters and high-speed interconnects. Focus on optimizing workload scheduling, parallel storage throughput, and scaling compute environments to accelerate scientific discovery.
Location: Menlo Park or San Francisco preferred, but flexible depending on the role
Compensation: $350,000–$450,000
Company
An AI and physical sciences company building state-of-the-art models to accelerate breakthroughs across materials and energy.
What you will do
- Design, deploy, and operate large-scale GPU and CPU clusters for AI training and scientific simulation.
- Optimize high-speed interconnect fabrics (InfiniBand, RoCE) and parallel filesystems (Lustre, GPFS, WEKA).
- Manage workload scheduling and resource allocation using Slurm, Kubernetes, or similar tools.
- Implement automated cluster provisioning and configuration management via Ansible and Terraform.
- Partner with research and ML teams to profile workloads and tune hardware/software stacks.
- Establish standards for HPC operations, capacity planning, and disaster recovery strategies.
Requirements
- Experience designing and operating large-scale HPC or GPU clusters.
- Deep knowledge of InfiniBand (HDR/NDR) or RoCE fabric management and troubleshooting.
- Hands-on experience with parallel storage systems like Lustre, GPFS, or WEKA.
- Proficiency in Linux systems administration, including kernel tuning and GPU driver management.
- Experience with GPU computing environments (CUDA, NCCL, MPI) and multi-node distributed training.
- Bachelor’s degree or equivalent experience.
Nice to have
- Experience with multi-node transformer training for large-scale AI/ML workloads.
- Familiarity with AI accelerators such as TPUs or Trainium.
- Background in computational chemistry, physics simulation, or bioinformatics.
- Experience with containerized HPC environments (Singularity, Apptainer).
- Contributions to open-source HPC tooling.
Culture & Benefits
- Visa sponsorship is provided with full legal support.
- Fast-paced environment focused on defining the frontier of AI in the physical world.
- Opportunity to work with world-class investors and top-tier scientists.
- Direct impact on breakthroughs in materials and energy science.