HPC Engineer (AI)
Job description
TL;DR
HPC Engineer (AI): Deploy and configure large-scale HPC clusters for AI workloads, with an emphasis on logical provisioning, networking fabrics, and system stability. Focus on optimizing RDMA/NCCL environments, troubleshooting GPUDirect connectivity, and scaling cluster operations to thousands of nodes.
Location: Hybrid; must be based in San Francisco, San Jose, or Bellevue (WA), with four days per week in the office.
Salary: $240,000 – $356,000 per year
Company
A leader in AI cloud infrastructure providing GPU compute for AI researchers and enterprises.
What you will do
- Remotely deploy and configure large-scale HPC clusters for AI workloads, scaling up to many thousands of nodes.
- Install and configure operating systems, firmware, software, and networking using both manual methods and automation tools.
- Troubleshoot and resolve HPC cluster issues in close collaboration with on-site physical deployment teams.
- Provide detailed requirements to other engineering teams to simplify systems and improve stability and operational efficiency.
- Create and maintain Standard Operating Procedures (SOPs) and provide regular project updates.
- Mentor and assist less experienced team members.
Requirements
- 5+ years of experience deploying and configuring HPC clusters for AI workloads.
- Expertise in SFP+ fiber, InfiniBand (IB), and 100 GbE network fabrics.
- Deep knowledge of Ethernet, switching, power infrastructure, GPUDirect, RDMA, NCCL, and Horovod environments.
- Proficiency in Linux-based compute nodes, firmware updates, and driver installation.
- Experience with Slurm, Kubernetes, or other job scheduling systems.
- Flexibility to travel to North American data centers as on-site needs arise.
Nice to have
- Experience with ML/DL frameworks (PyTorch, TensorFlow) and benchmarking tools (DeepSpeed, MLPerf).
- Experience with containerization technologies such as Docker and Kubernetes.
- Knowledge of GPU acceleration, virtualization, and cloud computing.
- Bachelor's degree in EE, CS, Physics, Mathematics, or equivalent work experience.
Culture & Benefits
- Generous cash and equity compensation.
- Comprehensive health, dental, and vision coverage for employees and dependents.
- 401(k) plan with a 2% company match for US employees.
- Flexible paid time off plan.
- Wellness and commuter stipends for select roles.