HPC Engineer (AI)

$240,000 – $356,000 per year

Work format

hybrid

Employment type

full-time

Level

senior


Listing from Hirify Global, a directory of international tech companies.

Job description

TL;DR

HPC Engineer (AI): deploy and configure large-scale HPC clusters for AI workloads, with an emphasis on logical provisioning, network fabrics, and system stability. The focus is on optimizing RDMA/NCCL environments, troubleshooting GPUDirect connectivity, and scaling cluster operations to thousands of nodes.

Location: Hybrid; must be based in San Francisco, San Jose, or Bellevue (WA) and work in the office 4 days per week.

Salary: $240,000 – $356,000 per year

Company

A leader in AI cloud infrastructure providing GPU compute for AI researchers and enterprises.

What you will do

Remotely deploy and configure large-scale HPC clusters for AI workloads, scaling up to many thousands of nodes.
Install and configure operating systems, firmware, software, and networking using both manual methods and automation tools.
Troubleshoot and resolve HPC cluster issues in close collaboration with on-site physical deployment teams.
Provide detailed requirements to other engineering teams to simplify systems and improve stability and operational efficiency.
Create and maintain Standard Operating Procedures (SOPs) and provide regular project updates.
Mentor and assist less experienced team members.

Requirements

5+ years of experience deploying and configuring HPC clusters for AI workloads.
Expertise in SFP+ fiber, InfiniBand (IB), and 100 GbE network fabrics.
Deep knowledge of Ethernet, switching, power infrastructure, GPUDirect, RDMA, NCCL, and Horovod environments.
Proficiency in Linux-based compute nodes, firmware updates, and driver installation.
Experience with SLURM, Kubernetes, or other job scheduling systems.
Flexibility to travel to North American data centers as on-site needs arise.

Nice to have

Experience with ML/DL frameworks (PyTorch, TensorFlow) and benchmarking tools (DeepSpeed, MLPerf).
Experience with containerization technologies such as Docker and Kubernetes.
Knowledge of GPU acceleration, virtualization, and cloud computing.
Bachelor's degree in EE, CS, Physics, Mathematics, or equivalent work experience.

Culture & Benefits

Generous cash and equity compensation.
Comprehensive health, dental, and vision coverage for employees and dependents.
401(k) plan with a 2% company match for US employees.
Flexible paid time off plan.
Wellness and commuter stipends for select roles.


