Company hidden
3 days ago

HPC Engineer (AI)

$240,000 – $356,000
Work format: hybrid
Employment type: full-time
Grade: senior
English: B2
Country: US

Job description

TL;DR

HPC Engineer (AI): Deploying and configuring large-scale HPC clusters for AI workloads, with an emphasis on logical provisioning, networking fabrics, and system stability. Focus on optimizing RDMA/NCCL environments, troubleshooting GPUDirect connectivity, and scaling cluster operations to thousands of nodes.

Location: Hybrid; must be based in San Francisco, San Jose, or Bellevue (WA), with four days per week in the office.

Salary: $240,000 – $356,000 per year

Company

A leader in AI cloud infrastructure providing GPU compute for AI researchers and enterprises.

What you will do

  • Remotely deploy and configure large-scale HPC clusters for AI workloads, scaling up to many thousands of nodes.
  • Install and configure operating systems, firmware, software, and networking using both manual and automated tools.
  • Troubleshoot and resolve HPC cluster issues in close collaboration with on-site physical deployment teams.
  • Provide detailed requirements to other engineering teams to improve system simplicity, stability, and operational efficiency.
  • Create and maintain Standard Operating Procedures (SOPs) and provide regular project updates.
  • Mentor and assist less experienced team members.

Requirements

  • 5+ years of experience deploying and configuring HPC clusters for AI workloads.
  • Expertise in SFP+ fiber, InfiniBand (IB), and 100 GbE network fabrics.
  • Deep knowledge of Ethernet, switching, power infrastructure, GPUDirect, RDMA, NCCL, and Horovod environments.
  • Proficiency in Linux-based compute nodes, firmware updates, and driver installation.
  • Experience with Slurm, Kubernetes, or other job scheduling systems.
  • Flexibility to travel to North American data centers as on-site needs arise.

Nice to have

  • Experience with ML/DL frameworks (PyTorch, TensorFlow) and benchmarking tools (DeepSpeed, MLPerf).
  • Experience with containerization technologies such as Docker and Kubernetes.
  • Knowledge of GPU acceleration, virtualization, and cloud computing.
  • Bachelor's degree in EE, CS, Physics, Mathematics, or equivalent work experience.

Culture & Benefits

  • Generous cash and equity compensation.
  • Comprehensive health, dental, and vision coverage for employees and dependents.
  • 401(k) plan with a 2% company match for US employees.
  • Flexible paid time off plan.
  • Wellness and commuter stipends for select roles.
