Company hidden
7 months ago

Software Engineer (Supercomputing)

$350,000 - $475,000
Work format: onsite
Employment type: fulltime
Grade: middle/senior
English: B2
Country: US

Job description

TL;DR

Software Engineer (Supercomputing): Design, build, and operate GPU supercomputing environments for large-scale AI training and inference, with an emphasis on high performance, reliability, and cost efficiency. The role focuses on cluster management, orchestration, and operational metrics to support fast, scalable AI research.

Location: San Francisco, California, USA

Salary: $350,000 - $475,000 USD per year

Company

hirify.global advances collaborative general intelligence by building widely used AI products and open-source projects.

What you will do

  • Operate and automate large GPU clusters including provisioning, imaging, and capacity planning.
  • Develop software to abstract cluster management and provide unified interfaces for training and inference.
  • Extend scheduling/orchestration systems like Kubernetes or Slurm for topology-aware placement and multi-tenancy.
  • Monitor and improve operational metrics such as speed, reliability, and error recovery.
  • Build reliable storage and artifact paths for datasets, checkpoints, and logs with clear retention and lineage.
  • Collaborate with researchers to unblock scale runs and advise on parallelism and performance trade-offs.

Requirements

  • Location: Must be based in San Francisco, California.
  • Bachelor’s degree or equivalent experience in computer science or engineering.
  • Proficiency in backend languages such as Python or Rust.
  • Experience operating large-scale clusters and container orchestration systems (Kubernetes, Slurm).
  • Strong collaboration skills and ability to own projects end-to-end.
  • English: B2 proficiency or higher required.

Nice to have

  • Strong systems background including Linux, networking, and infrastructure-as-code.
  • Familiarity with CUDA/NCCL and performance profiling for distributed training/inference.
  • Experience supporting large-scale model training or inference environments.
  • Understanding of deep learning frameworks like PyTorch, TensorFlow, and JAX.
  • Experience working in fast-paced environments balancing care with urgency.

Culture & Benefits

  • Visa sponsorship and relocation support available.
  • Generous health, dental, and vision benefits.
  • Unlimited PTO and paid parental leave.
