Назад
Company hidden
2 дня назад

Member Of Technical Staff - Training Platform (AI Infrastructure)

150 000 - 300 000$
Формат работы
remote/hybrid/onsite
Тип работы
fulltime
Английский
b2
Страна
US
Релокация
US
Вакансия из списка Hirify.GlobalВакансия из Hirify Global, списка международных tech-компаний
Для мэтча и отклика нужен Plus

Мэтч & Сопровод

Для мэтча с этой вакансией нужен Plus

Описание вакансии

Текст:
/

TL;DR

Member of Technical Staff - Training Platform (AI Infrastructure): Building a hosted training platform for LoRA and full fine-tuning runs on managed GPU clusters with an accent on Kubernetes orchestration and Python control-plane development. Focus on implementing scheduling for heterogeneous hardware, developing real-time monitoring tools, and productizing new RL training capabilities.

Location: San Francisco or Remote. Full visa sponsorship and relocation support provided.

Salary: $150K–$300K with significant equity

Company

hirify.global is building the open superintelligence stack to enable researchers and enterprises to run end-to-end reinforcement learning at frontier scale.

What you will do

  • Design and operate Kubernetes-based training and inference orchestration across multi-cluster, multi-cloud GPU fleets.
  • Build Helm charts for reproducible training stacks and develop Python control-plane agents to sync clusters.
  • Implement scheduling and autoscaling for H100/H200/B200 hardware using KEDA, LeaderWorkerSet, and gang scheduling.
  • Develop developer-facing platform surfaces for job submission and live run monitoring using FastAPI, Next.js, and React.
  • Build real-time monitoring tools for streaming logs, step-level metrics, and failure analysis.
  • Interface with RL trainers and inference servers to productize new training capabilities and model architectures.

Requirements

  • Strong working knowledge of the AI stack: finetuning (LoRA, RLHF), inference engines (vLLM, SGLang), and GPU hardware (NVLink, memory hierarchy).
  • Deep Kubernetes operations experience including Helm, CRDs, Operators, KEDA, and GPU operator.
  • Proficient Python backend development (FastAPI, async, SQLAlchemy) and modern frontend skills (TypeScript, React, Next.js).
  • Experience with GCP (GKE, GCS, Cloud Run) and infrastructure automation via Terraform and Ansible.
  • Strong understanding of distributed training fundamentals (data/tensor/pipeline parallelism, NCCL).
  • Experience building developer tools, dashboards, and live-monitoring UIs.

Culture & Benefits

  • Competitive cash compensation ($150K–$300K) with significant equity.
  • Flexible work arrangement with the option for remote work or San Francisco office.
  • Full visa sponsorship and relocation support.
  • Professional development budget for courses and conferences.
  • Regular team off-sites and a culture of contributing to the open-source AI community.

Будьте осторожны: если работодатель просит войти в их систему, используя iCloud/Google, прислать код/пароль, запустить код/ПО, не делайте этого - это мошенники. Обязательно жмите "Пожаловаться" или пишите в поддержку. Подробнее в гайде →