Member Of Technical Staff - Training Platform (AI Infrastructure)

150 000 - 300 000$

Формат работы

remote/hybrid/onsite

Тип работы

fulltime

Английский

Страна

Релокация

Вакансия из Hirify Global, списка международных tech-компаний
Для мэтча и отклика нужен Plus

Мэтч & Сопровод

Для мэтча с этой вакансией нужен Plus

Описание вакансии

Текст:

TL;DR

Member of Technical Staff - Training Platform (AI Infrastructure): Building a hosted training platform for LoRA and full fine-tuning runs on managed GPU clusters with an accent on Kubernetes orchestration and Python control-plane development. Focus on implementing scheduling for heterogeneous hardware, developing real-time monitoring tools, and productizing new RL training capabilities.

Location: San Francisco or Remote. Full visa sponsorship and relocation support provided.

Salary: $150K–$300K with significant equity

Company

hirify.global is building the open superintelligence stack to enable researchers and enterprises to run end-to-end reinforcement learning at frontier scale.

What you will do

Design and operate Kubernetes-based training and inference orchestration across multi-cluster, multi-cloud GPU fleets.
Build Helm charts for reproducible training stacks and develop Python control-plane agents to sync clusters.
Implement scheduling and autoscaling for H100/H200/B200 hardware using KEDA, LeaderWorkerSet, and gang scheduling.
Develop developer-facing platform surfaces for job submission and live run monitoring using FastAPI, Next.js, and React.
Build real-time monitoring tools for streaming logs, step-level metrics, and failure analysis.
Interface with RL trainers and inference servers to productize new training capabilities and model architectures.

Requirements

Strong working knowledge of the AI stack: finetuning (LoRA, RLHF), inference engines (vLLM, SGLang), and GPU hardware (NVLink, memory hierarchy).
Deep Kubernetes operations experience including Helm, CRDs, Operators, KEDA, and GPU operator.
Proficient Python backend development (FastAPI, async, SQLAlchemy) and modern frontend skills (TypeScript, React, Next.js).
Experience with GCP (GKE, GCS, Cloud Run) and infrastructure automation via Terraform and Ansible.
Strong understanding of distributed training fundamentals (data/tensor/pipeline parallelism, NCCL).
Experience building developer tools, dashboards, and live-monitoring UIs.

Culture & Benefits

Competitive cash compensation ($150K–$300K) with significant equity.
Flexible work arrangement with the option for remote work or San Francisco office.
Full visa sponsorship and relocation support.
Professional development budget for courses and conferences.
Regular team off-sites and a culture of contributing to the open-source AI community.

Будьте осторожны: если работодатель просит войти в их систему, используя iCloud/Google, прислать код/пароль, запустить код/ПО, не делайте этого - это мошенники. Обязательно жмите "Пожаловаться" или пишите в поддержку. Подробнее в гайде →