Member Of Technical Staff - Training Platform (AI Infrastructure)
Мэтч & Сопровод
Для мэтча с этой вакансией нужен Plus
Описание вакансии
TL;DR
Member of Technical Staff - Training Platform (AI Infrastructure): Building a hosted training platform for LoRA and full fine-tuning runs on managed GPU clusters with an accent on Kubernetes orchestration and Python control-plane development. Focus on implementing scheduling for heterogeneous hardware, developing real-time monitoring tools, and productizing new RL training capabilities.
Location: San Francisco or Remote. Full visa sponsorship and relocation support provided.
Salary: $150K–$300K with significant equity
Company
is building the open superintelligence stack to enable researchers and enterprises to run end-to-end reinforcement learning at frontier scale.
What you will do
- Design and operate Kubernetes-based training and inference orchestration across multi-cluster, multi-cloud GPU fleets.
- Build Helm charts for reproducible training stacks and develop Python control-plane agents to sync clusters.
- Implement scheduling and autoscaling for H100/H200/B200 hardware using KEDA, LeaderWorkerSet, and gang scheduling.
- Develop developer-facing platform surfaces for job submission and live run monitoring using FastAPI, Next.js, and React.
- Build real-time monitoring tools for streaming logs, step-level metrics, and failure analysis.
- Interface with RL trainers and inference servers to productize new training capabilities and model architectures.
Requirements
- Strong working knowledge of the AI stack: finetuning (LoRA, RLHF), inference engines (vLLM, SGLang), and GPU hardware (NVLink, memory hierarchy).
- Deep Kubernetes operations experience including Helm, CRDs, Operators, KEDA, and GPU operator.
- Proficient Python backend development (FastAPI, async, SQLAlchemy) and modern frontend skills (TypeScript, React, Next.js).
- Experience with GCP (GKE, GCS, Cloud Run) and infrastructure automation via Terraform and Ansible.
- Strong understanding of distributed training fundamentals (data/tensor/pipeline parallelism, NCCL).
- Experience building developer tools, dashboards, and live-monitoring UIs.
Culture & Benefits
- Competitive cash compensation ($150K–$300K) with significant equity.
- Flexible work arrangement with the option for remote work or San Francisco office.
- Full visa sponsorship and relocation support.
- Professional development budget for courses and conferences.
- Regular team off-sites and a culture of contributing to the open-source AI community.
Будьте осторожны: если работодатель просит войти в их систему, используя iCloud/Google, прислать код/пароль, запустить код/ПО, не делайте этого - это мошенники. Обязательно жмите "Пожаловаться" или пишите в поддержку. Подробнее в гайде →