AI Engineer (Model Performance)
Job description
TL;DR
AI Engineer (Model Performance): Own the speed, cost, and reliability of the model inference stack and build fine-tuning infrastructure, with an emphasis on LLM serving optimization and GPU efficiency. Focus on reducing latency through quantization and speculative decoding, and on creating repeatable pipelines for model distillation and preference tuning.
Location: Hybrid in San Francisco
Company
AI assistant that captures, summarizes, and organizes meeting moments to eliminate note-taking overhead.
What you will do
- Optimize model inference for speed and cost using speculative decoding, quantization, and batching strategies.
- Build repeatable fine-tuning infrastructure for distillation, adapter training, and DPO.
- Benchmark quantization (e.g., FP8) across GPU families to maximize speedup with minimal quality loss.
- Evaluate and tune serving frameworks like vLLM and SGLang.
- Manage GPU spend by selecting hardware based on workload concurrency and latency needs.
- Debug production inference issues and quality regressions in multimodal pipelines.
Requirements
- Deep experience tuning LLM serving frameworks (vLLM, SGLang, TensorRT-LLM).
- Hands-on expertise in weight and activation quantization.
- Production experience with LoRA/QLoRA SFT and training frameworks like Axolotl or torchtune.
- Strong Python skills for infrastructure and benchmarking.
- Proficiency in GPU profiling and performance analysis.
- Must be based in or able to work hybrid in San Francisco.
Nice to have
- GPU infrastructure cost modeling.
- Experience with multimodal models (audio/vision).
- Familiarity with Modal or Ray Serve.
- Knowledge of audio processing (codecs, sample rates).
- Experience building internal developer tooling.
Culture & Benefits
- High-impact role in a growing product company.
- Async-first culture using Slack, Notion, and Loom.
- Collaborative environment working closely with the CEO.
- Competitive compensation and benefits.
Hiring process
- Interviews with the entire team.
- Quick turnaround (typically less than a week).
- Requires a write-up/demo of optimization work and a self-interview.