AI Engineer (Model Performance)

Work format

hybrid

Employment type

full-time

Grade

senior

English

Country

Vacancy from Hirify Global, a list of international tech companies

Job description

TL;DR

AI Engineer (Model Performance): Own the speed, cost, and reliability of the model inference stack and build fine-tuning infrastructure, with an emphasis on LLM serving optimization and GPU efficiency. The focus is on reducing latency through quantization and speculative decoding, and on creating repeatable pipelines for model distillation and preference tuning.

Location: Hybrid in San Francisco

Company

AI assistant that captures, summarizes, and organizes meeting moments to eliminate note-taking overhead.

What you will do

Optimize model inference for speed and cost using speculative decoding, quantization, and batching strategies.
Build repeatable fine-tuning infrastructure for distillation, adapter training, and DPO.
Benchmark quantization (e.g., FP8) across GPU families to maximize speedup with minimal quality loss.
Evaluate and tune serving frameworks like vLLM and SGLang.
Manage GPU spend by selecting hardware based on workload concurrency and latency needs.
Debug production inference issues and quality regressions in multimodal pipelines.

Requirements

Deep experience tuning LLM serving frameworks (vLLM, SGLang, TensorRT-LLM).
Hands-on expertise in weight and activation quantization.
Production experience with LoRA/QLoRA SFT and training frameworks like Axolotl or torchtune.
Strong Python skills for infrastructure and benchmarking.
Proficiency in GPU profiling and performance analysis.
Must be based in or able to work hybrid in San Francisco.

Nice to have

GPU infrastructure cost modeling.
Experience with multimodal models (audio/vision).
Familiarity with Modal or Ray Serve.
Knowledge of audio processing (codecs, sample rates).
Experience building internal developer tooling.

Culture & Benefits

High-impact role in a growing product company.
Async-first culture using Slack, Notion, and Loom.
Collaborative environment working closely with the CEO.
Competitive compensation and benefits.

Hiring process

Interviews with the entire team.
Quick turnaround (typically less than a week).
Requires a write-up/demo of optimization work and a hirify.global self-interview.
