Company hidden
6 days ago

AI Inference Engineer - Model Optimization & Deployment (AI)

$242,000 - $290,000
Work format
hybrid
Employment type
fulltime
Grade
senior
English
b2
Country
US
Listing from Hirify Global, a list of international tech companies

Job description

TL;DR

AI Inference Engineer - Model Optimization & Deployment (AI): Optimize and deploy large-scale models (LLMs, VLMs) on power- and thermal-constrained vehicle SoCs, with an emphasis on quantization, mixed-precision inference, and custom CUDA kernels. The role focuses on architecting TensorRT pipelines, writing concurrent C++/Python inference code, and ensuring real-time, deterministic execution on edge devices.

Hybrid: Foster City, CA / San Diego, CA / Seattle, WA

$242,000 - $290,000 a year

Company

hirify.global is developing the first ground-up, fully autonomous vehicle fleet and the supporting ecosystem at the intersection of robotics, machine learning, and design.

What you will do

  • Optimize large-scale models (LLMs, VLMs) using advanced quantization (PTQ, QAT), mixed-precision inference, and parameter-efficient fine-tuning (LoRA, QLoRA).
  • Architect and implement model conversion/compilation pipelines with TensorRT and TensorRT-LLM for edge deployment.
  • Perform parity checking, accuracy recovery, and latency benchmarking between PyTorch and compiled edge binaries.
  • Write and optimize custom CUDA kernels and TensorRT Plugins for AI accelerators.
  • Develop production-level, concurrent, memory-safe C++ and Python code for real-time inference on vehicle SoCs.

Requirements

  • Deep expertise in model quantization (PTQ, QAT) and mixed-precision inference (INT8, FP8, INT4, BF16/FP16).
  • Proven experience optimizing LLMs/VLMs with KV-cache management (PagedAttention), speculative decoding, and FlashAttention.
  • Extensive experience with TensorRT/TensorRT-LLM pipelines and parity/latency benchmarking.
  • Proficiency in low-level programming: custom CUDA kernels and TensorRT Plugins.
  • Production-level C++ (14/17/20) and Python for concurrent, real-time edge inference.

Nice to have

  • Experience with distributed training (PyTorch Distributed, DeepSpeed, Megatron-LM).
  • Familiarity with autonomous driving perception (3D detection, BEV, Occupancy Networks, multi-modal sensors).
  • Understanding of end-to-end autonomous driving (VLA models, closed-loop simulation).

Culture & Benefits

  • Comprehensive benefits: paid time off (sick leave, vacation, bereavement), unpaid time off.
  • Equity: hirify.global Stock Appreciation Rights, Amazon RSUs; possible sign-on bonus.
  • Insurance: health, long-term care, long-term/short-term disability, life insurance.
  • Fast-moving, execution-oriented team focused on innovation in autonomous mobility.
