AI Inference Engineer - Model Optimization & Deployment (AI)

$242,000 - $290,000

Work format

hybrid

Employment type

full-time

Grade

senior

Job description

TL;DR

AI Inference Engineer - Model Optimization & Deployment (AI): Optimizing and deploying large-scale models (LLMs, VLMs) for power- and thermal-constrained vehicle SoCs, with an emphasis on quantization, mixed-precision inference, and custom CUDA kernels. The role focuses on architecting TensorRT pipelines, writing concurrent C++/Python inference code, and ensuring real-time, deterministic execution on edge devices.

Hybrid: Foster City, CA / San Diego, CA / Seattle, WA

$242,000 - $290,000 a year

Company

hirify.global is developing the first ground-up, fully autonomous vehicle fleet and the supporting ecosystem at the intersection of robotics, machine learning, and design.

What you will do

Optimize large-scale models (LLMs, VLMs) using advanced quantization (PTQ, QAT), mixed-precision inference, and parameter-efficient fine-tuning (LoRA, QLoRA).
Architect and implement model conversion/compilation pipelines with TensorRT and TensorRT-LLM for edge deployment.
Perform parity checking, accuracy recovery, and latency benchmarking between PyTorch and compiled edge binaries.
Write and optimize custom CUDA kernels and TensorRT Plugins for AI accelerators.
Develop production-level, concurrent, memory-safe C++ and Python code for real-time inference on vehicle SoCs.

Requirements

Deep expertise in model quantization (PTQ, QAT) and mixed-precision inference (INT8, FP8, INT4, BF16/FP16).
Proven experience optimizing LLMs/VLMs with KV-cache management (PagedAttention), speculative decoding, and FlashAttention.
Extensive experience with TensorRT/TensorRT-LLM pipelines and parity/latency benchmarking.
Proficiency in low-level programming: custom CUDA kernels and TensorRT Plugins.
Production-level C++ (C++14/17/20) and Python for concurrent, real-time edge inference.

Nice to have

Experience with distributed training (PyTorch Distributed, DeepSpeed, Megatron-LM).
Familiarity with autonomous driving perception (3D detection, BEV, Occupancy Networks, multi-modal sensors).
Understanding of end-to-end autonomous driving (VLA models, closed-loop simulation).

Culture & Benefits

Comprehensive benefits: paid time off (sick leave, vacation, bereavement), unpaid time off.
Equity: hirify.global Stock Appreciation Rights, Amazon RSUs; possible sign-on bonus.
Insurance: health, long-term care, long-term/short-term disability, life insurance.
Fast-moving, execution-oriented team focused on innovation in autonomous mobility.
