Distributed Systems Engineer, Data & Inference Platform (AI)

Work format

hybrid

Employment type

full-time

Grade

senior

English

Country

Job from Hirify Global, a list of international tech companies

Job description

TL;DR

Distributed Systems Engineer, Data & Inference Platform (AI): Build and operate distributed inference services for LLMs at scale, along with large-scale data pipelines, with an emphasis on throughput, latency optimization, and production reliability. The role focuses on designing cost-efficient inference systems, debugging elusive production failures, and productionizing experimental workloads from researchers and ML engineers.

Location: Hybrid in San Francisco (Bay Area)

Company

The company is building efficient intelligence that evolves in real time: flexible, personalized AI systems accessible to everyone, achieved through talent density and continual adaptation.

What you will do

Design and operate distributed inference systems for LLMs, optimizing throughput, latency, and cost through batching, scheduling, KV-cache management, and autoscaling across GPU fleets.
Build large-scale data pipelines using Ray Data, Spark, or equivalents to ingest, transform, and curate datasets for training and evaluation.
Debug production failures such as tail-latency regressions, stragglers, GPU memory issues, and data corruption; define SLOs, build observability, and handle on-call.
Partner with researchers and ML engineers to scale experimental workloads from single nodes to reliable production systems.

Requirements

5+ years building and operating distributed systems in production
Deep experience with large-scale data/compute frameworks (Ray, Spark, Flink, Beam, Dask)
Strong fluency in Python and a systems language (Go, Rust, or C++)
Working knowledge of the GPU stack: CUDA, NCCL, mixed precision, memory layout
Experience with Kubernetes infrastructure, including custom operators and schedulers
Track record of owning production incidents end-to-end

Nice to have

Hands-on experience with LLM inference engines (vLLM, SGLang, TensorRT-LLM, TGI)
Experience with modern lakehouse formats (Iceberg, Delta, Hudi)
Open-source contributions to relevant projects

Culture & Benefits

In-person collaboration in the Bay Area, alongside a distributed, global-first team and offsites
hirify.global Passport: annual travel stipend to explore new countries
Lunch stipend for weekly take-out or grocery delivery
Comprehensive medical benefits and generous paid time off
