Company hidden
4 days ago

Distributed Systems Engineer, Data & Inference Platform (AI)

Work format
hybrid
Employment type
full-time
Grade
senior
English
B2
Country
US

Job description

TL;DR

Distributed Systems Engineer, Data & Inference Platform (AI): Build and operate distributed LLM inference services and large-scale data pipelines, with an emphasis on throughput, latency optimization, and production reliability. The role focuses on designing cost-efficient inference systems, debugging elusive production failures, and productionizing experimental workloads from researchers and ML engineers.

Location: Hybrid in San Francisco, Bay Area

Company

The company builds efficient intelligence that evolves in real time: flexible, personalized AI systems accessible to everyone, achieved through talent density and continual adaptation.

What you will do

  • Design and operate distributed inference systems for LLMs, optimizing throughput, latency, and cost through batching, scheduling, KV-cache management, and autoscaling across GPU fleets.
  • Build large-scale data pipelines using Ray Data, Spark, or equivalents to ingest, transform, and curate datasets for training and evaluation.
  • Debug production failure modes such as tail-latency regressions, stragglers, GPU memory issues, and data corruption; define SLOs, build observability, and take part in on-call.
  • Partner with researchers and ML engineers to scale experimental workloads from single nodes to reliable production systems.

Requirements

  • 5+ years building and operating distributed systems in production
  • Deep experience with large-scale data/compute frameworks (Ray, Spark, Flink, Beam, Dask)
  • Strong fluency in Python and in a systems language (Go, Rust, C++)
  • Working knowledge of the GPU stack: CUDA, NCCL, mixed precision, memory layout
  • Experience with Kubernetes infrastructure, including custom operators and schedulers
  • Track record of owning production incidents end-to-end

Nice to have

  • Hands-on experience with LLM inference engines (vLLM, SGLang, TensorRT-LLM, TGI)
  • Experience with modern lakehouse formats (Iceberg, Delta, Hudi)
  • Open-source contributions to relevant projects

Culture & Benefits

  • In-person collaboration in the Bay Area, alongside a distributed, global-first team and offsites
  • hirify.global Passport: annual travel stipend to explore new countries
  • Lunch stipend for weekly take-out or grocery delivery
  • Comprehensive medical benefits and generous paid time off
