5 hours ago

Senior or Staff ML Systems Engineer (LLM)

$200,000 - $275,000
Work format
remote (USA only)
Employment type
full-time
Grade
senior
English
B2
Country
US/Canada

Job description

Senior or Staff ML Systems Engineer, LLMs

Company

TRM Labs

Conditions

1 day ago, Lead, Salary: 200K - 275K, North America, Remote, Full Time, AI Jobs by TRM Labs

Skills

Tracing, Agent Evaluation, Vector Database, Feature Store, LangChain, Terraform, Monitoring, Infrastructure, Observability, CI/CD, MLOps, LLM, Model Versioning, Model Registry, LlamaIndex, vLLM, BentoML, Triton, Drift Detection, Python, Docker, Kubernetes

About the Role

You will build and scale the technical infrastructure that powers large language models and agentic systems. You will create reusable CI/CD workflows for model training, evaluation, and deployment, automate model versioning and approval workflows, and implement compliance checks. You will design and operate modular AI infrastructure—vector databases, feature stores, model registries, and observability tooling—and embed models and agents into real-time applications. You will continuously evaluate and integrate state-of-the-art tools, monitor cost, latency, and performance, and run offline and online evaluation pipelines including regression tests and human-in-the-loop workflows. You will enable researchers by providing sandboxes, dashboards, and reproducible environments, and ensure data accuracy and reliability for model training and inference.

Requirements

  • Write high-quality, maintainable software, primarily in Python
  • Strong background in scalable infrastructure, including containerization and orchestration (Docker, Kubernetes)
  • Experience with infrastructure as code and deployment (Terraform, CI/CD pipelines)
  • Familiarity with monitoring and logging frameworks (Datadog, Prometheus, OpenTelemetry)
  • Knowledge of MLOps best practices, including model versioning, rollback strategies, automated evaluation, and drift detection
  • Experience with scalable model and agent serving infrastructure (vLLM, Triton, BentoML)
  • Experience deploying and maintaining LLM and agentic workflows in production, including monitoring cost, latency, and performance
  • Ability to capture traces for analysis and optimize prompt-response flows with real-time data access
  • Strong ownership, pragmatism, and ability to balance infrastructure elegance with iterative delivery

Responsibilities

  • Build reusable CI/CD workflows for model training, evaluation, and deployment
  • Automate model versioning, approval workflows, and compliance checks
  • Design and maintain modular, scalable AI infrastructure, including vector databases, feature stores, model registries, and observability tooling
  • Embed AI models and agents into real-time applications and workflows
  • Evaluate and integrate state-of-the-art AI tools and libraries
  • Drive AI reliability and governance, and ensure compliance, security, and uptime
  • Deploy infrastructure for offline and online evaluation, including regression testing, cost monitoring, and human-in-the-loop workflows
  • Provide sandboxes, dashboards, and reproducible environments to accelerate research
  • Ensure data accuracy, consistency, and reliability for model training and inference

Benefits

  • Equity plan
  • Remote work

The vacancy text is reproduced from the source without changes
