Company hidden
2 months ago

ML Platform Engineer (AI)

Work format
On-site
Employment type
Full-time
English
B2
Country
US

Job description

TL;DR

ML Platform Engineer (AI): Building the infrastructure that powers large-scale ML training and data processing for autonomous driving, with an emphasis on scalable orchestration, distributed compute, and production-grade tooling. The focus is on making training workloads reliable, cost-efficient, and fast on Kubernetes.

Location: Must be authorized to work in the U.S. Relocation sponsorship and remote work options are not available.

Company

hirify.global builds the infrastructure that powers large-scale ML training and data processing for autonomous driving.

What you will do

  • Build and scale our ML compute platform on Kubernetes, using Argo Workflows for training, evaluation, and data processing orchestration.
  • Design and implement core platform capabilities, including a Ray-based internal SDK for distributed execution and multi-tenant resource governance.
  • Improve end-to-end training throughput and platform efficiency by optimizing data access patterns and caching, and by removing storage, network, and resource-contention bottlenecks.
  • Work directly with ML teams to debug complex workload issues, drive root-cause analysis, and turn recurring problems into platform-level fixes.
  • Evaluate, integrate, and extend open-source tooling (Argo Workflows, Ray, Kubernetes ecosystem) to meet evolving platform needs.

Requirements

  • Strong proficiency in Python or Go; C++ is a plus.
  • Track record of designing and building scalable, maintainable systems and services.
  • Experience operating production services end-to-end: APIs, reliability practices, observability.
  • Deep knowledge of Kubernetes: how scheduling, resource management, controllers, and pod lifecycle actually behave under pressure.
  • Solid Linux and systems debugging skills: performance investigation, networking, storage/IO.
  • Ability to troubleshoot complex production issues across logs, metrics, and traces and drive them to resolution.

Nice to have

  • Experience with Argo Workflows, Ray, MLflow, or comparable distributed ML tooling.
  • Hands-on experience building or operating large-scale ML training systems: GPU scheduling, distributed training, training data pipelines.
  • Track record of optimizing resource usage and performance in distributed environments.
