ML Platform Engineer (AI)

Work format

On-site

Employment type

Full-time

Job description

TL;DR

ML Platform Engineer (AI): Build the infrastructure that powers large-scale ML training and data processing for autonomous driving, with an emphasis on scalable orchestration, distributed compute, and production-grade tooling. The focus is on making training workloads reliable, cost-efficient, and fast on Kubernetes.

Location: Must be authorized to work in the U.S. Relocation sponsorship and remote work options are not available.

Company

hirify.global builds the infrastructure that powers large-scale ML training and data processing for autonomous driving.

What you will do

Build and scale our ML compute platform on Kubernetes, using Argo Workflows for training, evaluation, and data processing orchestration.
Design and implement core platform capabilities, including a Ray-based internal SDK for distributed execution, and multi-tenant resource governance.
Improve end-to-end training throughput and platform efficiency by optimizing data access patterns and caching, and by removing storage, network, and resource-contention bottlenecks.
Work directly with ML teams to debug complex workload issues, drive root-cause analysis, and turn recurring problems into platform-level fixes.
Evaluate, integrate, and extend open-source tooling (Argo Workflows, Ray, Kubernetes ecosystem) to meet evolving platform needs.

Requirements

Strong proficiency in Python or Go; C++ is a plus.
Track record of designing and building scalable, maintainable systems and services.
Experience operating production services end-to-end: APIs, reliability practices, observability.
Deep knowledge of Kubernetes: how scheduling, resource management, controllers, and pod lifecycle actually behave under pressure.
Solid Linux and systems debugging skills: performance investigation, networking, storage/IO.
Ability to troubleshoot complex production issues across logs, metrics, and traces and drive them to resolution.

Nice to have

Experience with Argo Workflows, Ray, MLflow, or comparable distributed ML tooling.
Hands-on experience building or operating large-scale ML training systems: GPU scheduling, distributed training, training data pipelines.
Track record of optimizing resource usage and performance in distributed environments.
