ML Platform Engineer (AI)
Job description
TL;DR
ML Platform Engineer (AI): Building the infrastructure that powers large-scale ML training and data processing for autonomous driving, with an emphasis on scalable orchestration, distributed compute, and production-grade tooling. Focus on making training workloads reliable, cost-efficient, and fast on Kubernetes.
Location: Must be authorized to work in the U.S. Relocation sponsorship and remote work options are not available.
Company
The company builds the infrastructure that powers large-scale ML training and data processing for autonomous driving.
What you will do
- Build and scale our ML compute platform on Kubernetes, using Argo Workflows for training, evaluation, and data processing orchestration.
- Design and implement core platform capabilities, including a Ray-based internal SDK for distributed execution, and multi-tenant resource governance.
- Improve end-to-end training throughput and platform efficiency by optimizing data access patterns and caching, and by removing storage, network, and resource-contention bottlenecks.
- Work directly with ML teams to debug complex workload issues, drive root-cause analysis, and turn recurring problems into platform-level fixes.
- Evaluate, integrate and extend open-source tooling (Argo Workflows, Ray, Kubernetes ecosystem) to meet evolving platform needs.
Requirements
- Strong proficiency in Python or Go; C++ is a plus.
- Track record of designing and building scalable, maintainable systems and services.
- Experience operating production services end-to-end: APIs, reliability practices, observability.
- Deep knowledge of Kubernetes: how scheduling, resource management, controllers, and pod lifecycle actually behave under pressure.
- Solid Linux and systems debugging skills: performance investigation, networking, storage/IO.
- Ability to troubleshoot complex production issues across logs, metrics, and traces and drive them to resolution.
Nice to have
- Experience with Argo Workflows, Ray, MLflow, or comparable distributed ML tooling.
- Hands-on experience building or operating large-scale ML training systems: GPU scheduling, distributed training, training data pipelines.
- Track record of optimizing resource usage and performance in distributed environments.