Distributed Systems Engineer (AI)
Job description
TL;DR
Distributed Systems Engineer (AI): Build a platform that schedules, routes, and operates AI workloads across thousands of nodes, with an emphasis on distributed scheduling, resource allocation, reliability, and fault tolerance. The role focuses on designing orchestration systems, handling failure modes, and optimizing performance across compilers, runtimes, and heterogeneous hardware.
Location: San Francisco, CA or New York City, NY
Salary: $120,000-$400,000
Company
AI infrastructure startup with an $80M Series A and deployments at Fortune 500 and AI-native companies, working with foundation labs and hyperscalers.
What you will do
- Build distributed scheduling and orchestration systems for large-scale AI workloads.
- Implement resource allocation across thousands of nodes in production.
- Design reliability, fault tolerance, and failure handling mechanisms.
- Work across the stack with compilers, runtimes, and hardware for performance and correctness.
Requirements
- Proven ownership of distributed systems in production.
- Strong Kubernetes experience.
- Deep understanding of concurrency, failure modes, and system tradeoffs.
- Strong programming skills in Go, C++, or Python.
Nice to have
- Experience with ML inference systems or performance-critical workloads.
- Familiarity with scheduling, queues, or resource management systems.