Senior Software Engineer II, AI Workload Orchestration (AI Engineering)
Job description
TL;DR
Senior Software Engineer II (AI Workload Orchestration): You will help build and operate the company's Kubernetes-native platform for admitting, scheduling, and operating AI workloads at scale, with an emphasis on reliability and performance improvements. The focus is on scaling the system as customer demand and workload complexity continue to grow.
Location: Sunnyvale, CA / Bellevue, WA
Salary: $165,000 to $242,000
Company
The company is The Essential Cloud for AI™, delivering a platform of technology, tools, and teams that enables innovators to build and scale AI with confidence.
What you will do
- Design, build, and operate Kubernetes-native services for AI workload orchestration and scheduling.
- Own one or more platform components end-to-end, including design, implementation, testing, and on-call support.
- Improve scheduling latency, cluster utilization, and workload reliability through metrics-driven engineering.
- Contribute to architectural discussions across services and influence design decisions within the platform.
- Work closely with adjacent teams to ensure clean interfaces and integrations.
- Mentor junior engineers and raise the quality bar for code, design, and operations.
Requirements
- 5–8 years of professional software engineering experience in distributed systems, cloud infrastructure, or platform engineering.
- Strong experience building production systems in Go (Python or C++ a plus).
- Solid understanding of Kubernetes fundamentals, APIs, controllers, and operating services in production.
- Experience working with scheduling, resource management, or quota-based systems.
- Proven ability to improve system reliability and performance using data and operational metrics.
- Comfortable owning services in production and participating in on-call rotations.
Nice to have
- Experience with Kubernetes-native orchestration frameworks such as Kueue, Volcano, Ray, Kubeflow, or Argo Workflows.
- Familiarity with GPU-based workloads, ML training, or inference pipelines.
- Knowledge of scheduling concepts such as quota enforcement, pre-emption, and backfilling.
- Experience with reliability practices including SLOs, alerting, and incident response.
- Exposure to AI infrastructure, HPC, or large-scale distributed compute environments.
Culture & Benefits
- Medical, dental, and vision insurance, 100% paid for by the company.
- Flexible Spending Account and Health Savings Account.
- 401(k) with a generous employer match.
- Flexible PTO.
- A casual work environment.
- A work culture focused on innovative disruption.