Senior Software Engineer II (AI)
Job description
TL;DR
Senior Software Engineer II (AI): Building and scaling Kubernetes-native research cluster platforms and sandbox infrastructure for agentic training, with an emphasis on distributed systems, workload orchestration, and ML infrastructure. The focus is on designing high-performance tools that let researchers train models at scale without managing the underlying infrastructure.
Location: Must be based in or able to work from Sunnyvale, CA or Bellevue, WA
Salary: $182,000 - $242,000
Company
A specialized cloud provider delivering high-performance infrastructure for AI, trusted by leading labs and enterprises to accelerate breakthroughs in machine learning.
What you will do
- Design and build a complete research cluster experience including CLI, job configuration schemas, and Kubernetes operators.
- Own the Python SDK for sandbox infrastructure to enable large-scale RL training and agent rollouts.
- Collaborate directly with customers at large AI labs to understand their supercomputing stacks and translate needs into system designs.
- Develop and maintain Kubernetes-native primitives for compute, storage, and networking.
- Write technical documentation to help customers run popular OSS training frameworks on the platform.
Requirements
- 8–12+ years of experience in distributed systems, ML infrastructure, or developer platforms.
- Deep Kubernetes expertise including custom controllers, operators, scheduling, and CRDs at scale.
- U.S. work authorization required due to export control compliance (U.S. citizen, permanent resident, or eligible for export authorization).
- Proven track record of shipping production-grade infrastructure systems.
- Strong communication skills for direct customer interaction and system design translation.
- Understanding of distributed training workflows and researcher productivity bottlenecks.
Nice to have
- Experience building internal ML platforms at large-scale training companies.
- Familiarity with agentic AI, RL training, and sandbox isolation techniques.
- Background with Slurm, Ray, or similar workload orchestration tools.
- Experience with container runtimes like gVisor or Kata.
- OSS contributions to Kubernetes SIGs, Ray, or PyTorch.
Culture & Benefits
- Comprehensive medical, dental, and vision insurance (100% paid).
- 401(k) with generous employer match.
- Flexible PTO and casual work environment.
- Support for family-forming and mental wellness.
- Catered lunches in office and data center locations.
- Opportunities for equity awards and employee stock purchase programs.