Senior Software Engineer (Cluster Orchestration)
Job description
TL;DR
Senior Software Engineer (Cluster Orchestration): Develop and optimize orchestration platforms such as SUNK (Slurm on Kubernetes) for large-scale AI training and inference, with an emphasis on distributed systems reliability and high-performance scheduling. The role focuses on eliminating infrastructure bottlenecks, defining SLIs/SLOs for critical services, and improving system throughput and latency.
Location: Hybrid (Sunnyvale, CA / Bellevue, WA). Remote considered for candidates located more than 30 miles from an office; must be a U.S. person for export control compliance.
Salary: $139,000–$204,000
Company
An AI hyperscaler providing high-performance cloud infrastructure for AI, trusted by leading labs and enterprises.
What you will do
- Own and manage services within the cluster orchestration platform.
- Lead design and code reviews to ensure system scalability and robustness.
- Define and monitor SLIs/SLOs to improve reliability and performance.
- Decompose complex projects into actionable milestones and drive execution.
- Mentor junior engineers and strengthen operational practices.
- Optimize infrastructure throughput, latency, and resilience for GPU-based workloads.
Requirements
- 3–5 years of professional software engineering experience in distributed systems or cloud services.
- Strong proficiency in Go (Python or C++ experience is a plus).
- Hands-on experience operating Kubernetes in a production environment.
- Must be a U.S. person (citizen, permanent resident, refugee, or asylee) due to export control regulations.
- Familiarity with observability stacks such as Prometheus, Grafana, and OpenTelemetry.
- Ability to analyze performance metrics like P95/P99 latency and throughput to drive reliability improvements.
Nice to have
- Experience with orchestration/workflow engines (Ray, Kubeflow, Kueue, Istio, Knative, Argo Workflows).
- Background in GPU-based applications or machine learning pipelines.
- Knowledge of advanced scheduling concepts such as preemption and quota enforcement.
- Experience with incident management and post-incident review practices.
Culture & Benefits
- Comprehensive 100% paid medical, dental, and vision insurance.
- Generous 401(k) employer match and stock purchase program (ESPP).
- Flexible PTO policy and catered daily lunches in office hubs.
- Support for mental wellness and family-forming (Spring Health, Carrot, Kinside).
- Casual work environment centered on innovative disruption.
- Quarterly team gatherings to support collaboration in a hybrid/remote model.