Senior Software Engineer (Cluster Orchestration)

$139,000–$204,000

Work format

remote (USA only)/hybrid

Employment type

full-time

Grade

senior

English

Country

Vacancy from Hirify Global, a list of international tech companies

Job description

Text:

TL;DR

Senior Software Engineer (Cluster Orchestration): Developing and optimizing orchestration platforms like SUNK (Slurm on Kubernetes) for large-scale AI training and inference, with an emphasis on distributed systems reliability and high-performance scheduling. Focus on eliminating infrastructure bottlenecks, defining SLIs/SLOs for critical services, and driving improvements in system throughput and latency.

Location: Hybrid (Sunnyvale, CA / Bellevue, WA). Remote considered for candidates located more than 30 miles from an office; must be a U.S. person for export control compliance.

Salary: $139,000–$204,000

Company

hirify.global is an AI Hyperscaler providing high-performance cloud infrastructure for AI, trusted by leading labs and enterprises.

What you will do

Own and manage services within the cluster orchestration platform.
Lead design and code reviews to ensure system scalability and robustness.
Define and monitor SLIs/SLOs to improve reliability and performance.
Decompose complex projects into actionable milestones and drive execution.
Mentor junior engineers and strengthen operational practices.
Optimize infrastructure throughput, latency, and resilience for GPU-based workloads.

Requirements

3–5 years of professional software engineering experience in distributed systems or cloud services.
Strong proficiency in Go (Python or C++ experience is a plus).
Hands-on experience operating Kubernetes in a production environment.
Must be a U.S. person (citizen, permanent resident, refugee, or asylee) due to export control regulations.
Familiarity with observability stacks such as Prometheus, Grafana, and OpenTelemetry.
Ability to analyze performance metrics like P95/P99 latency and throughput to drive reliability improvements.

Nice to have

Experience with orchestration/workflow engines (Ray, Kubeflow, Kueue, Istio, Knative, Argo Workflows).
Background in GPU-based applications or machine learning pipelines.
Knowledge of advanced scheduling concepts like pre-emption and quota enforcement.
Experience with incident management and post-incident review practices.

Culture & Benefits

Comprehensive 100% paid medical, dental, and vision insurance.
Generous 401(k) employer match and stock purchase program (ESPP).
Flexible PTO policy and catered daily lunches in office hubs.
Support for mental wellness and family-forming (Spring Health, Carrot, Kinside).
Casual work environment centered on innovative disruption.
Quarterly team gatherings to support collaboration in a hybrid/remote model.
