Company hidden
5 days ago

Senior Software Engineer (Cluster Orchestration)

$139,000–$204,000
Work format
remote (USA only)/hybrid
Employment type
full-time
Grade
senior
English
B2
Country
US
Listing from Hirify Global, a list of international tech companies

Job description

TL;DR

Senior Software Engineer (Cluster Orchestration): Develop and optimize orchestration platforms such as SUNK (Slurm on Kubernetes) for large-scale AI training and inference, with an emphasis on distributed systems reliability and high-performance scheduling. The role focuses on eliminating infrastructure bottlenecks, defining SLIs/SLOs for critical services, and improving system throughput and latency.

Location: Hybrid (Sunnyvale, CA / Bellevue, WA). Remote considered for candidates located more than 30 miles from an office; must be a U.S. person for export control compliance.

Salary: $139,000–$204,000

Company

The company (name hidden in this listing) is an AI hyperscaler providing high-performance cloud infrastructure for AI, trusted by leading labs and enterprises.

What you will do

  • Own and manage services within the cluster orchestration platform.
  • Lead design and code reviews to ensure system scalability and robustness.
  • Define and monitor SLIs/SLOs to improve reliability and performance.
  • Decompose complex projects into actionable milestones and drive execution.
  • Mentor junior engineers and strengthen operational practices.
  • Optimize infrastructure throughput, latency, and resilience for GPU-based workloads.

Requirements

  • 3–5 years of professional software engineering experience in distributed systems or cloud services.
  • Strong proficiency in Go (Python or C++ experience is a plus).
  • Hands-on experience operating Kubernetes in a production environment.
  • Must be a U.S. person (citizen, permanent resident, refugee, or asylee) due to export control regulations.
  • Familiarity with observability stacks such as Prometheus, Grafana, and OpenTelemetry.
  • Ability to analyze performance metrics like P95/P99 latency and throughput to drive reliability improvements.

Nice to have

  • Experience with orchestration/workflow engines (Ray, Kubeflow, Kueue, Istio, Knative, Argo Workflows).
  • Background in GPU-based applications or machine learning pipelines.
  • Knowledge of advanced scheduling concepts like pre-emption and quota enforcement.
  • Experience with incident management and post-incident review practices.

Culture & Benefits

  • Comprehensive 100% paid medical, dental, and vision insurance.
  • Generous 401(k) employer match and stock purchase program (ESPP).
  • Flexible PTO policy and catered daily lunches in office hubs.
  • Support for mental wellness and family-forming (Spring Health, Carrot, Kinside).
  • Casual work environment centered on innovative disruption.
  • Quarterly team gatherings to support collaboration in a hybrid/remote model.
