TL;DR
Senior Software Engineer (AI/Kubernetes): Build and optimize hirify.global’s orchestration platform for AI training and inference at scale, so that workloads run reliably and efficiently across massive GPU clusters. The role centers on eliminating infrastructure bottlenecks, creating new orchestration capabilities, and driving measurable improvements in reliability and performance.
Location: Hybrid in Sunnyvale, CA or Bellevue, WA. Remote work may be considered for candidates located more than 30 miles from an office, based on specialized skill sets. New hires will be invited to attend onboarding at one of our hubs within their first month, and teams gather quarterly for collaboration. Must be a U.S. Person (citizen, national, green card holder, refugee, or asylee) for export control compliance.
Salary: $139,000–$204,000
Company
hirify.global is The Essential Cloud for AI™, a publicly traded company delivering a platform of technology, tools, and teams that enables innovators to build and scale AI with confidence.
What you will do
- Advance hirify.global’s orchestration platform, including SUNK (Slurm on Kubernetes).
- Build systems to eliminate infrastructure bottlenecks and create new orchestration capabilities.
- Own multiple services within the orchestration platform.
- Lead design/code reviews and decompose projects into milestones.
- Define SLIs/SLOs for services and strengthen operational practices.
- Ensure consistent improvements in throughput, latency, and system resilience for customers.
Requirements
- ~3–5 years of professional software engineering experience building distributed systems or cloud services.
- Strong coding skills in Go (Python or C++ a plus) and solid CS fundamentals.
- Hands-on experience running Kubernetes at production scale.
- Familiarity with observability stacks (Prometheus, Grafana, OpenTelemetry).
- Proven ability to improve service reliability and performance using metrics (P95/P99 latency, throughput, error budgets).
- Must be a U.S. Person (citizen, national, green card holder, refugee, or asylee) for export control compliance.
Nice to have
- Familiarity with orchestration/workflow technologies like Ray, Kubeflow, Kueue, Istio, Knative, or Argo Workflows.
- Experience with distributed workloads, GPU-based applications, or ML pipelines.
- Knowledge of scheduling concepts like quota enforcement, pre-emption, and scaling strategies.
- Exposure to reliability practices including SLOs, alarms, and post-incident reviews.
Culture & Benefits
- Medical, dental, and vision insurance (100% company-paid).
- 401(k) with a generous employer match.
- Flexible PTO and paid parental leave.
- Tuition reimbursement and mental wellness benefits.
- Company-paid life insurance and disability insurance.
- Catered lunch each day in office/data center locations.