TL;DR
Senior Site Reliability Engineer (AI/ML Inference): Own the reliability, performance, and observability of a massive-scale AI inference platform, with an emphasis on designing telemetry, optimizing Kubernetes autoscalers, and hardening distributed back-end systems. The role focuses on building self-healing systems, debugging performance from the kernel to the application layer, and ensuring flawless behavior under extreme load.
Location: Remote (Europe or United States)
Company
hirify.global is an AI cloud computing company serving the global AI economy, building tools and resources for customers to solve real-world AI/ML challenges.
What you will do
- Own the reliability, performance, and observability of the entire inference stack.
- Design and refine telemetry pipelines (metrics, logs, traces) for actionable insight.
- Tune Kubernetes autoscalers and craft Terraform modules for cluster resilience.
- Harden request-routing and retry logic to prevent user-facing failures.
- Detect, isolate, and remediate problems using automation and runbooks.
- Drive post-mortem culture to prevent incident recurrence.
Requirements
- Deep fluency with Kubernetes, Prometheus, Grafana, and Terraform.
- Proficiency in infrastructure-as-code principles and practices.
- Comfortable scripting in Python or Bash.
- Understanding of alert design and SLOs for high-throughput APIs.
- Experience diagnosing and resolving distributed back-end failures in production environments.
Nice to have
- Experience operating GPU-heavy workloads (e.g., with vLLM, Triton, Ray).
- Background in MLOps or model-hosting platforms.
Culture & Benefits
- Competitive salary and comprehensive benefits package.
- Opportunities for professional growth within the company.
- Flexible working arrangements.
- Dynamic and collaborative work environment that values initiative and innovation.