11 hours ago
Senior Site Reliability Engineer
Job description
Location: Headquartered in Amsterdam with R&D hubs across Europe, North America, and Israel.
The company is leading a new era in cloud computing to serve the global AI economy.
Overview
You will own the reliability, performance, and observability of the entire inference stack. You will design telemetry pipelines, tune Kubernetes autoscalers, craft Terraform modules, and harden request-routing and retry logic. The goal is scaling the platform smoothly while hitting aggressive cost and reliability targets.
What you will do
- Design and refine telemetry pipelines to turn signal into actionable insight.
- Tune Kubernetes autoscalers to optimize GPU efficiency.
- Craft Terraform modules to build resilience into new clusters.
- Harden request-routing and retry logic to prevent transient failures.
- Automate incident detection, isolation, and remediation.
- Drive post-mortem culture to prevent recurrence.
Requirements
- Deep fluency with Kubernetes, Prometheus, Grafana, and Terraform.
- Comfortable scripting in Python or Bash.
- Understanding of alert design and SLOs for high-throughput APIs.
- Experience with GPU-heavy workloads (vLLM, Triton, Ray, or similar).
- Background in MLOps or model-hosting platforms.
Culture & Benefits
- Competitive salary and comprehensive benefits package.
- Opportunities for professional growth within the company.
- Hybrid working arrangements.
- Dynamic and collaborative work environment.