Staff Observability Platform Engineer (AI)
Job description
TL;DR
Staff Observability Platform Engineer (AI): Build and evolve a robust observability platform for GPU clusters and AI workloads, with an emphasis on telemetry pipelines, system reliability, and operational visibility. The role focuses on designing scalable solutions for distributed systems, mentoring engineering teams, and driving architectural decisions to ensure high-performance infrastructure for AI development.
Location: Must be based in the US
Company
A GPU cloud infrastructure provider engineered specifically for AI start-ups and large enterprises, with a focus on high-performance computing and cost-effective AI development.
What you will do
- Design and evolve observability platforms covering metrics, logs, traces, and telemetry pipelines.
- Lead the implementation of scalable solutions supporting growing GPU and AI infrastructure.
- Partner with SRE, platform, and AI/ML teams to embed observability across the infrastructure lifecycle.
- Develop standards and reusable patterns to simplify observability adoption.
- Identify reliability risks and operational blind spots to proactively address system issues.
- Mentor engineers and provide technical guidance through design and code reviews.
Requirements
- Must be based in the US
- 6+ years of experience in SRE, platform, or observability engineering.
- Deep hands-on experience with tools like Prometheus, Thanos, Grafana, Loki, OpenTelemetry, or ClickHouse.
- Strong software engineering proficiency in Go or Python.
- Experience operating and troubleshooting Kubernetes-based platforms at scale.
- Strong understanding of modern observability practices and telemetry pipelines.
Nice to have
- Experience with GPU, AI/ML, or HPC environments.
- Familiarity with Slurm or Kubernetes GPU scheduling.
- Experience with high-volume streaming technologies like Kafka or Vector.
- Knowledge of observability challenges related to model training and inference workloads.
Culture & Benefits
- Focus on relentless innovation, ownership, and accountability.
- Opportunity to build infrastructure powering the future of AI.
- Collaborative environment working across SRE, platform, and AI/ML teams.
- Commitment to transparency and open communication.
- Inclusive culture encouraging applications from diverse backgrounds.