Staff Platform Reliability Engineer (AI)
Job description
TL;DR
Staff Platform Reliability Engineer (SRE/Kubernetes): Build and optimize Tempest, the scale and reliability platform, for AI-driven organizations, with an emphasis on performance profiling and infrastructure automation. The role focuses on diagnosing system bottlenecks, improving observability through Prometheus and New Relic, and scaling the platform across multi-cloud environments.
Location: Remote (Must be based in the US)
Salary: $185,000 - $230,000 USD
Company
builds software that helps the largest AI-driven organizations develop and operate advanced data science and AI solutions at scale.
What you will do
- Serve as the technical owner of Tempest, ensuring the scale and reliability platform remains extensible and aligned with infrastructure needs.
- Diagnose and resolve performance bottlenecks and resource misconfigurations in production Kubernetes environments.
- Deliver data-driven sizing recommendations for customer-facing documentation based on empirical testing.
- Enhance observability by improving Prometheus and New Relic instrumentation to pinpoint root causes.
- Operationalize and enable scale testing across multiple cloud providers.
- Build infrastructure automation to increase the operational efficiency of a small engineering team.
Requirements
- Experience in SRE or platform engineering operating distributed systems in production Kubernetes.
- Strong proficiency in Python for orchestration, infrastructure automation, and systems integration.
- Expertise with observability stacks including Prometheus, Grafana, and New Relic.
- Proven ability to profile services, identify resource bottlenecks, and ship durable fixes.
- Familiarity with performance and load testing tools such as Locust or k6.
- Must be based in the United States.
Culture & Benefits
- Comprehensive benefits package including medical, dental, and vision insurance.
- Financial perks such as a 401(k) plan, equity, and company bonuses.
- Wellness stipends to support employee health.
- Remote-first, asynchronous work environment.
- Culture of continuous improvement, teaching, and learning.