Site Reliability Engineer (AI)
Job description
TL;DR
Site Reliability Engineer (AI): Drive the end-to-end reliability of the Tinker fine-tuning API, with an emphasis on distributed training systems, production observability, and incident response. The role focuses on hardening multi-tenant isolation, maximizing GPU utilization through resource scheduling, and building resilient recovery systems for long-running distributed jobs.
Location: Based in San Francisco, California. Relocation support provided.
Salary: $350,000 – $475,000 USD
Company
The company is building the future of collaborative general intelligence and has created the Tinker fine-tuning API to empower researchers and developers to customize frontier AI models.
What you will do
- Define and own end-to-end reliability, encompassing CI/CD pipelines, production observability, and incident response.
- Develop Service Level Objectives (SLOs) for distributed training systems to balance reliability, latency, and velocity.
- Design and implement comprehensive monitoring and observability across the full training path.
- Lead incident response for the Tinker platform, ensuring rapid recovery and implementing systematic improvements.
- Harden multi-tenant isolation and resource scheduling to maximize GPU utilization without compromising data separation.
- Collaborate with security teams to identify and address production vulnerabilities.
Requirements
- Bachelor's degree or equivalent experience in Computer Science, Engineering, or a similar field.
- Professional experience in distributed systems, cloud infrastructure, or site reliability engineering.
- Proficiency in writing software to automate reliability tooling and solve complex infrastructure problems.
- Proven track record in production incident response, postmortems, and reliability improvement.
- Strong communication skills and experience coordinating across engineering and research teams.
- Must be based in or able to relocate to San Francisco, California.
Nice to have
- Deep experience operating production cloud services at scale.
- Background in distributed training frameworks and understanding of infrastructure failure modes in training.
- Experience building checkpoint and recovery systems for long-running distributed workloads.
- Expertise in operating and tuning Kubernetes clusters handling heterogeneous GPU workloads.
Culture & Benefits
- Generous health, dental, and vision benefits.
- Unlimited PTO and paid parental leave.
- Relocation support for candidates moving to San Francisco.
- Visa sponsorship available for the right candidate.