Site Reliability Engineer (AI)

$350,000 - $475,000

Work format

onsite

Employment type

fulltime

Grade

senior

English

Country

Relocation

Vacancy from Hirify Global, a list of international tech companies

Job description

TL;DR

Site Reliability Engineer (AI): Drive the end-to-end reliability of the Tinker fine-tuning API, with an emphasis on distributed training systems, production observability, and incident response. The role centers on hardening multi-tenant isolation, maximizing GPU utilization through resource scheduling, and building resilient recovery systems for long-running distributed jobs.

Location: Based in San Francisco, California. Relocation support provided.

Salary: $350,000 – $475,000 USD

Company

hirify.global is building the future of collaborative general intelligence, creating the Tinker fine-tuning API to empower researchers and developers to customize frontier AI models.

What you will do

Define and own end-to-end reliability, including CI/CD pipelines, production observability, and incident response.
Develop Service Level Objectives (SLOs) for distributed training systems to balance reliability, latency, and velocity.
Design and implement comprehensive monitoring and observability across the full training path.
Lead incident response for the Tinker platform, ensuring rapid recovery and implementing systematic improvements.
Harden multi-tenant isolation and resource scheduling to maximize GPU utilization without compromising data separation.
Collaborate with security teams to identify and address production vulnerabilities.

Requirements

Bachelor's degree or equivalent experience in Computer Science, Engineering, or a similar field.
Professional experience in distributed systems, cloud infrastructure, or site reliability engineering.
Proficiency in writing software to automate reliability tooling and solve complex infrastructure problems.
Proven track record in production incident response, postmortems, and reliability improvement.
Strong communication skills and experience coordinating across engineering and research teams.
Must be based in or able to relocate to San Francisco, California.

Nice to have

Deep experience operating production cloud services at scale.
Background in distributed training frameworks and understanding of infrastructure failure modes in training.
Experience building checkpoint and recovery systems for long-running distributed workloads.
Expertise in operating and tuning Kubernetes clusters handling heterogeneous GPU workloads.

Culture & Benefits

Generous health, dental, and vision benefits.
Unlimited PTO and paid parental leave.
Relocation support for candidates moving to San Francisco.
Visa sponsorship available for the right candidate.
