Company hidden
1 day ago

Site Reliability Engineer (AI)

$350,000 – $475,000
Work format
onsite
Employment type
fulltime
Grade
senior
English
B2
Country
US
Relocation
US
Job from Hirify Global, a list of international tech companies
Job description

TL;DR

Site Reliability Engineer (AI): Driving the end-to-end reliability of the Tinker fine-tuning API, with an emphasis on distributed training systems, production observability, and incident response. Focus on hardening multi-tenant isolation, maximizing GPU utilization through resource scheduling, and building resilient recovery systems for long-running distributed jobs.

Location: Based in San Francisco, California. Relocation support provided.

Salary: $350,000 – $475,000 USD

Company

hirify.global is building the future of collaborative general intelligence, creating the Tinker fine-tuning API to empower researchers and developers to customize frontier AI models.

What you will do

  • Define and own end-to-end reliability, encompassing CI/CD flows, production observability, and incident response.
  • Develop Service Level Objectives (SLOs) for distributed training systems to balance reliability, latency, and velocity.
  • Design and implement comprehensive monitoring and observability across the full training path.
  • Lead incident response for the Tinker platform, ensuring rapid recovery and implementing systematic improvements.
  • Harden multi-tenant isolation and resource scheduling to maximize GPU utilization without compromising data separation.
  • Collaborate with security teams to identify and address production vulnerabilities.

Requirements

  • Bachelor's degree or equivalent experience in Computer Science, Engineering, or a similar field.
  • Professional experience in distributed systems, cloud infrastructure, or site reliability engineering.
  • Proficiency in writing software to automate reliability tooling and solve complex infrastructure problems.
  • Proven track record in production incident response, postmortems, and reliability improvement.
  • Strong communication skills and experience coordinating across engineering and research teams.
  • Must be based in or able to relocate to San Francisco, California.

Nice to have

  • Deep experience operating production cloud services at scale.
  • Background in distributed training frameworks and understanding of infrastructure failure modes in training.
  • Experience building checkpoint and recovery systems for long-running distributed workloads.
  • Expertise in operating and tuning Kubernetes clusters handling heterogeneous GPU workloads.

Culture & Benefits

  • Generous health, dental, and vision benefits.
  • Unlimited PTO and paid parental leave.
  • Relocation support for candidates moving to San Francisco.
  • Visa sponsorship available for the right fit.
