Назад
Company hidden
5 часов назад

Software Engineer, AI Reliability (AI)

325 000 - 485 000$
Формат работы
hybrid
Тип работы
fulltime
Английский
b2
Страна
US
Вакансия из списка Hirify.GlobalВакансия из Hirify Global, списка международных tech-компаний
Для мэтча и отклика нужен Plus

Мэтч & Сопровод

Для мэтча с этой вакансией нужен Plus

Описание вакансии

Текст:
/

TL;DR

Software Engineer, AI Reliability (AI): Improving reliability across critical AI serving paths from SDK through network, API layers, serving infrastructure, and accelerators, with an accent on designing and implementing monitoring, high-availability infrastructure, and leading incident response for large language model systems. Focus on systematic improvements, understanding system composition, and ensuring Claude's reliability for users.

Location: Hybrid in San Francisco, New York City, or Seattle (USA). Visa sponsorship available.

Salary: $325,000 – $485,000 USD annually

Company

hirify.global is a public benefit corporation focused on creating reliable, interpretable, and steerable AI systems.

What you will do

  • Develop appropriate Service Level Objectives for large language model serving systems.
  • Design and implement monitoring and observability systems across the token path.
  • Assist in the design and implementation of high-availability serving infrastructure across multiple regions and cloud providers.
  • Lead incident response for critical AI services, ensuring rapid recovery, thorough incident reviews, and systematic improvements.
  • Support the reliability of safeguard model serving, critical for both site reliability and hirify.global's safety commitments.

Requirements

  • At least a Bachelor's degree in a related field or equivalent experience.
  • Strong distributed systems, infrastructure, or reliability backgrounds (SREs).
  • Curiosity and comfort jumping into unfamiliar systems during an incident to drive resolution.
  • Holistic thinking about how systems compose and where the seams are.
  • Excellent communication and collaboration skills to partner across the entire company.
  • Care about users and feel ownership over outcomes, even for systems you don't own.

Nice to have

  • Experience as an SRE, Production Engineer, or in similar reliability-focused roles on large scale systems.
  • Experience operating large-scale model serving or training infrastructure (>1000 GPUs).
  • Experience with one or more ML hardware accelerators (GPUs, TPUs, Trainium).
  • Understanding of ML-specific networking optimizations like RDMA and InfiniBand.
  • Expertise in AI-specific observability tools and frameworks.
  • Experience with chaos engineering and systematic resilience testing.

Culture & Benefits

  • Hybrid work policy: Expectation to be in one of our offices at least 25% of the time.
  • Competitive compensation and benefits, optional equity donation matching.
  • Generous vacation and parental leave, flexible working hours.
  • Collaborative environment focused on advancing long-term goals of steerable, trustworthy AI.
  • Lovely office space in San Francisco to collaborate with colleagues.

Будьте осторожны: если работодатель просит войти в их систему, используя iCloud/Google, прислать код/пароль, запустить код/ПО, не делайте этого - это мошенники. Обязательно жмите "Пожаловаться" или пишите в поддержку. Подробнее в гайде →

Текст вакансии взят без изменений

Источник - загрузка...