4 месяца назад

Software Engineer, AI Reliability (AI)

325 000 - 485 000$

Формат работы

hybrid

Тип работы

fulltime

Английский

Страна

Вакансия из Hirify Global, списка международных tech-компаний
Для мэтча и отклика нужен Plus

Мэтч & Сопровод

Для мэтча с этой вакансией нужен Plus

Описание вакансии

Текст:

TL;DR

Software Engineer, AI Reliability (AI): Improving reliability across critical AI serving paths from SDK through network, API layers, serving infrastructure, and accelerators, with an accent on designing and implementing monitoring, high-availability infrastructure, and leading incident response for large language model systems. Focus on systematic improvements, understanding system composition, and ensuring Claude's reliability for users.

Location: Hybrid in San Francisco, New York City, or Seattle (USA). Visa sponsorship available.

Salary: $325,000 – $485,000 USD annually

Company

Anthropic is a public benefit corporation focused on creating reliable, interpretable, and steerable AI systems.

What you will do

Develop appropriate Service Level Objectives for large language model serving systems.
Design and implement monitoring and observability systems across the token path.
Assist in the design and implementation of high-availability serving infrastructure across multiple regions and cloud providers.
Lead incident response for critical AI services, ensuring rapid recovery, thorough incident reviews, and systematic improvements.
Support the reliability of safeguard model serving, critical for both site reliability and Anthropic's safety commitments.

Requirements

At least a Bachelor's degree in a related field or equivalent experience.
Strong distributed systems, infrastructure, or reliability backgrounds (SREs).
Curiosity and comfort jumping into unfamiliar systems during an incident to drive resolution.
Holistic thinking about how systems compose and where the seams are.
Excellent communication and collaboration skills to partner across the entire company.
Care about users and feel ownership over outcomes, even for systems you don't own.

Nice to have

Experience as an SRE, Production Engineer, or in similar reliability-focused roles on large scale systems.
Experience operating large-scale model serving or training infrastructure (>1000 GPUs).
Experience with one or more ML hardware accelerators (GPUs, TPUs, Trainium).
Understanding of ML-specific networking optimizations like RDMA and InfiniBand.
Expertise in AI-specific observability tools and frameworks.
Experience with chaos engineering and systematic resilience testing.

Culture & Benefits

Hybrid work policy: Expectation to be in one of our offices at least 25% of the time.
Competitive compensation and benefits, optional equity donation matching.
Generous vacation and parental leave, flexible working hours.
Collaborative environment focused on advancing long-term goals of steerable, trustworthy AI.
Lovely office space in San Francisco to collaborate with colleagues.

Будьте осторожны: если работодатель просит войти в их систему, используя iCloud/Google, прислать код/пароль, запустить код/ПО, не делайте этого - это мошенники. Обязательно жмите "Пожаловаться" или пишите в поддержку. Подробнее в гайде →