TL;DR
Software Engineer, AI Reliability (AI): Improving reliability across critical AI serving paths from SDK through network, API layers, serving infrastructure, and accelerators, with an accent on designing and implementing monitoring, high-availability infrastructure, and leading incident response for large language model systems. Focus on systematic improvements, understanding system composition, and ensuring Claude's reliability for users.
Location: Hybrid in San Francisco, New York City, or Seattle (USA). Visa sponsorship available.
Salary: $325,000 – $485,000 USD annually
Company
hirify.global is a public benefit corporation focused on creating reliable, interpretable, and steerable AI systems.
What you will do
- Develop appropriate Service Level Objectives for large language model serving systems.
- Design and implement monitoring and observability systems across the token path.
- Assist in the design and implementation of high-availability serving infrastructure across multiple regions and cloud providers.
- Lead incident response for critical AI services, ensuring rapid recovery, thorough incident reviews, and systematic improvements.
- Support the reliability of safeguard model serving, critical for both site reliability and hirify.global's safety commitments.
Requirements
- At least a Bachelor's degree in a related field or equivalent experience.
- Strong distributed systems, infrastructure, or reliability backgrounds (SREs).
- Curiosity and comfort jumping into unfamiliar systems during an incident to drive resolution.
- Holistic thinking about how systems compose and where the seams are.
- Excellent communication and collaboration skills to partner across the entire company.
- Care about users and feel ownership over outcomes, even for systems you don't own.
Nice to have
- Experience as an SRE, Production Engineer, or in similar reliability-focused roles on large scale systems.
- Experience operating large-scale model serving or training infrastructure (>1000 GPUs).
- Experience with one or more ML hardware accelerators (GPUs, TPUs, Trainium).
- Understanding of ML-specific networking optimizations like RDMA and InfiniBand.
- Expertise in AI-specific observability tools and frameworks.
- Experience with chaos engineering and systematic resilience testing.
Culture & Benefits
- Hybrid work policy: Expectation to be in one of our offices at least 25% of the time.
- Competitive compensation and benefits, optional equity donation matching.
- Generous vacation and parental leave, flexible working hours.
- Collaborative environment focused on advancing long-term goals of steerable, trustworthy AI.
- Lovely office space in San Francisco to collaborate with colleagues.
Будьте осторожны: если работодатель просит войти в их систему, используя iCloud/Google, прислать код/пароль, запустить код/ПО, не делайте этого - это мошенники. Обязательно жмите "Пожаловаться" или пишите в поддержку. Подробнее в гайде →