Site Reliability Engineer (AI)

150 000 - 200 000$

Формат работы

remote (только USA)

Тип работы

fulltime

Грейд

senior

Английский

Страна

Вакансия из Hirify Global, списка международных tech-компаний
Для мэтча и отклика нужен Plus

Мэтч & Сопровод

Для мэтча с этой вакансией нужен Plus

Описание вакансии

Текст:

TL;DR

Site Reliability Engineer (AI/Infrastructure): Ensuring the stability, resilience, and observability of a global AI infrastructure platform with an accent on SLI/SLO adoption and incident response. Focus on automating operational workflows, reducing toil, and improving GPU performance visibility in high-scale distributed systems.

Location: Remote, USA

Salary: $150,000 – $200,000 USD

Company

hirify.global is a foundational platform for developers to build and run custom AI systems that scale, focusing on high-performance infrastructure for modern AI workloads.

What you will do

Define and enforce reliability standards, SLIs, and SLOs across critical services.
Lead incident response, coordinate mitigation, and conduct blameless postmortems.
Design and improve observability systems using Prometheus, Grafana, and custom tooling.
Automate recurring operational workflows using Python, Go, and Bash to reduce toil.
Partner with engineering teams to enhance fault tolerance and system resilience.
Perform production readiness reviews for new features and identify systemic risks.

Requirements

5+ years of experience in SRE, Reliability Engineering, or Production Engineering.
Must be based in the USA.
Strong expertise in Linux systems, networking, and containerized production environments.
Deep understanding of distributed systems and failure modes.
Proven experience in incident response leadership and managing SLIs/SLOs.
Strong scripting/programming skills and excellent written communication.

Nice to have

Experience with GPU infrastructure or AI/ML platforms.
Background in high-growth, large-scale environments or startup experience.
Familiarity with Infrastructure as Code (IaC) and GPU observability tooling.
Experience building internal reliability platforms or frameworks.

Culture & Benefits

Meaningful equity and stock options in a fast-growing company.
Generous medical, dental, and vision plans.
Flexible PTO policy to support recharge and work-life balance.
Remote-first culture with collaborative teams utilizing Slack for communication.

Будьте осторожны: если работодатель просит войти в их систему, используя iCloud/Google, прислать код/пароль, запустить код/ПО, не делайте этого - это мошенники. Обязательно жмите "Пожаловаться" или пишите в поддержку. Подробнее в гайде →