Company hidden
3 days ago

Site Reliability Engineer (AI)

$150,000 - $200,000
Work format
remote (USA only)
Employment type
full-time
Grade
senior
English
B2
Country
US
Job from Hirify.Global, a list of international tech companies

Job description

TL;DR

Site Reliability Engineer (AI/Infrastructure): Ensure the stability, resilience, and observability of a global AI infrastructure platform, with an emphasis on SLI/SLO adoption and incident response. Focus on automating operational workflows, reducing toil, and improving GPU performance visibility in high-scale distributed systems.

Location: Remote, USA

Salary: $150,000 – $200,000 USD

Company

hirify.global is a foundational platform for developers to build and run custom AI systems that scale, focusing on high-performance infrastructure for modern AI workloads.

What you will do

  • Define and enforce reliability standards, SLIs, and SLOs across critical services.
  • Lead incident response, coordinate mitigation, and conduct blameless postmortems.
  • Design and improve observability systems using Prometheus, Grafana, and custom tooling.
  • Automate recurring operational workflows using Python, Go, and Bash to reduce toil.
  • Partner with engineering teams to enhance fault tolerance and system resilience.
  • Perform production readiness reviews for new features and identify systemic risks.

Requirements

  • 5+ years of experience in SRE, Reliability Engineering, or Production Engineering.
  • Must be based in the USA.
  • Strong expertise in Linux systems, networking, and containerized production environments.
  • Deep understanding of distributed systems and failure modes.
  • Proven experience in incident response leadership and managing SLIs/SLOs.
  • Strong scripting/programming skills and excellent written communication.

Nice to have

  • Experience with GPU infrastructure or AI/ML platforms.
  • Background in high-growth, large-scale environments or startup experience.
  • Familiarity with Infrastructure as Code (IaC) and GPU observability tooling.
  • Experience building internal reliability platforms or frameworks.

Culture & Benefits

  • Meaningful equity and stock options in a fast-growing company.
  • Generous medical, dental, and vision plans.
  • Flexible PTO policy to support recharge and work-life balance.
  • Remote-first culture with collaborative teams that communicate over Slack.
