Company hidden
18 hours ago

Senior Site Reliability Engineer (AI Infrastructure Operations)

$100,000 - $200,000
Work format
remote (USA only)
Employment type
fulltime
Grade
senior
English
B2
Country
US

Job description

TL;DR

Senior Site Reliability Engineer (AI Infrastructure Operations): Design, build, and operate reliable, scalable infrastructure across our GPU cloud, with an emphasis on hands-on engineering, system reliability, and operational excellence. Focus on improving performance, automating operations, and ensuring platform stability at scale.

Location: US

Salary: $100,000 - $200,000 USD

Company

hirify.global is the GPU cloud engineered for AI—purpose-built to deliver high-performance, cost-efficient infrastructure for AI-native startups and global enterprises.

What you will do

  • Design, build, and improve automation, tooling, and infrastructure systems supporting AI and HPC workloads.
  • Contribute to the development of control-plane systems and operational frameworks.
  • Define and implement SLOs, SLIs, and monitoring strategies to ensure system reliability.
  • Participate in incident response and root cause analysis, driving improvements to reduce recurrence.
  • Collaborate with Engineering, Network, and Fleet teams to improve system design and operational processes.
  • Drive improvements in availability, scalability, and operational efficiency.

Requirements

  • 5–8+ years of experience in SRE, Systems Engineering, or Software Engineering in production environments.
  • Strong software engineering skills with experience building automation and infrastructure tooling.
  • Solid understanding of Linux systems, networking, and distributed systems.
  • Experience troubleshooting issues across infrastructure, OS, networking, and application layers.
  • Familiarity with monitoring, alerting, and observability tools.
  • Ability to balance reliability, performance, and delivery speed.

Nice to have

  • Experience with AI or HPC environments, including GPUs or high-performance systems.
  • Exposure to high-speed networking (InfiniBand/RDMA).
  • Familiarity with Kubernetes, cloud platforms, or bare-metal environments.
  • Experience with observability systems in high-scale environments.

Culture & Benefits

  • Culture is defined by ownership, accountability, and rapid innovation.
  • Operate with urgency and transparency.
  • Competitive benefits package including medical, dental, vision, flexible paid time off, parental leave, and retirement plan participation.
