Company hidden
3 hours ago

Principal Site Reliability Engineer (AI)

Work format
remote (Global)
Employment type
full-time
Grade
senior
English
B2
Vacancy from Hirify Global, a list of international tech companies

Job description

TL;DR

Principal Site Reliability Engineer (AI): Lead reliability strategy and architectural design for high-performance AI and HPC infrastructure, with an emphasis on scalability, automation, and operational excellence. The role focuses on designing large-scale control-plane systems, defining reliability standards, and driving systemic improvements across GPU and network platforms.

Company

hirify.global provides high-performance, cost-effective GPU cloud infrastructure engineered specifically for AI start-ups and enterprise customers.

What you will do

  • Own and evolve the long-term reliability strategy for AI and HPC infrastructure.
  • Design and lead the development of large-scale control-plane systems and automation frameworks.
  • Define reliability standards, SLO frameworks, and operational best practices.
  • Act as a senior technical escalation point during critical incidents to ensure systemic resolution.
  • Partner with cross-functional leadership to influence platform design and operational maturity.
  • Mentor senior and mid-level engineers to elevate SRE practices across the organization.

Requirements

  • 10+ years of experience in SRE, Systems, or Software Engineering operating complex infrastructure.
  • Expert-level software engineering skills in building production-grade automation.
  • Deep expertise in Linux, networking, and distributed systems design at scale.
  • Extensive experience debugging failures across hardware, OS, network, and application layers.
  • Proven ability to lead technical initiatives across teams without direct authority.
  • Strong systems-thinking mindset balancing reliability, velocity, and cost.

Nice to have

  • Hands-on experience with AI/HPC platforms, InfiniBand/RDMA, and workload schedulers like SLURM.
  • Experience designing observability systems for high-cardinality and high-throughput environments.
  • Familiarity with Kubernetes at scale and hybrid or bare-metal cloud architectures.

Culture & Benefits

  • Competitive base and equity package with annual reviews.
  • Remote-first environment with a focus on trust, autonomy, and flexible work.
  • Opportunity to work at a fast-growing startup building cutting-edge AI technology.
  • Collaborative, supportive environment with a focus on professional growth and progression.
