Site Reliability Engineer (AI Infrastructure)

Work format

remote (Global)

Employment type

full-time

Grade

senior

English

Country

Vacancy from Hirify Global, a list of international tech companies

Job description

TL;DR

Site Reliability Engineer (AI Infrastructure): Provision and operate Kubernetes-based clusters for AI workloads, with an emphasis on automation, scalability, and observability. The role focuses on building the foundation for reliable global AI compute and solving complex networking and scheduling challenges.

Location: Global Remote / San Francisco, CA

Company

hirify.global provides early-stage startups with scaled AI infrastructure and is building a global liquidity layer for AI compute.

What you will do

Provision, configure, and operate Kubernetes-based clusters for customers across multiple providers.
Build automation and tooling to streamline cluster deployments and integrations.
Debug customer issues across networking, storage, scheduling, and system layers.
Improve reliability and scalability of both training and inference infrastructure.
Design and implement monitoring, alerting, and observability for critical systems.
Participate in on-call and incident response, leading postmortems and reliability improvements.

Requirements

5+ years of experience in SRE, DevOps, or infrastructure engineering roles.
Strong Linux systems and networking fundamentals.
Deep experience with Kubernetes and container orchestration at scale.
Proficiency with Infrastructure-as-Code tools such as Terraform, Helm, and Ansible.
Strong automation and scripting skills in Python, Go, or Bash.
Experience with observability stacks including Prometheus, Grafana, Loki, and Datadog.

Nice to have

Exposure to ML/AI infrastructure or GPU-based systems (CUDA, Slurm, Triton).
Familiarity with high-performance networking (InfiniBand, NVLink) or distributed storage (VAST, Weka, Ceph).
Customer-facing support or consulting experience.

Culture & Benefits

High level of ownership and autonomy to shape how systems run.
Direct collaboration with customers and providers.
Opportunity to build the foundation for reliable, scalable AI infrastructure.
Builder-centric environment focusing on solving hard engineering problems.
