Site Reliability Engineer (Supercomputing)

255 000 - 490 000$

Формат работы

onsite

Тип работы

fulltime

Грейд

senior

Английский

Страна

Вакансия из Hirify Global, списка международных tech-компаний
Для мэтча и отклика нужен Plus

Мэтч & Сопровод

Для мэтча с этой вакансией нужен Plus

Описание вакансии

Текст:

TL;DR

Site Reliability Engineer (Supercomputing): Operating and scaling the next generation of compute clusters that power hirify.global’s frontier research with an accent on distributed systems engineering and hands-on infrastructure work. Focus on managing fast-moving operations, diagnosing issues, and building automation for large-scale systems.

Location: Onsite in San Francisco

Salary: $255K – $490K + Offers Equity

Company

hirify.global is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity.

What you will do

Spin up and scale large Kubernetes clusters, including automation for provisioning and lifecycle management.
Build software abstractions that unify multiple clusters for seamless training workloads.
Own node bring-up from bare metal through firmware upgrades.
Improve operational metrics such as reducing cluster restart times and accelerating upgrade cycles.
Integrate networking and hardware health systems for end-to-end reliability.
Develop monitoring and observability systems to maintain cluster stability.

Requirements

Experience as an infrastructure or distributed systems engineer in large-scale environments.
Strong knowledge of Kubernetes internals and containerized workloads.
Proficiency in cloud infrastructure concepts and automating operations.

Будьте осторожны: если работодатель просит войти в их систему, используя iCloud/Google, прислать код/пароль, запустить код/ПО, не делайте этого - это мошенники. Обязательно жмите "Пожаловаться" или пишите в поддержку. Подробнее в гайде →