Назад
Company hidden
1 день назад

Software Engineer, Hardware Health (AI)

250 000 - 445 000$
Формат работы
onsite
Тип работы
fulltime
Грейд
senior
Английский
b2
Страна
US
Вакансия из списка Hirify.GlobalВакансия из Hirify Global, списка международных tech-компаний
Для мэтча и отклика нужен Plus

Мэтч & Сопровод

Для мэтча с этой вакансией нужен Plus

Описание вакансии

Текст:
/

TL;DR

Software Engineer, Hardware Health (AI): Building and optimizing critical infrastructure to maintain the health and observability of global compute clusters with an accent on automated remediation and fleet efficiency. Focus on developing health signals for GPUs/CPUs, minimizing downtime for frontier model training, and automating node lifecycle workflows.

Location: San Francisco

Salary: $250,000 – $445,000 + Equity

Company

hirify.global is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity.

What you will do

  • Define and maintain health signals for GPUs, CPUs, networking, and platform infrastructure.
  • Develop scalable health checks to detect, remediate, and verify hardware failures.
  • Investigate system-level issues and hardware failures across large-scale compute environments.
  • Manage node lifecycle workflows, including drain, quarantine, repair, and RMA processes.
  • Build automation and tooling for global cluster management to minimize manual intervention.
  • Collaborate with reliability and workload teams to integrate health signals into AI training systems.

Requirements

  • 7+ years of industry experience in software or infrastructure engineering.
  • Proficiency with Python and shell scripting.
  • Experience building large-scale distributed systems or infrastructure platforms.
  • Ability to analyze operational data using SQL, PromQL, or similar tools.
  • Strong systems debugging skills and an ownership mindset.

Nice to have

  • Experience with low-level hardware (PCIe, InfiniBand, RoCE) and Linux kernel tuning.
  • Experience operating large-scale GPU or accelerator clusters.
  • Expertise in systems telemetry or network operations.
  • Experience with automated remediation or fleet lifecycle management.

Culture & Benefits

  • Competitive salary and equity packages.
  • Opportunity to work on the last line of defense for frontier model training.
  • Collaborative environment pushing the boundaries of AI capabilities.

Будьте осторожны: если работодатель просит войти в их систему, используя iCloud/Google, прислать код/пароль, запустить код/ПО, не делайте этого - это мошенники. Обязательно жмите "Пожаловаться" или пишите в поддержку. Подробнее в гайде →