Company hidden

Updated 5 days ago

Software Engineer, Hardware Health (AI)

$250,000 – $445,000

Work format

On-site

Employment type

Full-time

Grade

Senior

English

Country

Listing from Hirify Global, a list of international tech companies

Job description

TL;DR

Software Engineer, Hardware Health (AI): Build and optimize critical infrastructure that maintains the health and observability of global compute clusters, with an emphasis on automated remediation and fleet efficiency. The role focuses on developing health signals for GPUs and CPUs, minimizing downtime for frontier model training, and automating node lifecycle workflows.

Location: San Francisco

Salary: $250,000 – $445,000 + Equity

Company

hirify.global is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity.

What you will do

Define and maintain health signals for GPUs, CPUs, networking, and platform infrastructure.
Develop scalable health checks to detect, remediate, and verify hardware failures.
Investigate system-level issues and hardware failures across large-scale compute environments.
Manage node lifecycle workflows, including drain, quarantine, repair, and RMA processes.
Build automation and tooling for global cluster management to minimize manual intervention.
Collaborate with reliability and workload teams to integrate health signals into AI training systems.

Requirements

7+ years of industry experience in software or infrastructure engineering.
Proficiency with Python and shell scripting.
Experience building large-scale distributed systems or infrastructure platforms.
Ability to analyze operational data using SQL, PromQL, or similar tools.
Strong systems debugging skills and an ownership mindset.

Nice to have

Experience with low-level hardware (PCIe, InfiniBand, RoCE) and Linux kernel tuning.
Experience operating large-scale GPU or accelerator clusters.
Expertise in systems telemetry or network operations.
Experience with automated remediation or fleet lifecycle management.

Culture & Benefits

Competitive salary and equity packages.
Opportunity to work on the last line of defense for frontier model training.
Collaborative environment pushing the boundaries of AI capabilities.

Similar jobs

Senior Hardware Support Engineer (AI)

Network Engineer (Supercomputing)

Senior Systems Engineer (Network)

HPC Systems Engineer (Semiconductor)

Corporate IT Engineer (AI)

Data Center Technician (AI)