Software Engineer, Hardware Health (AI)
Мэтч & Сопровод
Для мэтча с этой вакансией нужен Plus
Описание вакансии
TL;DR
Software Engineer, Hardware Health (AI): Building and optimizing critical infrastructure to maintain the health and observability of global compute clusters with an accent on automated remediation and fleet efficiency. Focus on developing health signals for GPUs/CPUs, minimizing downtime for frontier model training, and automating node lifecycle workflows.
Location: San Francisco
Salary: $250,000 – $445,000 + Equity
Company
is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity.
What you will do
- Define and maintain health signals for GPUs, CPUs, networking, and platform infrastructure.
- Develop scalable health checks to detect, remediate, and verify hardware failures.
- Investigate system-level issues and hardware failures across large-scale compute environments.
- Manage node lifecycle workflows, including drain, quarantine, repair, and RMA processes.
- Build automation and tooling for global cluster management to minimize manual intervention.
- Collaborate with reliability and workload teams to integrate health signals into AI training systems.
Requirements
- 7+ years of industry experience in software or infrastructure engineering.
- Proficiency with Python and shell scripting.
- Experience building large-scale distributed systems or infrastructure platforms.
- Ability to analyze operational data using SQL, PromQL, or similar tools.
- Strong systems debugging skills and an ownership mindset.
Nice to have
- Experience with low-level hardware (PCIe, InfiniBand, RoCE) and Linux kernel tuning.
- Experience operating large-scale GPU or accelerator clusters.
- Expertise in systems telemetry or network operations.
- Experience with automated remediation or fleet lifecycle management.
Culture & Benefits
- Competitive salary and equity packages.
- Opportunity to work on the last line of defense for frontier model training.
- Collaborative environment pushing the boundaries of AI capabilities.
Будьте осторожны: если работодатель просит войти в их систему, используя iCloud/Google, прислать код/пароль, запустить код/ПО, не делайте этого - это мошенники. Обязательно жмите "Пожаловаться" или пишите в поддержку. Подробнее в гайде →