Staff Engineer, HPC Systems Software (AI)
Job description
TL;DR
Staff Engineer, HPC Systems Software (AI): Architecting and maintaining the operating system foundation for global hardware design infrastructure, with an emphasis on bare-metal provisioning and configuration-as-code. Focus on scaling OS lifecycle management across hundreds of compute nodes and optimizing Linux kernel performance for AI hardware development.
Location: Hybrid; must be based in Austin (TX), Santa Clara (CA), or Toronto (ON, Canada)
Salary: $100k - $500k
Company
is a startup leading the industry in cutting-edge AI technology and high-performance RISC-V CPUs.
What you will do
- Design and maintain automated OS deployment pipelines for global bare-metal HPC clusters.
- Manage large-scale configuration using Ansible to ensure consistency across compute infrastructure.
- Deploy RHEL and Ubuntu systems and manage their lifecycle across diverse hardware platforms.
- Implement infrastructure-as-code for repeatable, version-controlled system configurations.
- Troubleshoot OS-level issues and optimize kernel parameters to resolve performance bottlenecks.
- Collaborate with hardware design teams to standardize system configurations and development environments.
Requirements
- Experience with RHEL and Ubuntu administration in HPC or large-scale compute environments.
- High proficiency in Ansible for automation across hundreds of nodes.
- Experience with bare-metal provisioning systems such as MAAS, Foreman, Cobbler, or Warewulf.
- Deep understanding of Linux internals, networking, kernel tuning, and performance troubleshooting.
- Familiarity with HPC cluster architecture and infrastructure-as-code practices.
- Must be eligible to access U.S. export-controlled technology (EAR compliance).
Nice to have
- Hands-on experience with IBM Spectrum LSF or similar HPC workload managers.
- Experience integrating with commercial HPC storage platforms such as Pure Storage, Weka, or Vast Data.
- Exposure to EDA tools and hardware design workflows in semiconductor development.
- Experience with container technologies including Docker, Singularity, or Podman.
- Cluster monitoring skills using Prometheus, Grafana, and custom tooling.
- Python and bash scripting for production-level infrastructure automation.
Culture & Benefits
- Highly competitive compensation package including base and variable targets.
- Collaborative environment with a focus on curiosity and solving hard technical problems.
- Opportunity to work on revolutionary AI platforms and RISC-V CPU architecture.
- Equal opportunity employer.