Staff Engineer, HPC Systems Software (AI)
Job description
TL;DR
Staff Engineer, HPC Systems Software (AI): Architect and maintain the operating system foundation for global hardware design infrastructure, with an emphasis on bare-metal provisioning and configuration-as-code. The role focuses on scaling OS lifecycle management across hundreds of compute nodes and optimizing Linux kernel performance for AI hardware development.
Location: Hybrid; must be based in Austin (TX), Santa Clara (CA), or Toronto (ON, Canada)
Salary: $100k - $500k
Company
The company is a startup developing AI technology and high-performance RISC-V CPUs.
What you will do
- Design and maintain automated OS deployment pipelines for global bare-metal HPC clusters.
- Manage large-scale configuration using Ansible to ensure consistency across compute infrastructure.
- Deploy RHEL and Ubuntu systems and manage their lifecycle across diverse hardware platforms.
- Implement infrastructure-as-code for repeatable, version-controlled system configurations.
- Troubleshoot OS-level issues and optimize kernel parameters to resolve performance bottlenecks.
- Collaborate with hardware design teams to standardize system configurations and development environments.
Requirements
- Experience in RHEL and Ubuntu administration within HPC or large-scale compute environments.
- High proficiency in Ansible for automation across hundreds of nodes.
- Experience with bare-metal provisioning systems such as MAAS, Foreman, Cobbler, or Warewulf.
- Deep understanding of Linux internals, networking, kernel tuning, and performance troubleshooting.
- Familiarity with HPC cluster architecture and infrastructure-as-code practices.
- Must be eligible to access U.S. export-controlled technology (EAR compliance).
Nice to have
- Hands-on experience with IBM Spectrum LSF or similar HPC workload managers.
- Experience integrating with commercial HPC storage platforms such as Pure Storage, Weka, or VAST Data.
- Exposure to EDA tools and hardware design workflows in semiconductor development.
- Experience with container technologies including Docker, Singularity, or Podman.
- Cluster monitoring skills using Prometheus, Grafana, and custom tooling.
- Python and bash scripting for production-level infrastructure automation.
Culture & Benefits
- Highly competitive compensation package including base and variable targets.
- Collaborative environment with a focus on curiosity and solving hard technical problems.
- Opportunity to work on revolutionary AI platforms and RISC-V CPU architecture.
- Equal opportunity employer.