Systems Engineer (HPC)
Job description
TL;DR
Systems Engineer (HPC): Design, operate, and scale high-performance infrastructure for AI platforms, with an emphasis on Linux environment management, HPC cluster reliability, and large-scale automation. The role focuses on scaling systems to thousands of nodes, managing petabyte-scale storage, and optimizing performance for research and production workloads.
Location: Must be based in the US or Canada (Montreal, Toronto, New York, Palo Alto, San Francisco).
Company
A pioneering startup building high-performance, open, and efficient AI systems to power the next generation of applications.
What you will do
- Operate and maintain large-scale Linux environments across bare metal, clusters, and cloud.
- Monitor system health, troubleshoot incidents, and ensure high availability for research and production workloads.
- Scale infrastructure to support thousands of nodes and petabyte-scale storage systems.
- Automate operational tasks and improve provisioning using Python, Bash, Ansible, or Terraform.
- Collaborate with HPC, platform, and research teams to drive system architecture decisions.
Requirements
- Must be based in the US or Canada.
- Strong Linux systems administration experience.
- Experience working in large-scale environments such as HPC clusters or cloud infrastructure.
- Proficiency with job schedulers such as Slurm.
- Solid troubleshooting skills across systems, hardware, and networks.
Nice to have
- Experience with container orchestration tools such as Kubernetes.
- Knowledge of storage systems such as Ceph, Lustre, or NFS.
- Networking fundamentals including Ethernet and InfiniBand.
- Experience with Infrastructure as Code and automation tooling.
- Background in GPU or AI/ML infrastructure.
Culture & Benefits
- Opportunity to shape data center operations from the ground up in a high-growth AI startup.
- Collaborative, low-ego, and highly technical team environment.
- Competitive compensation and benefits package.
- Direct impact on scaling cutting-edge AI infrastructure.