HPC Systems Engineer (Linux)

Employment type

Full-time

Level

Middle/Senior

English

Vacancy from Hirify Global, a list of international tech companies

Job description

TL;DR

HPC Systems Engineer (Linux/Slurm): Manage the reliability and performance of high-performance computing environments, with an emphasis on Slurm cluster operations and Linux systems engineering. The role focuses on automating cluster provisioning, tuning scheduling and storage performance, and keeping systems stable for research workloads.

Company

hirify.global provides specialized operations and infrastructure management for high-performance computing environments.

What you will do

Operate and evolve Slurm configurations (partitions, QoS, fairshare) to balance throughput, priority, and cost.
Administer Linux cluster nodes, including provisioning, patching, and lifecycle maintenance across heterogeneous hardware.
Automate infrastructure using Ansible and standardize golden images and node deployment workflows.
Troubleshoot performance and reliability issues across compute, storage (Lustre/GPFS/NFS), and networking.
Implement monitoring and alerting with Prometheus and Grafana, and define SLOs and on-call playbooks.
Collaborate with research teams to translate workload needs into capacity plans and queue policies.

Requirements

3–7 years of experience administering production Linux systems in multi-node environments.
Hands-on experience operating and supporting Slurm in an HPC or research/engineering compute setting.
Strong Linux fundamentals: systemd, networking, storage, and security hardening.
Proficiency in scripting with Bash and/or Python to build reliable operational tooling.
Experience working in a ticketed/on-call environment with a strong focus on root-cause analysis.

Nice to have

Experience with InfiniBand/RDMA and parallel filesystems (e.g., Lustre, BeeGFS).
Knowledge of HPC containers such as Apptainer or Singularity.
Experience with Infrastructure-as-Code (Terraform) and Git-based change management.
