HPC Systems Engineer (Linux)
Job description
TL;DR
HPC Systems Engineer (Linux/Slurm): Manage the reliability and performance of high-performance computing environments, with an emphasis on Slurm cluster operations and Linux systems engineering. The focus is on automating cluster provisioning, tuning scheduling and storage performance, and ensuring system stability for research workloads.
Company
The company provides specialized operations and infrastructure management for high-performance computing environments.
What you will do
- Operate and evolve Slurm configurations (partitions, QoS, fairshare) to balance throughput, priority, and cost.
- Administer Linux cluster nodes, including provisioning, patching, and lifecycle maintenance across heterogeneous hardware.
- Automate infrastructure using Ansible and standardize golden images and node deployment workflows.
- Troubleshoot performance and reliability issues across compute, storage (Lustre/GPFS/NFS), and networking.
- Implement monitoring and alerting with Prometheus and Grafana, and define SLOs and on-call playbooks.
- Collaborate with research teams to translate workload needs into capacity plans and queue policies.
Requirements
- 3–7 years of experience administering production Linux systems in multi-node environments.
- Hands-on experience operating and supporting Slurm in an HPC or research/engineering compute setting.
- Strong Linux fundamentals: systemd, networking, storage, and security hardening.
- Proficiency in scripting with Bash and/or Python to build reliable operational tooling.
- Experience working in a ticketed/on-call environment with a strong focus on root-cause analysis.
Nice to have
- Experience with InfiniBand/RDMA and parallel filesystems (e.g., Lustre, BeeGFS).
- Knowledge of HPC containers such as Apptainer or Singularity.
- Experience with Infrastructure-as-Code (Terraform) and Git-based change management.