Company hidden
19 hours ago

HPC Systems Engineer (Linux)

Job type
full-time
Grade
middle/senior
English
B2

Job description

TL;DR

HPC Systems Engineer (Linux/Slurm): Manage the reliability and performance of high-performance computing environments, with an emphasis on Slurm cluster operations and Linux system engineering. The role focuses on automating cluster provisioning, tuning scheduling and storage performance, and keeping systems stable for research workloads.

Company

hirify.global provides specialized operations and infrastructure management for high-performance computing environments.

What you will do

  • Operate and evolve Slurm configurations (partitions, QoS, fairshare) to balance throughput, priority, and cost.
  • Administer Linux cluster nodes, including provisioning, patching, and lifecycle maintenance across heterogeneous hardware.
  • Automate infrastructure using Ansible and standardize golden images and node deployment workflows.
  • Troubleshoot performance and reliability issues across compute, storage (Lustre/GPFS/NFS), and networking.
  • Implement monitoring and alerting with Prometheus and Grafana, and define SLOs and on-call playbooks.
  • Collaborate with research teams to translate workload needs into capacity plans and queue policies.

Requirements

  • 3–7 years of experience administering production Linux systems in multi-node environments.
  • Hands-on experience operating and supporting Slurm in an HPC or research/engineering compute setting.
  • Strong Linux fundamentals: systemd, networking, storage, and security hardening.
  • Proficiency in scripting with Bash and/or Python to build reliable operational tooling.
  • Experience working in a ticketed/on-call environment with a strong focus on root-cause analysis.

Nice to have

  • Experience with InfiniBand/RDMA and parallel filesystems (e.g., Lustre, BeeGFS).
  • Knowledge of HPC containers such as Apptainer or Singularity.
  • Experience with Infrastructure-as-Code (Terraform) and Git-based change management.
