HPC Systems Engineer (Linux/Slurm)

Work format

remote

Employment type

full-time

Grade

middle/senior

English

Vacancy from Hirify Global, a list of international tech companies

Job description

TL;DR

HPC Systems Engineer (Linux/Slurm): Owning the reliability, performance, and operability of high-performance computing environments, with an emphasis on Slurm-based cluster management and automation. Focus on building stable scheduling, optimizing job throughput, and implementing repeatable configuration for research and ML workloads.

Location: Remote

Company

hirify.global focuses on the reliability and performance of high-performance computing environments for researchers and engineers.

What you will do

Operate and evolve Slurm clusters, managing partitions, QoS, fair-share policies, and job accounting.
Administer Linux cluster nodes, including provisioning, patching, and lifecycle maintenance across heterogeneous hardware.
Automate infrastructure and configuration management using Ansible to standardize golden images and node workflows.
Troubleshoot performance and reliability issues across compute, storage (Lustre/GPFS/NFS), and networking.
Own the monitoring and observability stack (Prometheus, Grafana), and define SLOs and on-call playbooks.
Collaborate with research and ML teams to translate workload requirements into capacity plans and queue policies.

Requirements

3–7 years of experience administering production Linux systems in multi-node environments.
Hands-on experience operating and supporting Slurm in an HPC or research/engineering compute setting.
Strong Linux fundamentals: systemd, networking, storage, and security hardening.
Proficiency in scripting and automation using Bash and/or Python.
Experience with monitoring and observability tools to support capacity planning and uptime.
Ability to work in a ticketed/on-call environment and conduct root-cause analysis for incidents.

Nice to have

Experience with InfiniBand/RDMA and parallel filesystems (e.g., Lustre, BeeGFS).
Knowledge of HPC containers such as Apptainer or Singularity.
Experience with Infrastructure-as-Code (IaC) and CI practices using Terraform and Git.
