Company hidden
2 days ago

HPC Systems Engineer (Linux/Slurm)

Work format
Remote
Employment type
Full-time
Grade
Middle/Senior
English
B2
Vacancy from Hirify Global, a list of international tech companies

Job description

TL;DR

HPC Systems Engineer (Linux/Slurm): Own the reliability, performance, and operability of high-performance computing environments, with an emphasis on Slurm-based cluster management and automation. The role focuses on building stable scheduling, optimizing job throughput, and implementing repeatable configuration for research and ML workloads.

Location: Remote

Company

hirify.global focuses on the reliability and performance of high-performance computing environments for researchers and engineers.

What you will do

  • Operate and evolve Slurm clusters, managing partitions, QoS, fair-share policies, and job accounting.
  • Administer Linux cluster nodes, including provisioning, patching, and lifecycle maintenance across heterogeneous hardware.
  • Automate infrastructure and configuration management using Ansible to standardize golden images and node workflows.
  • Troubleshoot performance and reliability issues across compute, storage (Lustre/GPFS/NFS), and networking.
  • Own the monitoring and observability stack (Prometheus, Grafana), and use it to define SLOs and on-call playbooks.
  • Collaborate with research and ML teams to translate workload requirements into capacity plans and queue policies.

Requirements

  • 3–7 years of experience administering production Linux systems in multi-node environments.
  • Hands-on experience operating and supporting Slurm in an HPC or research/engineering compute setting.
  • Strong Linux fundamentals: systemd, networking, storage, and security hardening.
  • Proficiency in scripting and automation using Bash and/or Python.
  • Experience with monitoring and observability tools to support capacity planning and uptime.
  • Ability to work in a ticketed/on-call environment and conduct root-cause analysis for incidents.

Nice to have

  • Experience with InfiniBand/RDMA and parallel filesystems (e.g., Lustre, BeeGFS).
  • Knowledge of HPC containers such as Apptainer or Singularity.
  • Experience with Infrastructure-as-Code (IaC) and CI practices using Terraform and Git.
