Posted 2 days ago
HPC Systems Engineer (Linux/Slurm)
Job description
TL;DR
HPC Systems Engineer (Linux/Slurm): Own the reliability, performance, and operability of high-performance computing environments, with an emphasis on Slurm-based cluster management and automation. The focus is on building stable scheduling, optimizing job throughput, and implementing repeatable configuration for research and ML workloads.
Location: Remote
Company
The company focuses on the reliability and performance of high-performance computing environments for researchers and engineers.
What you will do
- Operate and evolve Slurm clusters, managing partitions, QoS, fair-share policies, and job accounting.
- Administer Linux cluster nodes, including provisioning, patching, and lifecycle maintenance across heterogeneous hardware.
- Automate infrastructure and configuration management using Ansible to standardize golden images and node workflows.
- Troubleshoot performance and reliability issues across compute, storage (Lustre/GPFS/NFS), and networking.
- Own the monitoring and observability stack (Prometheus, Grafana), and define SLOs and on-call playbooks.
- Collaborate with research and ML teams to translate workload requirements into capacity plans and queue policies.
Requirements
- 3–7 years of experience administering production Linux systems in multi-node environments.
- Hands-on experience operating and supporting Slurm in an HPC or research/engineering compute setting.
- Strong Linux fundamentals: systemd, networking, storage, and security hardening.
- Proficiency in scripting and automation using Bash and/or Python.
- Experience with monitoring and observability tools to support capacity planning and uptime.
- Ability to work in a ticketed/on-call environment and conduct root-cause analysis for incidents.
Nice to have
- Experience with InfiniBand/RDMA and parallel filesystems (e.g., Lustre, BeeGFS).
- Knowledge of HPC containers such as Apptainer or Singularity.
- Experience with Infrastructure-as-Code (IaC) and CI practices using Terraform and Git.