Company hidden
3 days ago

Senior SRE Engineer (HPC & Cloud)

Work format: onsite
Employment type: full-time
Grade: senior
English: B2
Country: Taiwan

Job description

TL;DR

Senior SRE Engineer (HPC & Cloud): Managing large-scale Linux environments and HPC clusters with an emphasis on automation, cloud infrastructure, and storage optimization. The role also covers building internal AI platforms, optimizing CI/CD pipelines, and ensuring the reliability of high-performance compute services.

Location: Taiwan

Company

The company (name hidden in this listing) is a quantitative trading firm specializing in high-frequency trading and advanced research.

What you will do

  • Manage large-scale Linux environments, focusing on troubleshooting and deep root-cause analysis.
  • Develop maintainable automation using Bash, Ansible, and Python for infrastructure operations.
  • Operate HPC clusters (Slurm) and maintain high-performance storage solutions like Lustre and NAS.
  • Manage multi-cloud infrastructure across AWS, GCP, and Alibaba Cloud using Terraform and AWS CDK.
  • Build and operate container environments (Docker, Kubernetes, ECS, EKS) and design GitLab CI/CD pipelines.
  • Develop internal AI platforms, chatbots, and agents utilizing LangChain, Bedrock, and Elasticsearch RAG.

Requirements

  • 5+ years of hands-on Linux systems administration and infrastructure operations experience.
  • Deep knowledge of Linux internals including process, memory, filesystem, networking, and cgroups.
  • Proficiency in Bash/Shell scripting and Python for data processing and API services.
  • Solid experience with RAID, filesystem selection, and shared storage operations (NFS/SMB).
  • Experience with public cloud providers (AWS/GCP/Alibaba) and IaC tooling (Terraform/Ansible).
  • Ability to drive complex technical subsystems end-to-end with strong autonomy and minimal supervision.

Nice to have

  • Experience with HPC schedulers (Slurm, PBS, LSF) or parallel filesystems (Lustre, GPFS).
  • Advanced Linux performance analysis skills using eBPF, perf, or ftrace.
  • Database operations experience with MySQL or ClickHouse.
  • GPU server operations including NVIDIA driver management, CUDA toolkit, and Slurm GRES configuration.
  • LLM application development experience with LangChain or RAG.
