Posted 3 days ago
Senior SRE Engineer (HPC & Cloud)
Job description
TL;DR
Senior SRE Engineer (HPC & Cloud): Manage large-scale Linux environments and HPC clusters with an emphasis on automation, cloud infrastructure, and storage optimization. The role focuses on building internal AI platforms, optimizing CI/CD pipelines, and ensuring the reliability of high-performance compute services.
Location: Taiwan
Company
A quantitative trading firm specializing in high-frequency trading and advanced research.
What you will do
- Manage large-scale Linux environments, focusing on troubleshooting and deep root-cause analysis.
- Develop maintainable automation using Bash, Ansible, and Python for infrastructure operations.
- Operate HPC clusters (Slurm) and maintain high-performance storage solutions like Lustre and NAS.
- Manage multi-cloud infrastructure across AWS, GCP, and Alibaba Cloud using Terraform and AWS CDK.
- Build and operate container environments (Docker, Kubernetes, ECS, EKS) and design GitLab CI/CD pipelines.
- Develop internal AI platforms, chatbots, and agents utilizing LangChain, Bedrock, and Elasticsearch RAG.
Requirements
- 5+ years of hands-on Linux systems administration and infrastructure operations experience.
- Deep knowledge of Linux internals including process, memory, filesystem, networking, and cgroups.
- Proficiency in Bash/Shell scripting and Python for data processing and API services.
- Solid experience with RAID, filesystem selection, and shared storage operations (NFS/SMB).
- Experience with public cloud providers (AWS/GCP/Alibaba) and IaC tooling (Terraform/Ansible).
- Ability to drive complex technical subsystems end-to-end with strong autonomy and minimal supervision.
Nice to have
- Experience with HPC schedulers (Slurm, PBS, LSF) or parallel filesystems (Lustre, GPFS).
- Advanced Linux performance analysis skills using eBPF, perf, or ftrace.
- Database operations experience with MySQL or ClickHouse.
- GPU server operations including NVIDIA driver management, CUDA toolkit, and Slurm GRES configuration.
- LLM application development experience with LangChain or RAG.