Senior Infrastructure Support Engineer (GPUs)
Job description
TL;DR
Senior Infrastructure Support Engineer (GPUs): Maintain and optimize high-performance GPU cloud infrastructure for AI workloads, with an emphasis on Linux systems engineering, Kubernetes, and high-speed networking. The role focuses on resolving complex technical incidents, automating operational tasks, and improving system observability.
Location: Singapore (Onsite)
Company
The company operates a GPU cloud engineered specifically for AI, providing cost-effective, high-performance infrastructure for AI start-ups and large enterprises.
What you will do
- Participate in the Support duty rotation, collaborating with Infrastructure, SRE, and Product Engineering on incidents and changes.
- Proactively improve dashboards, alerts, and runbooks to prevent repeat incidents.
- Manage and resolve technical tickets while keeping internal and external stakeholders informed.
- Design and implement automation scripts and tools to optimize operational processes.
- Conduct root cause analysis (RCA) for major incidents and recommend long-term architectural fixes.
- Respond to critical incidents outside business hours as part of an on-call rotation.
Requirements
- Location: Must be based in Singapore and able to provide onsite technical expertise.
- Expertise in Linux systems engineering at scale, including kernel modules and networking stack troubleshooting.
- Experience operating and troubleshooting Kubernetes (K8s) clusters.
- Practical experience with GPU platforms (NVIDIA/AMD), including drivers, nvidia-smi, and NCCL diagnostics.
- Strong networking fundamentals: L2/L3, BGP, VLANs, VXLAN, and high-performance fabrics (RDMA/NVLink).
- Proficiency in Bash, Python, or JavaScript, and infrastructure automation tools (Ansible, Terraform, Puppet, or Chef).
Nice to have
- Experience with automated network deployment and configuration in critical environments.
- Knowledge of GPU HPC concepts, including InfiniBand, MPI, and Pyxis/Enroot.
- Experience building CI/CD pipelines using GitOps tooling and GitHub Actions.
Culture & Benefits
- Culture of relentless innovation, ownership, and high accountability.
- Environment based on openness, transparency, and candid communication.
- Customer-centric focus with a commitment to delivering impactful AI solutions.
- Strong emphasis on sustainability and long-term environmental responsibility.
- Inclusive workplace committed to equal opportunities for people of diverse backgrounds.