Infrastructure Support Engineer (GPUs)
Job description
TL;DR
Infrastructure Support Engineer (GPUs): Maintain and troubleshoot high-performance GPU cloud infrastructure for AI workloads, with an emphasis on service reliability and rapid incident response. The role focuses on managing Kubernetes clusters, Linux-based systems, and GPU-specific diagnostics so that customers can develop AI without disruption.
Location: Singapore (includes availability to travel to company or customer locations)
Company
The company is a GPU cloud provider engineered specifically for AI startups and large enterprises to reduce the complexity of AI development.
What you will do
- Handle day-to-day tickets and alerts within the support duty rotation, escalating complex incidents to Engineering.
- Resolve common issues using established runbooks, and contribute improvements and incremental fixes to those runbooks.
- Monitor, troubleshoot, and triage platform issues, capturing logs and facts for efficient handover.
- Collaborate with cross-functional teams and serve as the escalation point for onsite operations staff.
- Document validated steps and contribute to training materials to build team capability.
- Identify and implement automation opportunities to optimize support processes.
Requirements
- 2-4 years of experience in support, operations, or infrastructure engineering, ideally within cloud or data centre environments.
- Proficiency in Linux CLI, system services, filesystems, permissions, and basic networking tools.
- Solid grasp of networking basics: IP addressing, subnets, VLANs, routing, DNS, and firewalls.
- Exposure to Kubernetes core concepts (nodes, pods, services, logs) and basic troubleshooting.
- Familiarity with GPU diagnostics such as nvidia-smi.
- Ability to write simple Bash or Python scripts and use Git for version control.
Nice to have
- Hands-on experience with Kubernetes administration, operators, or specialized storage/networking add-ons.
- Knowledge of RDMA/InfiniBand, HPC concepts, and NCCL for performance troubleshooting.
- Experience with Infrastructure as Code tools like Ansible or Terraform.
- Experience with GitOps workflows and CI/CD pipelines using GitHub Actions.
- Experience with security tooling such as Teleport or Vault.
Culture & Benefits
- Culture of relentless innovation, ownership, and accountability.
- Commitment to openness, transparency, and an open-source approach to build trust.
- Dedicated focus on sustainability and reducing the environmental impact of AI technologies.
- Fast, efficient, and respectful collaboration within a global team.
- Inclusive, equal-opportunity environment that welcomes diverse backgrounds.