AI Infrastructure Engineer (HPC)
Job description
TL;DR
AI Infrastructure Engineer (GPU/HPC): Operate and optimize GPU clusters and implement elastic scheduling for inference and training, with an emphasis on high-throughput serving, distributed communication stacks, and unified orchestration. The role focuses on tuning vLLM/SGLang runtimes, optimizing NCCL/RDMA communication, and building comprehensive observability for GPU utilization.
Location: Singapore, SG
Company
The company is a world-leading technology company specializing in Bitcoin mining solutions and AI cloud services, operating a global portfolio of HPC datacenters.
What you will do
- Operate and optimize GPU clusters using Kubernetes, Slurm, and Ray across multiple regions.
- Implement elastic scheduling and unified orchestration for inference and training jobs using Kueue, NVIDIA KAI Scheduler, or KEDA.
- Manage and tune vLLM and SGLang runtimes for high-throughput, low-latency serving, focusing on continuous batching and KV-cache paging.
- Optimize distributed communication stacks, including NCCL/RCCL, RDMA over Converged Ethernet (RoCEv2), and InfiniBand.
- Benchmark and profile performance across various model sizes (7B to 70B+), precisions, and quantization schemes (FP8, AWQ, GPTQ).
- Build observability stacks with Prometheus, Grafana, and OpenTelemetry to monitor GPU utilization and latency.
Requirements
- Bachelor’s or Master’s degree in Computer Science, Electrical Engineering, or a related field (PhD preferred).
- 4–8+ years of experience in backend engineering, distributed systems, platform engineering, or applied AI.
- Strong proficiency in Python, plus experience with Go, TypeScript, Rust, or C++.
- Hands-on experience with Kubernetes, Slurm, or Ray.
- Strong background in PyTorch/JAX and distributed communication stacks (NCCL/RCCL, RDMA).
- Fluent in English.
Nice to have
- Experience with major cloud platforms and designing production-grade architectures.
- Familiarity with retrieval systems, embeddings, and vector stores like Qdrant, Chroma, or pgvector.
- Experience with agent frameworks, tool-calling, function orchestration, or MCP.
Culture & Benefits
- Opportunity to work in a global environment with datacenters in the US, Bhutan, Norway, Canada, Malaysia, and Ethiopia.
- Exposure to world-leading technology in ASIC chip design and HPC cloud capabilities.
- Commitment to equal employment opportunities and a diverse, inclusive workplace.