ML Infrastructure Engineer (AI)
Job description
TL;DR
ML Infrastructure Engineer (AI/GPU): Lead and support benchmarking of GPU platforms for machine learning and AI workloads, with an emphasis on performance profiling, kernel-level analysis, and hardware optimization. The role focuses on evaluating architectures and software stacks, debugging bottlenecks, performing acceptance testing, and developing tools for visualization and data-driven decisions.
Location: Remote from Europe or United States. Applicants must be authorized to work in the country in which they apply and provide proof of employment eligibility.
Company
The company is building a full-stack AI cloud platform that supports developers and enterprises from data preparation and model training through to production deployment.
What you will do
- Profile and analyze GPU performance at system and kernel level in collaboration with hardware and development teams.
- Evaluate and compare GPU performance across platforms, architectures, and software stacks such as CUDA and ROCm.
- Debug and optimize ML workloads on GPU hardware, resolving performance bottlenecks.
- Conduct acceptance testing for new GPU clusters to ensure performance, stability, and compatibility for AI workloads.
- Run experiments on diverse GPU configurations to assess interconnect strategies and system optimizations.
- Develop tools and dashboards to visualize performance metrics, bottlenecks, and trends.
- Contribute to internal tooling, frameworks, and best practices.
Requirements
- Strong understanding of the theoretical foundations of machine learning
- Deep understanding of the performance aspects of training and inference for large neural networks (data/tensor parallelism, offloading, custom kernels, hardware features, attention optimizations, dynamic batching)
- Extensive experience with modern deep learning frameworks (PyTorch, JAX, Megatron-LM, TensorRT-LLM)
- Good understanding of the GPU stack: CUDA, NCCL, drivers, and relevant libraries
- Familiarity with containerized environments (Docker, Kubernetes)
- Strong communication skills and ability to work independently
Nice to have
- Familiarity with modern LLM inference frameworks (vLLM, SGLang, TensorRT)
- Experience with Python and performance profiling tools (Nsight, nvprof, perf)
- Familiarity with cloud ML platforms (AWS, GCP, Azure ML)
- Contributions to open-source ML benchmarking tools
Culture & Benefits
- Competitive compensation
- Career growth and learning opportunities
- Flexibility and work-life balance
- Collaborative and innovative culture
- Opportunity to work on impactful AI projects
- International environment with talented teams
- Fast-moving environment with bold thinking, constant growth, trust, ownership, and impact