ML Infrastructure Engineer (AI)

Work format

remote (Europe/United States only)

Employment type

full-time

English

Country

Listing from Hirify RU Global, a list of companies with Eastern European roots

Job description

TL;DR

ML Infrastructure Engineer (AI/GPU): Lead and support benchmarking of GPU platforms for machine learning and AI workloads, with an emphasis on performance profiling, kernel-level analysis, and hardware optimization. Focus on evaluating architectures and software stacks, debugging bottlenecks, performing acceptance testing, and developing tools for visualization and data-driven decisions.

Location: Remote from Europe or United States. Applicants must be authorized to work in the country in which they apply and provide proof of employment eligibility.

Company

hirify.global is building a full-stack AI cloud platform supporting developers and enterprises from data and model training to production deployment.

What you will do

Profile and analyze GPU performance at system and kernel level in collaboration with hardware and development teams.
Evaluate and compare GPU performance across platforms, architectures, and software stacks such as CUDA and ROCm.
Debug and optimize ML workloads on GPU hardware, resolving performance bottlenecks.
Conduct acceptance testing for new GPU clusters to ensure performance, stability, and compatibility for AI workloads.
Run experiments on diverse GPU configurations to assess interconnect strategies and system optimizations.
Develop tools and dashboards to visualize performance metrics, bottlenecks, and trends.
Contribute to internal tooling, frameworks, and best practices.

Requirements

Deep understanding of the theoretical foundations of machine learning
Deep understanding of the performance aspects of training and inference for large neural networks (data/tensor parallelism, offloading, custom kernels, hardware features, attention optimizations, dynamic batching)
Extensive experience with modern deep learning frameworks (PyTorch, JAX, Megatron-LM, TensorRT-LLM)
Good understanding of the GPU stack: CUDA, NCCL, drivers, relevant libraries
Familiarity with containerized environments (Docker, Kubernetes)
Strong communication skills and ability to work independently

Nice to have

Familiarity with modern LLM inference frameworks (vLLM, SGLang, TensorRT)
Experience with Python and performance profiling tools (Nsight, nvprof, perf)
Familiarity with cloud ML platforms (AWS, GCP, Azure ML)
Contributions to open-source ML benchmarking tools

Culture & Benefits

Competitive compensation
Career growth and learning opportunities
Flexibility and work-life balance
Collaborative and innovative culture
Opportunity to work on impactful AI projects
International environment with talented teams
Fast-moving environment with bold thinking, constant growth, trust, ownership, and impact
