Senior Solutions Engineer (AI Infrastructure)
Job description
TL;DR
Senior Solutions Engineer (AI Infrastructure): Building customer solution architectures for large-scale AI, HPC, analytics, and data-intensive workloads, with an emphasis on GPU clusters, high-performance storage and networking, Kubernetes platforms, and distributed training/inference environments. Focus on technical discovery, PoC planning and execution, and translating complex infrastructure requirements into deployment guidance that drives production success.
Location: Remote (United States)
Company
The company provides infrastructure solutions for large-scale AI and data-intensive workloads.
What you will do
- Lead technical discovery with customers across infrastructure, platform, ML, data, and executive stakeholders.
- Design architectures for large-scale AI, HPC, analytics, and enterprise data workloads.
- Evaluate infrastructure tradeoffs involving GPUs, storage, networking, orchestration, and data movement.
- Design and execute proofs of concept to validate performance, scale, reliability, and business value.
- Debug customer issues across Linux, storage, networking, Kubernetes, schedulers, GPUs, and application workloads.
- Create technical assets (demos, runbooks, field guidance) and support production deployment planning.
Requirements
- 8 to 12+ years of technical experience, including significant hands-on infrastructure work.
- Experience building, operating, or architecting production platform infrastructure.
- Strong understanding of Linux internals and distributed systems (including Paxos and Raft), plus storage and networking implementation details.
- Experience with one or more of the following: GPU infrastructure, large-scale HPC, Kubernetes platforms, MLOps, storage systems, cloud infrastructure, or data platforms.
- Ability to communicate credibly with engineers, architects, technical executives, and business stakeholders.
- Strong discovery, problem-solving, and systems debugging skills in ambiguous, fast-moving environments.
Nice to have
- Experience with large-scale GPU clusters, distributed training/inference, and AI platforms.
- Experience with petabyte-scale storage and high-performance data systems.
- Experience with orchestration/scheduling tools such as Kubernetes, Slurm, Ray, or Spark.
- Domain expertise with systems like Lustre, Ceph, Weka, BeeGFS, GPFS, VAST, object storage, or distributed filesystems.
- Experience with InfiniBand/RoCE/RDMA and high-performance Ethernet, plus NVIDIA/Mellanox networking.
- Hands-on experience with CUDA/NCCL/DCGM/GPUDirect, checkpointing, dataset staging, or model-serving infrastructure.
Culture & Benefits
- Customer-facing technical role focused on deep infrastructure problem solving and clear solution design.
- Work end-to-end from discovery and evaluation through deployment planning and production success.
- Operate without a rigid playbook in fast-moving, ambiguous environments.
- Partner with product and engineering to feed field feedback into the roadmap.