Software Engineer, Workload Enablement (AI)
Job description
TL;DR
Software Engineer, Workload Enablement (AI): Enable production workloads and end-to-end testing on new AI platforms, with an emphasis on building test harnesses and platform stress benchmarks. The role focuses on porting existing inference and training workloads to new systems and hardware, analyzing performance bottlenecks, and characterizing the end-to-end behavior of new systems.
Location: San Francisco or Seattle, USA
Salary: $293K – $455K + equity
Company
An AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity.
What you will do
- Port and validate key inference and training workloads on new platforms/SKUs, driving correctness, performance, and stability.
- Build a suite of benchmarks and stress tests that capture real E2E behavior of our workloads.
- Analyze the performance of distributed training/inference in depth, including collective communication performance and tuning.
- Create repeatable test harnesses that run in CI / lab environments and produce actionable outputs.
- Partner with systems + fleet bring-up engineers to ensure the platform is stable, performant, operationally usable, and scalable.
- Work cross-functionally with vendors and internal stakeholders by producing clear bug reports and prioritized issue lists.
Requirements
- BS in CS/EE (or equivalent practical experience).
- 5+ years in one or more of: ML systems, performance engineering, distributed systems, or HPC.
- Strong hands-on experience with PyTorch and modern LLM training/inference stacks.
- Experience with large-scale distributed training concepts.
- Experience with RDMA and debugging/optimizing comms libraries (NCCL or RCCL).
- Proficiency in Python plus comfort reading/writing performance-critical code (C++/CUDA/HIP is a plus).
- Strong profiling/debugging skills.
Nice to have
- Experience building workload-shaped benchmarks and stress/fault tests that correlate to production behavior.
- Familiarity with RDMA networking and transport tuning.
- Experience running and validating workloads in Kubernetes.
- Hands-on lab experience with early hardware (new NICs, new GPUs/accelerators, early racks).
Culture & Benefits
- Committed to providing reasonable accommodations to applicants with disabilities.
- Believes artificial intelligence has the potential to help people solve immense global challenges.
- Equal opportunity employer, and does not discriminate on the basis of race, religion, color, national origin, sex, sexual orientation, age, veteran status, disability, genetic information, or other applicable legally protected characteristic.