Company hidden
Posted 9 hours ago

Software Engineer, Workload Enablement (AI)

$293,000 – $455,000
Work format
onsite
Employment type
full-time
Level
middle/senior
English
B2
Country
US
Listed on Hirify.Global, a directory of international tech companies
Job description


TL;DR

Software Engineer, Workload Enablement (AI): enable production workloads and end-to-end testing on new AI platforms, with an emphasis on building test harnesses and platform stress benchmarks. The focus is on porting existing inference and training workloads to new systems and hardware, analyzing performance bottlenecks, and characterizing the end-to-end behavior of new systems.

Location: San Francisco or Seattle, USA

Salary: $293K – $455K, plus equity

Company

hirify.global is an AI research and deployment company dedicated to ensuring that general-purpose artificial intelligence benefits all of humanity.

What you will do

  • Port and validate key inference and training workloads on new platforms/SKUs, driving correctness, performance, and stability.
  • Build a suite of benchmarks and stress tests that capture real E2E behavior of our workloads.
  • Deep-dive performance on distributed training/inference, including collective performance and tuning.
  • Create repeatable test harnesses that run in CI / lab environments and produce actionable outputs.
  • Partner with systems + fleet bring-up engineers to ensure the platform is stable, performant, operationally usable, and scalable.
  • Work cross-functionally with vendors and internal stakeholders by producing clear bug reports and prioritized issue lists.
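
To make the "repeatable test harness with actionable outputs" responsibility concrete, here is a minimal, hypothetical sketch using only the Python standard library. A real harness at this company would wrap an inference or training step rather than an arbitrary callable, and would emit results to CI dashboards; the function and field names below are illustrative assumptions, not the employer's actual tooling.

```python
import statistics
import time

def run_benchmark(workload, iters=50, warmup=5):
    """Run a callable repeatedly and report latency percentiles.

    A stdlib-only illustration of a repeatable harness: warm up,
    measure, and return a machine-readable report that CI can
    assert against (e.g. fail the build if p99 regresses).
    """
    for _ in range(warmup):  # warm caches before measuring
        workload()
    samples = []
    for _ in range(iters):
        start = time.perf_counter()
        workload()
        samples.append(time.perf_counter() - start)
    samples.sort()
    return {
        "p50_ms": statistics.median(samples) * 1e3,
        "p99_ms": samples[int(0.99 * (len(samples) - 1))] * 1e3,
        "mean_ms": statistics.fmean(samples) * 1e3,
    }

if __name__ == "__main__":
    # Stand-in workload; in practice this would be a model step.
    print(run_benchmark(lambda: sum(range(10_000))))
```

The key design point the role description implies: the harness output is structured data, so regressions can be caught automatically rather than eyeballed.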

Requirements

  • BS in CS/EE (or equivalent practical experience).
  • 5+ years in one or more of: ML systems, performance engineering, distributed systems, or HPC.
  • Strong hands-on experience with PyTorch and modern LLM training/inference stacks.
  • Experience with large-scale distributed training concepts.
  • Experience with RDMA and with debugging/optimizing communication libraries (NCCL or RCCL).
  • Proficiency in Python plus comfort reading/writing performance-critical code (C++/CUDA/HIP is a plus).
  • Strong profiling/debugging skills.
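
As a small illustration of the "strong profiling skills" requirement, the sketch below profiles a deliberately naive Python loop with the standard-library `cProfile`/`pstats` modules. This is a generic example, not the team's actual workflow; on GPU workloads the analogous tools would be framework and vendor profilers.

```python
import cProfile
import io
import pstats

def hot_loop(n=200_000):
    """Deliberately naive workload standing in for a real training step."""
    total = 0
    for i in range(n):
        total += i * i
    return total

# Profile the workload and print the top entries by cumulative time,
# which is the usual first pass when hunting a bottleneck.
profiler = cProfile.Profile()
profiler.enable()
hot_loop()
profiler.disable()

buf = io.StringIO()
pstats.Stats(profiler, stream=buf).sort_stats("cumulative").print_stats(5)
report = buf.getvalue()
print(report)
```

The report names `hot_loop` as the dominant frame; the same sort-by-cumulative-time habit transfers directly to reading framework profiler traces.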

Nice to have

  • Experience building workload-shaped benchmarks and stress/fault tests that correlate to production behavior.
  • Familiarity with RDMA networking and transport tuning.
  • Experience running and validating workloads in Kubernetes.
  • Hands-on lab experience with early hardware (new NICs, new GPUs/accelerators, early racks).

Culture & Benefits

  • Committed to providing reasonable accommodations to applicants with disabilities.
  • Believes artificial intelligence has the potential to help people solve immense global challenges.
  • Equal opportunity employer; does not discriminate on the basis of race, religion, color, national origin, sex, sexual orientation, age, veteran status, disability, genetic information, or any other legally protected characteristic.
