Назад
Company hidden
5 месяцев назад

Staff Software Engineer, GPU Infrastructure (HPC) (AI Engineering)

Формат работы
remote (Global)
Тип работы
fulltime
Английский
b2
Страна
US/Canada
Вакансия из списка Hirify.GlobalВакансия из Hirify Global, списка международных tech-компаний
Для мэтча и отклика нужен Plus

Мэтч & Сопровод

Для мэтча с этой вакансией нужен Plus

Описание вакансии

Текст:
/

TL;DR

Staff Software Engineer (AI Engineering): Responsible for building world-class infrastructure and tools used to train, evaluate and serve foundational models with an accent on stability, scalability, and observability. Focus on deploying and managing Kubernetes-based GPU/TPU superclusters across multiple clouds.

Location: Remote

Company

hirify.global is training and deploying frontier models for developers and enterprises who are building AI systems to power magical experiences.

What you will do

  • Build and scale ML-optimized HPC infrastructure by deploying and managing Kubernetes-based GPU/TPU superclusters across multiple clouds.
  • Optimize for AI/ML training and collaborate with cloud providers to fine-tune infrastructure for cost efficiency, reliability, and performance.
  • Troubleshoot and resolve complex issues to ensure minimal disruption to AI/ML workflows.
  • Enable researchers with self-service tools, designing intuitive interfaces, and workflows.
  • Drive innovation in ML infrastructure and translate them into robust, scalable infrastructure solutions.
  • Champion best practices for observability, automation, and infrastructure-as-code (IaC).

Requirements

  • Deep expertise in ML/HPC infrastructure, including experience with GPU/TPU clusters and distributed training frameworks.
  • Proven ability to deploy, manage, and troubleshoot cloud-native Kubernetes clusters for AI workloads.
  • Strong programming skills in Python (for ML tooling) and Go (for systems engineering).
  • Familiarity with Linux internals, RDMA networking, and performance optimization for ML workloads.
  • Experience working closely with AI researchers or ML engineers to solve infrastructure challenges.
  • Self-directed problem-solving skills.

Culture & Benefits

  • Enjoy an open and inclusive culture and work environment.
  • Work closely with a team on the cutting edge of AI research.
  • Receive full health and dental benefits, including a separate budget for mental health.
  • Benefit from remote-flexible work arrangements and a co-working stipend.
  • Enjoy 6 weeks of vacation (30 working days!).

Будьте осторожны: если работодатель просит войти в их систему, используя iCloud/Google, прислать код/пароль, запустить код/ПО, не делайте этого - это мошенники. Обязательно жмите "Пожаловаться" или пишите в поддержку. Подробнее в гайде →