5 месяцев назад
Staff Software Engineer, GPU Infrastructure (HPC) (AI Engineering)
Мэтч & Сопровод
Для мэтча с этой вакансией нужен Plus
Описание вакансии
Текст:
TL;DR
Staff Software Engineer (AI Engineering): Responsible for building world-class infrastructure and tools used to train, evaluate and serve foundational models with an accent on stability, scalability, and observability. Focus on deploying and managing Kubernetes-based GPU/TPU superclusters across multiple clouds.
Location: Remote
Company
is training and deploying frontier models for developers and enterprises who are building AI systems to power magical experiences.
What you will do
- Build and scale ML-optimized HPC infrastructure by deploying and managing Kubernetes-based GPU/TPU superclusters across multiple clouds.
- Optimize for AI/ML training and collaborate with cloud providers to fine-tune infrastructure for cost efficiency, reliability, and performance.
- Troubleshoot and resolve complex issues to ensure minimal disruption to AI/ML workflows.
- Enable researchers with self-service tools, designing intuitive interfaces, and workflows.
- Drive innovation in ML infrastructure and translate them into robust, scalable infrastructure solutions.
- Champion best practices for observability, automation, and infrastructure-as-code (IaC).
Requirements
- Deep expertise in ML/HPC infrastructure, including experience with GPU/TPU clusters and distributed training frameworks.
- Proven ability to deploy, manage, and troubleshoot cloud-native Kubernetes clusters for AI workloads.
- Strong programming skills in Python (for ML tooling) and Go (for systems engineering).
- Familiarity with Linux internals, RDMA networking, and performance optimization for ML workloads.
- Experience working closely with AI researchers or ML engineers to solve infrastructure challenges.
- Self-directed problem-solving skills.
Culture & Benefits
- Enjoy an open and inclusive culture and work environment.
- Work closely with a team on the cutting edge of AI research.
- Receive full health and dental benefits, including a separate budget for mental health.
- Benefit from remote-flexible work arrangements and a co-working stipend.
- Enjoy 6 weeks of vacation (30 working days!).
Будьте осторожны: если работодатель просит войти в их систему, используя iCloud/Google, прислать код/пароль, запустить код/ПО, не делайте этого - это мошенники. Обязательно жмите "Пожаловаться" или пишите в поддержку. Подробнее в гайде →
Похожие вакансии
Anthropic
6 дней назад
Ml Infrastructure Engineer (AI)
320 000 - 405 000$
6 дней назад
Staff Software Engineer (AI)
210 000 - 235 000$
Anthropic
2 дня назад
Staff+ Software Engineer (AI)
405 000 - 485 000$
4 дня назад
Staff Software Engineer, Docker Agents (Ai)
6 дней назад
Software Engineer (AI/ML)
Snowflake
7 дней назад