TL;DR
ML Infrastructure Engineer (AI): Build large-scale compute, storage, and software infrastructure to support Cursor’s work on the world’s best agentic coding model, with an emphasis on improving training throughput and reliability. The role focuses on creating software and systems that automate building, monitoring, and running GPU clusters, and on optimizing compute environments for large RL workloads.
Company
hirify.global is building the world's best agentic coding model.
What you will do
- Collaborate with ML researchers to improve the throughput and reliability of training.
- Work with OEMs, cloud service providers, and others to plan and build cutting-edge GPU infrastructure.
- Improve the density and scalability of compute environments to enable increasingly large RL workloads.
- Create software and systems to automate building, monitoring, and running GPU clusters.
- Build workload scheduling and data movement systems to support Cursor’s growing training footprint.
Requirements
- Strong background in systems and infrastructure-focused software engineering, particularly in Python, TypeScript, Rust, and Golang.
- Experience with distributed storage and networking infrastructure, particularly on Linux systems across cloud and bare metal environments.
- Exposure to large-scale systems and their unique challenges, ideally across thousands of nodes with significant resource footprints.
- Production use of infrastructure-as-code and configuration management, across hosts and Kubernetes.
- English: B2 required.
Nice to have
- Operational exposure to NVIDIA GPUs with InfiniBand or RoCE, particularly with Blackwell and Hopper-class hardware.
- Exposure to Ray, Slurm, or other common compute and runtime schedulers.