Research Engineer (Infrastructure RL Systems)
Job description
TL;DR
Research Engineer (Infrastructure RL Systems): Design and build scalable, efficient infrastructure for large-scale reinforcement learning training and post-training workloads, with an emphasis on reliability, observability, and orchestration. Focus on optimizing distributed RL training pipelines, collaborating with researchers, and building production-grade systems for stable and fast reinforcement learning.
Location: San Francisco, California, USA
Salary: $350,000 - $475,000 USD per year
Company
The company advances collaborative general intelligence by building scalable AI infrastructure, widely used AI products, and open-source projects.
What you will do
- Design, build, and optimize infrastructure for large-scale reinforcement learning and post-training workloads.
- Improve reliability, scalability, and throughput of RL training pipelines and distributed workloads.
- Develop monitoring and observability tools to ensure high uptime and debuggability.
- Collaborate with researchers to translate algorithms into production-grade training pipelines.
- Build evaluation and benchmarking infrastructure to track model progress on helpfulness, safety, and factuality.
- Publish learnings through documentation, open-source libraries, or technical reports.
Requirements
- Location: Must be based in San Francisco or able to work onsite.
- Bachelor’s degree or equivalent in computer science, engineering, machine learning, or related fields.
- Strong engineering skills with ability to write performant, maintainable code and debug complex systems.
- Understanding of deep learning frameworks such as PyTorch and JAX and their system architectures.
- Experience or interest in reinforcement learning workloads and distributed training frameworks.
- Ability to collaborate across teams and take initiative in cross-stack projects.
Nice to have
- Experience training or supporting large-scale language models with tens of billions of parameters.
- Background in high-performance or reliability engineering, or in cluster orchestration (Kubernetes, Slurm).
- Familiarity with monitoring tools like Prometheus, Grafana, OpenTelemetry.
- Contributions to large-scale ML research, open-source frameworks, or performance optimization.
Culture & Benefits
- Visa sponsorship and relocation support available.
- Generous health, dental, and vision benefits.
- Unlimited paid time off and paid parental leave.
- Collaborative environment with cross-functional teams.