Research Engineer (Infrastructure RL Systems)

$350,000 - $475,000

Work format

onsite

Employment type

full-time

Grade

mid-level/senior

English

Country

Listing from Hirify Global, a list of international tech companies

Job description

TL;DR

Research Engineer (Infrastructure RL Systems): Design and build scalable, efficient infrastructure for large-scale reinforcement learning training and post-training workloads, with an emphasis on reliability, observability, and orchestration. The role focuses on optimizing distributed RL training pipelines, collaborating with researchers, and building production-grade systems for stable, fast reinforcement learning.

Location: San Francisco, California, USA

Salary: $350,000 - $475,000 USD per year

Company

hirify.global advances collaborative general intelligence by building scalable AI infrastructure and widely used AI products and open-source projects.

What you will do

Design, build, and optimize infrastructure for large-scale reinforcement learning and post-training workloads.
Improve reliability, scalability, and throughput of RL training pipelines and distributed workloads.
Develop monitoring and observability tools to ensure high uptime and debuggability.
Collaborate with researchers to translate algorithms into production-grade training pipelines.
Build evaluation and benchmarking infrastructure for model progress on helpfulness, safety, and factuality.
Publish learnings through documentation, open-source libraries, or technical reports.

Requirements

Location: Must be based in San Francisco or able to work onsite there.
Bachelor’s degree or equivalent in computer science, engineering, machine learning, or related fields.
Strong engineering skills with ability to write performant, maintainable code and debug complex systems.
Understanding of deep learning frameworks such as PyTorch and JAX and their system architectures.
Experience or interest in reinforcement learning workloads and distributed training frameworks.
Ability to collaborate across teams and take initiative in cross-stack projects.

Nice to have

Experience training or supporting large-scale language models with tens of billions of parameters.
Background in high-performance or reliability engineering, or in cluster orchestration (Kubernetes, Slurm).
Familiarity with monitoring tools such as Prometheus, Grafana, and OpenTelemetry.
Contributions to large-scale ML research, open-source frameworks, or performance optimization.

Culture & Benefits

Visa sponsorship and relocation support available.
Generous health, dental, and vision benefits.
Unlimited paid time off and paid parental leave.
Collaborative environment with cross-functional teams.
