Research Engineer, RL Infrastructure and Reliability (Knowledge Work) (AI)
Job description
TL;DR
Research Engineer, RL Infrastructure and Reliability (Knowledge Work): Own the reliability, observability, and infrastructure foundation for Knowledge Work training environments and evaluations, with an emphasis on proactive hardening, stress testing at scale, and high-signal metrics. Focus on building stable systems, automating operational tooling, and driving incidents to resolution for partner teams.
Location: San Francisco, CA (hybrid policy: at least 25% of time in office)
Salary: $350,000 - $850,000 USD
Company
Fast-growing AI research organization building reliable, interpretable, and steerable AI systems like Claude.
What you will do
- Serve as the dedicated reliability owner for Knowledge Work training environments, providing continuity and reducing operational overhead
- Own clean, canonical evaluation tools and processes, including for model releases
- Build and automate observability, dashboards, and tooling with emphasis on trusted metrics and alerts
- Proactively harden systems via load testing, fault injection, and stress testing at realistic scale
- Act as the primary contact for partner teams on environment issues and drive incident resolution
- Reduce the operational burden on researchers so they can stay focused on research
Requirements
- Highly experienced Python engineer with a record of shipping reliable, well-instrumented production code
- Demonstrated experience operating ML or distributed systems at scale, including on-call and incident response
- Strong SRE or production-engineering mindset, including experience with SLOs, load tests, and failure injection
- Foundational ML knowledge to understand training environments, evaluations, and integrity issues
- Able to read research code and reason about evaluation integrity
- Bachelor’s degree or equivalent in a relevant field
Nice to have
- 5+ years operating ML or distributed systems at scale
- Experience with RL environments, agent harnesses, or LLM evaluation frameworks
- Familiarity with reward modeling, evaluation design, or reward hacking mitigation
- Experience with observability stacks, dashboard tooling, chaos engineering, or large-scale load testing
- Background in data quality pipelines, drift detection, or evaluation curation
- Familiarity with large-scale training/inference infrastructure
- Prior role as a reliability or operations owner in a research team
Culture & Benefits
- Collaborative team that approaches high-impact AI research as big science
- Competitive compensation, equity donation matching, generous vacation and parental leave
- Flexible working hours and a lovely office space in San Francisco
- Visa sponsorship available (with reasonable effort and immigration lawyer support)
- Emphasis on diverse perspectives and representation in AI development