TL;DR
Research Scientist (AI/LLM Frontier Models): Build and optimize post-training for frontier models, especially Gemini, with an emphasis on architecting Reward Modeling and Reinforcement Learning strategies for hard capabilities such as chain-of-thought reasoning. Focus on designing novel post-training pipelines, advancing reward models, and solving the "flywheel" challenge of continuous model improvement across multimodal domains.
Location: Zurich, Switzerland (Onsite)
Company
hirify.global is a team of scientists and engineers working to advance state-of-the-art AI, focusing on widespread public benefit, scientific discovery, safety, and ethics.
What you will do
- Design and validate novel post-training pipelines (SFT, RLHF, RLAIF) specifically for frontier-class models.
- Lead research into next-gen Reward Models, including investigating new architectures and improving signal-to-noise ratios.
- Develop innovative methods to improve the model's internal reasoning (chain-of-thought), focusing on correctness, logic, and self-correction.
- Critically re-evaluate and optimize RL prompts and feedback mechanisms to extract maximum performance from base models.
- Create robust mechanisms to turn user signals and interactions into training data for continuous model improvement.
- Collaborate across teams to apply advanced recipes to various model sizes and modalities (e.g., Audio).
Requirements
- PhD in machine learning, artificial intelligence, or computer science (or equivalent practical experience).
- Strong background in Large Language Models (LLMs), Reinforcement Learning (RL), or preference learning.
- Research interest in aligning AI systems with human feedback and utility.
- Familiarity with experiment design and analyzing large-scale user data.
- Strong coding and communication skills.
Nice to have
- Experience with RLHF (Reinforcement Learning from Human Feedback) or DPO (Direct Preference Optimization).
- Experience building or improving reward models and conducting human evaluation studies.
- A proven track record of publications in top-tier conferences (e.g., NeurIPS, ICML, ICLR).
- Experience with Chain-of-Thought (CoT) reasoning research or process-based supervision.
- Deep understanding of, and experience with, training models from scratch or using self-play/self-improvement techniques.
Culture & Benefits
- Fosters an environment where ambitious, long-term research flourishes.
- Committed to diversity of experience, knowledge, backgrounds, and perspectives.
- Ensures safety and ethics are the highest priority in AI development.
- Provides equal employment opportunity regardless of protected characteristics.
- Offers accommodation for disabilities or additional needs.