TL;DR

Research Engineer (Agentic Models): Design, implement, and maintain SFT and RL post-training pipelines for multi-step coding agents, with an emphasis on adapting models to agent workflows and building evaluation environments. The role also covers designing evaluation frameworks, analyzing results, and improving model architectures and datasets.

Location: Remote from Germany, or onsite in the Netherlands, Serbia, Germany, Cyprus, the United Kingdom, the Czech Republic, Poland, or Armenia.

Company

%hirify_global%: Builds powerful and effective developer tools and is increasingly integrating AI-powered assistance and agents into its IDEs.

What you will do

Design, implement, and maintain SFT and RL post-training pipelines for multi-step coding agents.
Train and adapt LLMs for agent workflows, including planning and tool use within %hirify_global% IDEs.
Build evaluation and simulation environments for coding agents on realistic developer tasks.
Design evaluation frameworks and metrics, analyze traces and logs, and close the loop from evaluation back into training and data.
Analyze training and evaluation results to propose and implement improvements to model architectures and datasets.
Work with large-scale infrastructure, including distributed GPU and MapReduce clusters.

Requirements

Hands-on experience training LLMs (pre-training, fine-tuning, or post-training) in a research or production setting.
Experience with a modern deep learning framework, such as PyTorch, and specialized LLM training stacks.
Solid understanding of LLM training basics: tokenization, data pipelines, batching, mixed precision, and distributed training.
Ability to own projects end to end, from design and experimentation through implementation and iteration.
A product-aware mindset, translating product needs and failure modes into modeling and evaluation work.
At least 3 years of experience writing clean, maintainable Python code in modern ML codebases.

Nice to have

Experience with ML orchestrators and workflow tools such as Kubeflow, Dagster, or Airflow, or with cluster orchestration platforms like Kubernetes.
Experience with large-scale data and training pipelines (MapReduce-style clusters, multi-node GPU training).
Experience designing and maintaining evaluation pipelines for LLMs or agents, including metrics, dashboards, and automated regression checks.
Experience developing AI agents, such as tool-using agents, planners, or multi-step coding workflows.
Experience with experiment tracking and observability tools such as Weights & Biases or MLflow.
Experience with inference optimization and serving optimized models in production.