Shuyao Xu†, Cheng Peng, Jiangxuan Long†, Weidi Xu*
*: Team lead
† Work done during an internship at INF AI
— Apr. 23, 2025
<aside> ✨
We propose Reinforcement Distillation (REDI), an efficient approach for post-training large language models (LLMs) using offline RL and distilled data. Our REDI-1.5B-Preview model, fine-tuned from Qwen2.5-Math-1.5B on a curated 78k subset of the OpenR1 dataset (leveraging both positive and negative examples), achieves 83.1% on MATH-500 (pass@1). It performs comparably to or better than DeepSeek-R1-Distill-Qwen-1.5B on several math benchmarks, establishing a new state of the art for 1.5B models fine-tuned offline using openly available distilled data.
A key finding is that asymmetric weighting of positive and negative sample gradients during optimization significantly enhances training stability and performance, allowing us to surpass DPO/SimPO without KL regularization.
👨‍💻 GitHub: https://github.com/Tim-Siu/reinforcement-distillation
</aside>
| Model | AIME24 | AMC23 | MATH-500 | Minerva | Olympiad Bench | Avg. |
|---|---|---|---|---|---|---|
| DeepSeek-R1-Distill-Qwen-1.5B | 28.3 | 62.1 | 83.2 | 26.0 | 43.1 | 48.5 |
| SimpleRL-Zero | 4.2 | 35.0 | 59.0 | 20.2 | 21.0 | 27.9 |
| LUFFY | 15.2 | 46.8 | 79.4 | 26.5 | 42.4 | 42.1 |
| REDI-SFT-1.5B | 24.0 | 57.3 | 80.4 | 27.6 | 41.1 | 47.0 |
| REDI-1.5B-Preview | 28.1 | 62.4 | 83.1 | 28.8 | 45.2 | 49.5 |
Evaluations were run with the DeepScaleR evaluation setup or taken as reported in the corresponding reports. Scores are pass@1.
SimpleRL-Zero was trained from Qwen2.5-1.5B with on-policy RL; all other models listed are post-trained from Qwen2.5-Math-1.5B. LUFFY was trained on open data with both on-policy and off-policy RL; DeepSeek-R1-Distill-Qwen-1.5B was trained with closed data.
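For reference, pass@1 here denotes average per-sample accuracy: sample several completions per problem, score each, and average. A minimal sketch (the sampling count and the `generate`/`is_correct` helpers are placeholders, not our evaluation harness):

```python
from typing import Callable, Iterable

def pass_at_1(problems: Iterable, generate: Callable, is_correct: Callable, n: int = 16) -> float:
    """Estimate pass@1 by averaging correctness over n sampled completions per problem."""
    per_problem = []
    for problem in problems:
        samples = [generate(problem) for _ in range(n)]
        per_problem.append(sum(is_correct(problem, s) for s in samples) / n)
    return sum(per_problem) / len(per_problem)
```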
Many reasoning LLMs, such as DeepSeek-R1 and Gemini 2.5 Pro, output their thinking process, allowing small models to distill high-quality reasoning traces from large models. For example, OpenR1-Math-220k releases 220k math problems, each with two to four reasoning traces generated by DeepSeek-R1. Standard distillation practice uses rejection sampling, i.e., fine-tuning only on distilled positive examples while discarding negative ones. Off-policy RL methods such as DPO and its variants can utilize negative samples, but their effectiveness in the long chain-of-thought (CoT) reasoning domain is under-explored.
Our work demonstrates that the regularization effects in DPO-like algorithms limit their capacity to learn new knowledge, and that the key factor for training success lies in asymmetric weighting of positive and negative samples during optimization. By introducing this asymmetric weighting into a simple policy-gradient objective, we achieve better test-time performance as well as training stability. By properly leveraging both positive and negative examples in the off-policy regime, our method simultaneously increases test-time accuracy and training data efficiency compared with rejection sampling, while avoiding the “wasted inference” that comes with discarding negative examples. In the rest of this blog post, we walk through our dataset curation and training approach, present evaluation results, and share key insights from our findings.
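A minimal PyTorch sketch of this asymmetric weighting on top of an offline policy-gradient objective (illustrative only, not our exact training code; `alpha` is the illustrative name for the negative-sample weight, with `alpha < 1`):

```python
import torch

def asymmetric_pg_loss(
    pos_logprobs: torch.Tensor,  # length-normalized log-probs of verified-correct traces, shape (B_pos,)
    neg_logprobs: torch.Tensor,  # length-normalized log-probs of verified-incorrect traces, shape (B_neg,)
    alpha: float = 0.8,          # illustrative value; alpha < 1 down-weights the negative gradient
) -> torch.Tensor:
    # Increase the likelihood of positive traces with full weight...
    positive_term = -pos_logprobs.mean()
    # ...and decrease the likelihood of negative traces with a smaller weight.
    negative_term = alpha * neg_logprobs.mean()
    return positive_term + negative_term
```

With `alpha = 1` this reduces to a symmetric likelihood/unlikelihood objective; shrinking `alpha` is what trades off how aggressively probability mass is pushed away from negative traces.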
Our training data originates from the OpenR1-Math-Raw dataset, excluding the less challenging cn_k12 subset. We leveraged the dataset's own verification labels to construct two distinct sets:
- **REDI Positives**: 78k high-quality positive instances for SFT. Inclusion required that the dataset's pre-existing labels mark an example as correct according to both the Llama-3-70B judge and the Math-Verify tool. This strict verification keeps the dataset size manageable for experiments.
- **REDI Pairs**: approximately 53k preference pairs designed for DPO, SimPO, and REDI training. To form these pairs, we identified prompts within OpenR1-Math-Raw that had at least one response meeting the dual-correct criteria (as in REDI Positives) and at least one response flagged as incorrect by both label types. Each pair then consists of one such verified-correct response and one such verified-incorrect response to the same prompt; the filtering logic is sketched below.
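The curation can be sketched roughly as follows. The dataset repository id and column names (e.g. `generations`, `correctness_math_verify`, `correctness_llama`) are assumptions about the OpenR1-Math-Raw schema, not verified identifiers:

```python
from datasets import load_dataset

# Sketch only: repo id and column names are assumptions about the raw schema.
raw = load_dataset("open-r1/OpenR1-Math-Raw", split="train")
raw = raw.filter(lambda ex: ex["source"] != "cn_k12")  # drop the easier subset

positives, pairs = [], []
for ex in raw:
    correct, incorrect = [], []
    for gen, ok_verify, ok_llama in zip(
        ex["generations"], ex["correctness_math_verify"], ex["correctness_llama"]
    ):
        if ok_verify and ok_llama:            # correct under both verifiers
            correct.append(gen)
        elif not ok_verify and not ok_llama:  # incorrect under both verifiers
            incorrect.append(gen)

    # REDI Positives: every dual-verified correct trace becomes an SFT example.
    positives.extend({"problem": ex["problem"], "solution": g} for g in correct)

    # REDI Pairs: prompts with at least one correct and one incorrect trace.
    if correct and incorrect:
        pairs.append(
            {"problem": ex["problem"], "chosen": correct[0], "rejected": incorrect[0]}
        )
```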
We established SFT baselines by fine-tuning the Qwen2.5-Math-1.5B model on our REDI Positives (78k) dataset: