DeepSeek-R1 Training Phases
DeepSeek-R1 (DeepSeek-AI et al., 2025) goes through a four-stage training pipeline, as shown in Figure 1.

Figure 1: The four stages of the DeepSeek-R1 training pipeline. CS SFT and RORL stand for Cold-Start SFT and Reasoning-Oriented Reinforcement Learning, respectively. (Image Source: Harris Chan)
1. Cold-Start. This involves constructing and collecting a small amount of long Chain-of-Thought (CoT) data to fine-tune the DeepSeek-V3-Base model. The purpose of this initial Supervised Fine-Tuning (SFT) is to prevent the unstable early phase of RL training that occurs when starting directly from the base model, and to ensure human-friendly, readable output and formatting. This addresses a key limitation observed in DeepSeek-R1-Zero, which struggled with poor readability and language mixing due to its pure RL approach.
This data is constructed and collected through several approaches: few-shot prompting with long Chain-of-Thought (CoT) examples, directly prompting models to generate detailed answers with reflection and verification, and gathering readable outputs from DeepSeek-R1-Zero, which are then refined by human annotators through post-processing.
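As a concrete illustration, here is a minimal sketch of how a cold-start SFT example could be assembled into a readable template with a reasoning block followed by a summary. The marker string and field names are placeholders of my own choosing, not the exact format used by DeepSeek-R1.

```python
# Sketch: assembling a cold-start SFT example in a readable template.
# The |special_token| marker and the field names are placeholders for
# illustration, not the exact format used in DeepSeek-R1 training.

def build_cold_start_example(question: str, reasoning: str, summary: str) -> dict:
    """Wrap a long chain-of-thought and a human-readable summary into one target."""
    target = f"|special_token|{reasoning}|special_token|{summary}"
    return {"prompt": question, "completion": target}

example = build_cold_start_example(
    question="What is the determinant of [[1, 2], [3, 4]]?",
    reasoning="det = (1)(4) - (2)(3) = 4 - 6 = -2.",
    summary="The determinant is -2.",
)
print(example["completion"])
```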
Phi-4-Mini-Reasoning (Xu et al., 2025) uses a combination of automatic methods, such as Math-Verify, and LLMs as judges to assess the correctness of the generated answers. To maintain dataset balance, each sample is annotated with attributes including the domain category, the difficulty level, and the presence of repetitive patterns.
2. Reasoning-oriented Reinforcement Learning. The Reinforcement Learning with Verifiable Rewards (RLVR) phase applies the same large-scale RL training process as used in DeepSeek-R1-Zero, which relies on the Group Relative Policy Optimization (GRPO) algorithm. This phase specifically focuses on improving the model’s reasoning capabilities in domains with well-defined problems and clear solutions, such as coding, mathematics, science, and logical reasoning. To mitigate the language mixing issues observed in DeepSeek-R1-Zero, DeepSeek-R1’s RL training in this stage introduces a “language consistency reward”.
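Below is a minimal sketch of how a group of sampled responses might be scored in this stage, combining an accuracy reward with a language-consistency bonus and normalising rewards within the group in the GRPO style. The 0.1 weight and the word-ratio heuristic are illustrative assumptions, not the paper's exact reward definition.

```python
import statistics

# Sketch: GRPO-style group-relative advantages with a combined reward.
# The lam=0.1 weight and the "fraction of target-language words" heuristic
# are illustrative assumptions, not DeepSeek-R1's exact reward definition.

def language_consistency(text: str, target_words: set) -> float:
    """Rough proxy: fraction of whitespace-separated tokens found in a target-language vocabulary."""
    tokens = text.split()
    return sum(t in target_words for t in tokens) / max(len(tokens), 1)

def grpo_advantages(responses, accuracies, target_words, lam=0.1):
    """Combine accuracy and language-consistency rewards, then normalise within the group."""
    rewards = [acc + lam * language_consistency(resp, target_words)
               for resp, acc in zip(responses, accuracies)]
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # avoid division by zero when all rewards are equal
    return [(r - mean) / std for r in rewards]

# Example: two sampled answers to the same prompt, one correct and one not.
adv = grpo_advantages(
    responses=["the answer is 42", "la réponse est 7"],
    accuracies=[1.0, 0.0],
    target_words={"the", "answer", "is", "42"},
)
print(adv)
```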
Math-Verify is utilised by many open projects for automatic verification in the math domain (Xu et al., 2025; Guha et al., 2025; He et al., 2025; Harris et al., 2025; Zhao et al., 2025).
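For reference, a minimal usage sketch is shown below, assuming the parse/verify interface of the open-source math-verify package; the example answers are made up.

```python
# Sketch of rule-based answer checking with the math-verify package
# (pip install math-verify). The example strings are illustrative.
from math_verify import parse, verify

gold = parse("$\\frac{1}{2}$")
prediction = parse("$0.5$")

# verify() returns True when the parsed prediction is mathematically
# equivalent to the gold answer, which is the kind of automatic check
# used to grade reasoning traces in the math domain.
print(verify(gold, prediction))
```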
3. Rejection Sampling and Supervised Fine-Tuning. When reasoning-oriented RL converges, the resulting checkpoint is used to collect SFT (Supervised Fine-Tuning) data for the subsequent round. Unlike the initial cold-start data, which primarily focuses on reasoning, this stage incorporates data from other domains to enhance the model’s capabilities in writing, role-playing, and other general-purpose tasks.
Reasoning data. Reasoning prompts are used to generate reasoning traces by performing rejection sampling from the checkpoint of the previous RL stage: multiple responses are sampled for each prompt and only the correct ones are retained. Whereas the previous stage only included data that could be evaluated with rule-based rewards, this stage expands the dataset with additional data, some of which is judged by a generative reward model that feeds the ground truth and the model prediction into DeepSeek-V3. Additionally, because the model output is sometimes chaotic and difficult to read, chains-of-thought with mixed languages, long paragraphs, and code blocks are filtered out. The resulting dataset consists of about 600k samples.
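A minimal sketch of this rejection-sampling-plus-filtering loop is given below. Here generate and is_correct are placeholders for the RL checkpoint's sampler and a rule-based or DeepSeek-V3-based judge, and the readability thresholds are assumptions.

```python
# Sketch of rejection sampling for reasoning SFT data. `generate` and `is_correct`
# are placeholders for the RL checkpoint's sampler and a rule-based or generative
# judge; the readability thresholds are illustrative assumptions. A mixed-language
# check is omitted for brevity.

CODE_FENCE = chr(96) * 3  # literal triple backtick

def looks_readable(cot: str) -> bool:
    """Drop chains-of-thought containing code blocks or very long paragraphs."""
    has_code_block = CODE_FENCE in cot
    has_long_paragraph = any(len(p) > 2000 for p in cot.split("\n\n"))
    return not (has_code_block or has_long_paragraph)

def rejection_sample(prompts, generate, is_correct, n_samples=16):
    """Sample several responses per prompt and keep only correct, readable ones."""
    kept = []
    for prompt in prompts:
        for response in generate(prompt, n=n_samples):
            if is_correct(prompt, response) and looks_readable(response):
                kept.append({"prompt": prompt, "response": response})
    return kept
```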
The rejected samples are not always completely discarded. Phi-4-Mini-Reasoning (Xu et al., 2025) also re-purposes the rejected (incorrect) rollouts for its “Rollout Preference Learning” stage. These incorrect responses, especially those that differ from correct ones only in minor ways, are used to construct informative preference pairs, with correct answers designated as “preferred” and incorrect ones as “dis-preferred” for Direct Preference Optimization (DPO).
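A sketch of how such preference pairs could be assembled is shown below; the one-to-one pairing of correct and incorrect rollouts per prompt is a simplifying assumption.

```python
# Sketch: turning correct/incorrect rollouts into DPO preference pairs.
# The one-to-one pairing per prompt is a simplifying assumption; the output
# format matches the (prompt, chosen, rejected) fields commonly used for DPO.

def build_preference_pairs(rollouts_by_prompt):
    """rollouts_by_prompt maps a prompt to a list of (response, is_correct) tuples."""
    pairs = []
    for prompt, rollouts in rollouts_by_prompt.items():
        correct = [resp for resp, ok in rollouts if ok]
        incorrect = [resp for resp, ok in rollouts if not ok]
        for chosen, rejected in zip(correct, incorrect):
            pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs

pairs = build_preference_pairs({
    "Solve 2x + 3 = 7": [("x = 2", True), ("x = 5", False)],
})
print(pairs)
```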
Non-Reasoning Data. For non-reasoning data, such as writing, factual QA, self-cognition, and translation, the DeepSeek-V3 pipeline is used and portions of the SFT dataset of DeepSeek-V3 are reused. For certain non-reasoning tasks, DeepSeek-V3 is prompted to generate a potential chain-of-thought before answering the question. However, for simpler queries, such as “hello”, no CoT is provided in the response. In the end, approximately 200k training samples unrelated to reasoning were collected.
DeepSeek-V3-Base is then fine-tuned for two epochs using this curated dataset of about 800k samples.
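As a rough illustration of this final SFT step, the sketch below mixes the two splits and runs a two-epoch fine-tune with TRL's SFTTrainer. The file names are placeholders (the actual ~600k/~200k splits are not publicly released), and each record is assumed to carry a "text" field.

```python
# Sketch of the stage-3 SFT mixture and a two-epoch fine-tune with TRL's SFTTrainer.
# File names are placeholders for the ~600k reasoning and ~200k non-reasoning splits,
# and each JSONL record is assumed to contain a "text" field.
from datasets import concatenate_datasets, load_dataset
from trl import SFTConfig, SFTTrainer

reasoning = load_dataset("json", data_files="reasoning_600k.jsonl", split="train")
general = load_dataset("json", data_files="general_200k.jsonl", split="train")
mixture = concatenate_datasets([reasoning, general]).shuffle(seed=42)

trainer = SFTTrainer(
    model="deepseek-ai/DeepSeek-V3-Base",  # any causal LM identifier works here
    train_dataset=mixture,
    args=SFTConfig(output_dir="sft-stage3", num_train_epochs=2),
)
trainer.train()
```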
4. Reinforcement Learning for All Scenarios. To further align the model with human preferences, a second reinforcement learning stage is applied, aimed at improving the model’s helpfulness and harmlessness while simultaneously refining its reasoning capabilities. Specifically, the model is trained using a combination of reward signals and diverse prompt distributions. For reasoning data, the methodology outlined in DeepSeek-R1-Zero is followed, which utilizes rule-based rewards to guide the learning process in math, code, and logical reasoning domains.
For general data, reward models are used to capture human preferences in complex and nuanced scenarios. Specifically, for helpfulness, the assessment focuses exclusively on the final summary to emphasise the utility and relevance of the response. For harmlessness, the entire response (reasoning process and summary) is evaluated to mitigate potential risks, biases, or harmful content.
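The sketch below illustrates this reward routing: rule-based checks for reasoning prompts, and reward models for general prompts, with helpfulness scored on the final summary only and harmlessness on the full response. The function names, the summary-splitting convention (reusing the placeholder template from the cold-start sketch), and the unweighted sum are my own assumptions.

```python
# Sketch of the stage-4 reward routing. Rule-based checks score reasoning prompts;
# reward models score general prompts, with helpfulness judged on the final summary
# only and harmlessness on the full response. The marker reuses the placeholder
# template from the cold-start sketch; the unweighted sum is an assumption.

def split_summary(response: str) -> str:
    """Return the text after the last marker, assumed to be the user-facing summary."""
    return response.rsplit("|special_token|", 1)[-1]

def total_reward(prompt, response, is_reasoning, rule_check,
                 helpfulness_rm, harmlessness_rm):
    if is_reasoning(prompt):
        # e.g. an exact-match or Math-Verify style check on the final answer
        return float(rule_check(prompt, response))
    helpful = helpfulness_rm(prompt, split_summary(response))   # summary only
    harmless = harmlessness_rm(prompt, response)                # reasoning + summary
    return helpful + harmless
```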
Citation
If you found this post useful for your work, please consider citing it as:
Razvan Florian Vasile. (July 2025). "LinAlgZero: Literature Review". Atom Blog. Retrieved from https://atomwalk12.github.io/posts/lingalgzero/literature/.
or
@misc{vasile2025linalgzero,
title = "LinAlgZero: Literature Review",
author = "Razvan Florian Vasile",
note = "Personal blog",
year = "2025",
month = "July",
url = "https://atomwalk12.github.io/posts/lingalgzero/literature/"
}
References
- [1] Zhao Han, et al. (2025). 1.4 Million Open-Source Distilled Reasoning Dataset to Empower Large Language Model Training. https://doi.org/10.48550/arXiv.2503.19633
- [2] DeepSeek-AI et al. (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. https://doi.org/10.48550/arXiv.2501.12948
- [3] Guha Etash, et al. (2025). OpenThoughts: Data Recipes for Reasoning Models. https://doi.org/10.48550/arXiv.2506.04178
- [4] Xu Haoran, et al. (2025). Phi-4-Mini-Reasoning: Exploring the Limits of Small Reasoning Language Models in Math. https://doi.org/10.48550/arXiv.2504.21233
- [5] He Jujie, et al. (2025). Skywork Open Reasoner Series. Retrieved from https://capricious-hydrogen-41c.notion.site/Skywork-Open-Reaonser-Series-1d0bc9ae823a80459b46c149e4f51680
- [6] Harris et al. (2025). SYNTHETIC-1 Release. Retrieved from https://www.primeintellect.ai/blog/synthetic-1-release