The following notes are adapted from (Zhang et al., 2025). I will revise them as my understanding improves.
Supervised Fine-tuning
Definition 1 (SFT Loss). Given a dataset $\mathcal{D}_{\text{SFT}} \triangleq \left\{ (q_i, c_i) \right\}_{i=1}^{|\mathcal{D}_{\text{SFT}}|}$, each sample $(q_i, c_i)$ consists of a question $q_i$ and a long CoT $c_i$, which can be further decomposed into a complex intermediate rationale followed by a final answer. SFT updates the parameters of the policy model $\pi_{\theta}$ by minimizing the negative log-likelihood loss:
$$ \begin{aligned} \mathcal{L}_{\text{SFT}}(\theta) \triangleq - \mathbb{E}_{(q, c) \sim \mathcal{D}_{\text{SFT}}} \left[ \log \pi_{\theta}(c \mid q) \right], \end{aligned} $$
where $\pi_{\theta}(c \mid q)$ denotes the probability assigned by the policy to the CoT response $c$ conditioned on the question $q$. This objective encourages the model to imitate the supervised demonstrations by maximizing the likelihood of the reference completions.
When each sample is equally likely to be drawn, the expectation reduces to an average of the log-likelihood over all training examples in the dataset:$$ \mathbb{E}_{(q,c) \sim \mathcal{D}_{\text{SFT}}}[\log \pi_\theta(c \mid q)] = \frac{1}{|\mathcal{D}_{\text{SFT}}|} \sum_{i=1}^{|\mathcal{D}_{\text{SFT}}|} \log \pi_\theta(c_i \mid q_i) $$
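As a concrete illustration, below is a minimal PyTorch sketch of this loss, assuming the per-token logits over the concatenated sequence $[q; c]$ are already available; the tensor shapes and the `prompt_lens` masking convention are illustrative choices of mine, not tied to any particular training library.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, labels: torch.Tensor, prompt_lens: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of the CoT completion c given the question q.

    logits:      (B, T, V) next-token logits of the policy pi_theta over [q; c]
    labels:      (B, T)    token ids of the concatenated [q; c] sequence
    prompt_lens: (B,)      number of question tokens per sample (excluded from the loss)
    """
    # Shift so that the logits at position t predict token t+1.
    logits = logits[:, :-1, :]
    targets = labels[:, 1:].clone()

    # Mask question tokens so that only log pi_theta(c | q) contributes.
    positions = torch.arange(targets.size(1), device=targets.device).unsqueeze(0)
    targets[positions < (prompt_lens.unsqueeze(1) - 1)] = -100  # ignore_index

    return F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        targets.reshape(-1),
        ignore_index=-100,
    )

# Toy usage with random tensors standing in for a real model's outputs.
B, T, V = 2, 8, 32
loss = sft_loss(torch.randn(B, T, V), torch.randint(0, V, (B, T)), torch.tensor([3, 4]))
print(loss.item())
```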
LLM Policy Optimization
Recent studies have introduced a groundbreaking post-training paradigm that enhances LLMs' reasoning capabilities through RL-based training. In this framework, the LLM's answer generation process for each query is formulated as an answer sampling policy, and our objective is to optimize this LLM policy to maximize the expected reward of the generated responses. According to (DeepSeek-AI et al., 2025; Hu et al., 2025; Team et al., 2025), large-scale RL-based LLM policy optimization enables the base LLM to achieve a steady improvement in reasoning accuracy while also exhibiting the emergence of long-chain reasoning in its chain-of-thought.
Definition 2 (LLM Policy Optimization). Suppose each reasoning data pair $(q,a)$ is sampled i.i.d. from an underlying distribution $\mathcal{D}$, where $q$ is a query and $a$ is the corresponding ground-truth answer. Let $\pi_{\theta}(\cdot | \cdot)$ be the target LLM policy parameterized by $\theta$. The expected reward of the LLM on a sample $(q,a)$ is $\mathbb{E}_{o\sim \pi_{\theta}(\cdot | q)} [r(o, a)]$, where $o$ is an LLM-generated response to $q$, and $r(\cdot,\cdot)$ is a predefined reward function that quantifies whether the response $o$ yields $a$. The objective of RL-based fine-tuning is to maximize the expected reward over the data distribution, i.e.,
$$ \begin{align*} \max_{\theta} J(\pi_{\theta}) \triangleq \mathbb{E}_{(q,a)\sim \mathcal{D}} \mathbb{E}_{o\sim \pi_{\theta}(\cdot|q)} [r(o, a)]. \end{align*} $$
A straightforward approach to maximizing $J(\pi_{\theta})$ is to gradually update the LLM's parameters $\theta$ along the policy gradient direction $\nabla_{\theta} J(\pi_{\theta})$. However, since $ \nabla_{\theta} \mathbb{E}_{o \sim \pi_{\theta}(\cdot | q)} r(o, a) $ is the gradient of an integral that depends on $ \pi_\theta $, $ \nabla_{\theta} J(\pi_{\theta}) $ cannot be estimated directly via standard Monte Carlo sampling. Fortunately, the RL community has developed two powerful policy gradient estimators: REINFORCE (Williams, 1992) and Importance Sampling (Sutton & Barto, 2018):
$$\small \begin{align*} \nabla_{\theta} \mathbb{E}_{o\sim \pi_{\theta}(\cdot|q)} r(o, a) = \begin{cases} \mathbb{E}_{o \sim \pi_{\theta}(\cdot|q)} \left[ \nabla_{\theta} \log \pi_{\theta}(o|q) \cdot r(o, a) \right]\ &\text{(REINFORCE)}, \\ \mathbb{E}_{o \sim \pi_{\theta'}(\cdot|q)} \left[ \nabla_{\theta} \left( \frac{\pi_{\theta}(o|q)}{\pi_{\theta'}(o|q)} \right) \cdot r(o, a) \right]\ &\text{(Importance Sampling)}, \end{cases} \end{align*} $$
where $\pi_{\theta'}$ is any parameter-frozen LLM policy. Hence, the policy gradient $\nabla_{\theta} J(\pi_\theta)$ can be effectively approximated using standard Monte Carlo sampling: for each data pair $(q,a)$, we independently generate $G$ responses to $q$, denoted by $\{o_i\}_{i=1}^G$, using the current LLM $\pi_\theta$ or the frozen LLM $\pi_{\theta'}$, and then approximate the policy gradient estimators by
$$\footnotesize \begin{align*} \nabla_{\theta} J(\pi_\theta) = \begin{cases} \mathbb{E}_{(q,a)\sim \mathcal{D},\{o_i\}_{i=1}^G\sim \pi_{\theta}(\cdot|q)} \left[ \frac{1}{G} \sum_{i=1}^G \nabla_{\theta} \log \pi_{\theta}(o_i|q) \cdot r(o_i, a) \right] &\text{(REINFORCE)}, \\ \mathbb{E}_{(q,a)\sim \mathcal{D},\{o_i\}_{i=1}^G\sim \pi_{\theta'}(\cdot|q)} \left[ \frac{1}{G} \sum_{i=1}^G \nabla_{\theta} \left( \frac{\pi_{\theta}(o_i|q)}{\pi_{\theta'}(o_i|q)} \right) \cdot r(o_i, a) \right] & \text{(Importance Sampling)}, \end{cases} \end{align*} $$
For each query $q$, the procedure of generating $G$ independent responses $\{o_{i}\}_{i=1}^G$ from $\pi_{\theta}(\cdot|q)$ is called the "rollout phase". During this phase, the LLM policy explores enormous response samples of varying quality. Then $\theta$ is updated to increase the likelihood $ \pi_{\theta}(o_i|q) $ where $ r(o_i, a)$ is large, thereby improving the likelihood of generating responses with high rewards. Specifically, REINFORCE is an on-policy method that requires generating new rollouts using the latest LLM policy $ \pi_{\theta} $. In contrast, the importance sampling estimator can be implemented in an off-policy manner with improved sampling efficiency, as it can reuse past rollouts generated from $\pi_{\theta'}$ by storing the corresponding probability terms $ \pi_{\theta'}(o_i | q)$. A common choice is to implement $\pi_{\theta'}$ as $\pi_{\theta_{\mathrm{old}}}$, a past snapshot of the target LLM $\pi_\theta$, which is updated periodically.
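To make the estimator concrete, here is a toy PyTorch sketch of REINFORCE in which the LLM policy is replaced by a single categorical distribution over five candidate responses (a real policy factorizes token by token); it only illustrates the shape of the Monte Carlo estimate, not an actual training loop.

```python
import torch

# Toy stand-in for an LLM policy: a single categorical distribution over five
# candidate responses to one query (a real policy factorizes over tokens).
logits = torch.zeros(5, requires_grad=True)       # parameters theta of pi_theta
reward = torch.tensor([0.0, 0.0, 1.0, 0.0, 0.0])  # r(o, a): only response 2 is correct

G = 64                                            # rollouts per query
dist = torch.distributions.Categorical(logits=logits)
o = dist.sample((G,))                             # rollout phase: G sampled responses

# REINFORCE surrogate loss: its gradient is the Monte Carlo estimator
#   -(1/G) * sum_i grad log pi_theta(o_i | q) * r(o_i, a),
# so a gradient-descent step increases the likelihood of high-reward responses.
surrogate = -(dist.log_prob(o) * reward[o]).mean()
surrogate.backward()
print(logits.grad)
```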
In practice, the reward signals $\{r(o_i, a)\}_{i=1}^G$ are highly sparse, leading to high variance in the rollout phase and in the policy gradient estimation. To mitigate these issues, various techniques have been developed to stabilize the LLM policy gradient estimators above. They generally fall into three categories: 1) reducing sampling variance through reward normalization or actor-critic advantage estimation, 2) stabilizing parameter updates by clipping the importance sampling weight $ \pi_\theta(o_i|q) / \pi_{\theta_{\mathrm{old}}}(o_i|q)$, and 3) constraining policy shifts by penalizing the KL-divergence $\mathrm{KL}(\pi_{\theta} | \pi_{\mathrm{ref}})$ between the current LLM policy $\pi_{\theta}$ and a fixed reference LLM policy $\pi_{\mathrm{ref}}$.
RLVR
PPO
Since its introduction in (Schulman et al., 2017), Proximal Policy Optimization (PPO) has become one of the most popular actor-critic RL algorithms for LLM policy optimization (Ouyang et al., 2022; Hu et al., 2025). In addition to the target LLM policy $\pi_\theta$, which serves as the actor model, PPO introduces a critic model $V_{\phi}$, another LLM trained to estimate the value of the responses generated by the actor $\pi_\theta$. Specifically, the PPO objective is
$$\small \begin{equation*} \begin{aligned} J_{\mathrm{PPO}}(\pi_\theta) \triangleq &\ \mathbb{E}_{(q, a) \sim \mathcal{D},\left\{o_i\right\}_{i=1}^G \sim \pi_{\theta_{\text{old}}}(\cdot \mid q)} \\ \Bigg[ & \frac{1}{G} \sum_{i=1}^G \frac{1}{\left| o_i \right|} \sum_{t=1}^{\left|o_i\right|} \left( \min \Big( r_{i, t}(\theta) \textcolor{red}{\hat{A}_{i, t}(\phi)}, \textcolor{red}{\operatorname{clip}}\left(r_{i, t}(\theta), 1 - \varepsilon, 1 + \varepsilon\right) \textcolor{red}{\hat{A}_{i, t}(\phi)} \Big) \right) \Bigg], \end{aligned} \end{equation*} $$
where $r_{i,t}(\theta)\triangleq \pi_{\theta}(o_{i,t}|q,o_{i, < t})/ \pi_{\theta_{\mathrm{old}}}(o_{i,t}|q,o_{i, < t})$ denotes the likelihood ratio between the current LLM policy $\pi_\theta$ and the past LLM policy $\pi_{\theta_{\mathrm{old}}}$ at the $t$-th token prediction step; $\hat{A}_{i, t}(\phi)$ denotes the Generalized Advantage Estimator (GAE) (Schulman et al., 2018) computed from the estimated values $V_{\phi}(o_{i,t}|q,o_{i, < t})$, which quantify the quality of each response generation state. $ V_{\phi} $ is trained jointly with $\pi_{\theta}$ to predict the value of the responses generated by $\pi_{\theta}$. In practice (Hu et al., 2025), GAE is observed to be a more robust response quality estimator than the raw reward $r(o_i, a)$, leading to more stable LLM policy optimization.
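Below is a minimal sketch of the clipped surrogate, assuming the per-token log-probabilities under $\pi_\theta$ and $\pi_{\theta_{\mathrm{old}}}$ and the GAE advantages have already been computed during the rollout phase; the function and argument names are illustrative.

```python
import torch

def ppo_clip_objective(logp_new, logp_old, advantages, mask, eps=0.2):
    """Token-level PPO surrogate for one group of G rollouts (to be maximized).

    logp_new:   (G, T) log pi_theta(o_{i,t} | q, o_{i,<t}) under the current policy
    logp_old:   (G, T) the same log-probabilities under pi_theta_old (frozen at rollout time)
    advantages: (G, T) GAE estimates A_hat_{i,t}(phi) from the critic V_phi
    mask:       (G, T) 1 for response tokens, 0 for padding
    """
    ratio = torch.exp(logp_new - logp_old)                          # r_{i,t}(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    per_token = torch.min(unclipped, clipped)

    # Average over the |o_i| tokens of each response, then over the G responses.
    per_response = (per_token * mask).sum(dim=1) / mask.sum(dim=1)
    return per_response.mean()

# Toy usage with random stand-ins for real rollout statistics.
G, T = 4, 16
objective = ppo_clip_objective(torch.randn(G, T), torch.randn(G, T),
                               torch.randn(G, T), torch.ones(G, T))
print(objective.item())
```

Since the objective is maximized, its negation is the loss handed to the optimizer; the regression step that trains the critic $V_\phi$ is omitted here.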
GRPO
Group Relative Policy Optimization (GRPO) was first proposed in (DeepSeek-AI et al., 2025) as an effective and efficient variant of PPO. Specifically, GRPO discards the critic model and the GAE computation of PPO to improve efficiency and reduce memory consumption. To reduce reward sampling variance, GRPO normalizes the rewards within a group of $G$ rollouts. In addition to clipping the likelihood ratio terms, GRPO further introduces a KL-divergence penalty to ensure that $\pi_\theta$ is not driven far away from the initial SFT LLM. Specifically, the GRPO objective is
$$\footnotesize \begin{equation*} \begin{aligned} J_{\mathrm{GRPO}}(\pi_\theta) \triangleq &\ \mathbb{E}_{(q, a) \sim \mathcal{D},\left\{o_i\right\}_{i=1}^G \sim \pi_{\theta_{\text{old}}}(\cdot \mid q)} \\ \Bigg[ & \frac{1}{G} \sum_{i=1}^G \frac{1}{\left| o_i \right|} \sum_{t=1}^{\left|o_i\right|} \left( \min \Big( r_{i, t}(\theta) \textcolor{red}{\hat{A}_{i, t}}, \operatorname{clip}\left(r_{i, t}(\theta), 1 - \varepsilon, 1 + \varepsilon\right) \textcolor{red}{\hat{A}_{i, t}} \Big) - \textcolor{red}{\beta\, \mathrm{KL}(\pi_{\theta}|\pi_{\mathrm{ref}})_{i,t}} \right) \Bigg], \end{aligned} \end{equation*} $$
where $\hat{A}_{i,t} \triangleq (r(o_i,a) - \mathrm{mean}(\mathbf{r}))/\mathrm{std}(\mathbf{r})$ denotes the group relative advantage, and $\mathbf{r} \triangleq \{r(o_j,a)\}_{j=1}^G$ denotes the rewards of the response group corresponding to each sample $(q,a)$. GRPO also incorporates the K3 KL-divergence estimator (Schulman, 2020):
$$ \begin{align*} \mathrm{KL}(\pi_{\theta}|\pi_{\mathrm{ref}})_{i,t} \triangleq \frac{\pi_{\mathrm{ref}}(o_{i,t}|q,o_{i, < t})}{\pi_{\theta}(o_{i,t}|q,o_{i, < t})} - \log \frac{\pi_{\mathrm{ref}}(o_{i,t}|q,o_{i, < t})}{\pi_{\theta}(o_{i,t}|q,o_{i, < t})} - 1. \end{align*} $$
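The two GRPO-specific ingredients, the group-relative advantage and the K3 KL estimator, can be sketched directly; the clipped surrogate itself is the same as in the PPO sketch above. The toy rewards and random log-probabilities below are stand-ins for real rollout statistics.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """GRPO advantage: z-score of r(o_i, a) within the group of G rollouts."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def k3_kl(logp_theta: torch.Tensor, logp_ref: torch.Tensor) -> torch.Tensor:
    """Per-token K3 estimator of KL(pi_theta | pi_ref)."""
    log_ratio = logp_ref - logp_theta
    return torch.exp(log_ratio) - log_ratio - 1.0

# Toy usage: a group of G = 4 rollouts with binary correctness rewards.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
adv = group_relative_advantages(rewards)  # A_hat_i, broadcast to every token of o_i

logp_theta = torch.randn(4, 16)           # per-token log-probs under pi_theta
logp_ref = torch.randn(4, 16)             # per-token log-probs under pi_ref
kl = k3_kl(logp_theta, logp_ref)          # subtracted from the objective with weight beta
print(adv, kl.mean().item())
```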
DeepSeek-R1 (DeepSeek-AI et al., 2025) shows that GRPO achieves stable large-scale LLM policy optimization that incentivizes the long CoT pattern in large-scale LLMs.
Distillation
The following concepts are based on (Hinton et al., 2015). They are currently being written.
Gradient Accumulation
Definition 3 (Gradient Accumulation). Below is an explanation of how to adjust the gradient accumulation steps when the number of GPUs changes. The goal is to keep the Global Batch Size constant. The formula for it is:
$$\footnotesize \text{Global Batch Size} = (\text{Number of GPUs}) \times (\text{Per-Device Batch Size}) \times (\text{Gradient Accumulation Steps}) $$
Derivation: The goal is to show how to adjust the parameters so that the Global Batch Size stays constant when the number of GPUs changes. Suppose we use the subscript `old` for the original values and `new` for the new values. The principle is:$$ \text{Global Batch Size}_{old} = \text{Global Batch Size}_{new} $$
Therefore:
$$\footnotesize (\text{GPUs}_{old} \times \text{Batch Size}_{old} \times \text{Accumulation}_{old}) = (\text{GPUs}_{new} \times \text{Batch Size}_{new} \times \text{Accumulation}_{new}) $$
To find the new required number of gradient accumulation steps, the formula is rearranged as follows:
$$ \text{Accumulation}_{new} = \frac{(\text{GPUs}_{old} \times \text{Batch Size}_{old} \times \text{Accumulation}_{old})}{(\text{GPUs}_{new} \times \text{Batch Size}_{new})} $$
This shows how the gradient accumulation steps are affected by the number of GPUs. ◼
Example 1 (Gradient Accumulation). Suppose the following values are given:
- GPUs_old: 8
- Batch Size_old: 16
- Accumulation_old: 4
- GPUs_new: 1
- Batch Size_new: 16 (this is limited by the GPU’s memory)
Plugging these into the formula to find the new gradient accumulation steps:
$$ \text{Accumulation}_{new} = \frac{(8 \times 16 \times 4)}{(1 \times 16)} = \frac{512}{16} = 32 $$
This is why `gradient_accumulation_steps` should be updated to 32 in the configuration file.
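The same rearrangement can be captured in a small helper, so the configuration update is computed (and sanity-checked for divisibility) rather than done by hand; the function name is my own.

```python
def new_accumulation_steps(gpus_old, batch_old, accum_old, gpus_new, batch_new):
    """Gradient accumulation steps that keep the Global Batch Size unchanged."""
    global_batch = gpus_old * batch_old * accum_old
    steps, remainder = divmod(global_batch, gpus_new * batch_new)
    if remainder:
        raise ValueError("Global batch size is not divisible by the new per-step batch.")
    return steps

# Example 1 above: moving from 8 GPUs to 1 GPU at the same per-device batch size.
print(new_accumulation_steps(gpus_old=8, batch_old=16, accum_old=4,
                             gpus_new=1, batch_new=16))  # -> 32
```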
LoRA Theory
Local development leverages LoRA to reduce the number of trainable parameters, which makes debugging and iterating on the training pipeline practical on modest hardware.
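For reference, here is a minimal LoRA sketch: the base projection is frozen and only the low-rank update $\frac{\alpha}{r} B A$ is trained. This is an illustrative re-implementation rather than the PEFT library's API, and the values of $r$ and $\alpha$ are placeholders.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base projection W plus a trainable low-rank update (alpha / r) * B A."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                   # only the adapter is trained

        # A is small random, B is zero, so the wrapped layer starts identical to the base.
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.lora_A.T @ self.lora_B.T)

# Toy usage: only ~2 * r * d adapter parameters receive gradients.
layer = LoRALinear(nn.Linear(512, 512), r=8)
out = layer(torch.randn(2, 512))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(out.shape, trainable)
```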
Citation
If you found this post useful for your work, please consider citing it as:
Razvan Florian Vasile. (July 2025). "LinAlgZero: Theory". Atom Blog. Retrieved from https://atomwalk12.github.io/posts/lingalgzero/theory/.
or
@misc{vasile2025linalgzero,
title = "LinAlgZero: Theory",
author = "Razvan Florian Vasile",
note = "Personal blog",
year = "2025",
month = "July",
url = "https://atomwalk12.github.io/posts/lingalgzero/theory/"
}
References
- [1] DeepSeek-AI et al. (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. https://doi.org/10.48550/arXiv.2501.12948
- [2] Zhang Chong, et al. (2025). 100 Days After DeepSeek-R1: A Survey on Replication Studies and More Directions for Reasoning Language Models. https://doi.org/10.48550/arXiv.2505.00551
- [3] Hinton Geoffrey, et al. (2015). Distilling the Knowledge in a Neural Network. https://doi.org/10.48550/arXiv.1503.02531
- [4] Schulman John, et al. (2018). High-Dimensional Continuous Control Using Generalized Advantage Estimation. https://doi.org/10.48550/arXiv.1506.02438
- [5] Team Kimi, et al. (2025). Kimi k1.5: Scaling Reinforcement Learning with LLMs. https://doi.org/10.48550/arXiv.2501.12599
- [6] Schulman John. (2020). Approximating KL Divergence. Retrieved from http://joschu.net/blog/kl-approx.html
- [7] Hu Jingcheng, et al. (2025). Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model. https://doi.org/10.48550/ARXIV.2503.24290
- [8] Schulman John, et al. (2017). Proximal Policy Optimization Algorithms. https://doi.org/10.48550/arXiv.1707.06347
- [9] Williams Ronald J.. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3). 229–256. https://doi.org/10.1007/BF00992696
- [10] Ouyang Long, et al. (2022). Training language models to follow instructions with human feedback. ArXiv. Retrieved from https://www.semanticscholar.org/paper/Training-language-models-to-follow-instructions-Ouyang-Wu/d766bffc357127e0dc86dd69561d5aeb520d6f4c
- [11] Sutton Richard & Barto Andrew. (2018). Reinforcement Learning: An Introduction (2nd ed.). Retrieved from http://incompleteideas.net/book/the-book-2nd.html