The following notes are adapted from [2]. I will revise them as my understanding improves.

Supervised Fine-tuning

Definition 1 (SFT Loss).  Consider a dataset $\mathcal{D}_{\text{SFT}} \triangleq \left\{ (q_i, c_i) \right\}_{i=1}^{|\mathcal{D}_{\text{SFT}}|}$, where each sample $(q_i, c_i)$ consists of a question $q_i$ and a long CoT $c_i$; the long CoT can be further decomposed into a complex intermediate rationale followed by a final answer. SFT updates the parameters of the policy model $\pi_{\theta}$ by minimizing the negative log-likelihood loss:

$$ \begin{aligned} \mathcal{L}_{\text{SFT}}(\theta) \triangleq - \mathbb{E}_{(q, c) \sim \mathcal{D}_{\text{SFT}}} \left[ \log \pi_{\theta}(c \mid q) \right], \end{aligned} $$

where $\pi_{\theta}(c \mid q)$ denotes the probability assigned by the policy to the CoT response $c$ conditioned on the question $q$. This objective encourages the model to imitate the supervised demonstrations by maximizing the likelihood of the reference completions.

If each sample is equally likely to be drawn, the expectation reduces to a simple average over all training examples in the dataset:

$$ \mathbb{E}_{(q,c) \sim \mathcal{D}_{\text{SFT}}}[\log \pi_\theta(c | q)] = \frac{1}{|\mathcal{D}_{\text{SFT}}|} \sum_{i=1}^{|\mathcal{D}_{\text{SFT}}|} \log \pi_\theta(c_i | q_i) $$
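
To make Definition 1 concrete, below is a minimal PyTorch sketch of the token-level negative log-likelihood. The shapes, the `-100` masking convention for prompt tokens, and the toy tensors are illustrative assumptions rather than part of the survey.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, labels: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of the CoT c given the question q.

    logits: (batch, seq_len, vocab) from the policy model pi_theta.
    labels: (batch, seq_len) token ids; question/prompt positions are set to
            -100 so that only the CoT tokens contribute to the loss.
    """
    # Shift so that position t predicts token t+1 (standard causal LM setup).
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,  # masked question tokens are ignored
    )

# Toy usage with random tensors standing in for a real model's output.
vocab, seq = 32, 10
logits = torch.randn(2, seq, vocab)
labels = torch.randint(0, vocab, (2, seq))
labels[:, :4] = -100  # pretend the first 4 tokens are the question q
print(sft_loss(labels=labels, logits=logits))
```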

LLM Policy Optimization

Recent studies have introduced a post-training paradigm that enhances LLMs' reasoning capabilities through RL-based training. In this framework, the LLM's answer generation process for each query is formulated as an answer sampling policy, and our objective is to optimize this LLM policy to maximize the expected reward of the generated responses. According to [1, 5, 7], large-scale RL-based LLM policy optimization enables the base LLM to achieve a steady improvement in reasoning accuracy while also exhibiting the emergence of long-chain reasoning in its chain-of-thought.

Definition 2 (LLM Policy Optimization).  Suppose each reasoning data pair $(q,a)$ is sampled i.i.d. from an underlying distribution $\mathcal{D}$, where each $q$ is a query and $a$ is the corresponding ground-truth answer. Let $\pi_{\theta}(\cdot | \cdot)$ be the target LLM policy parameterized by $\theta$. The expected reward of the LLM on a sample $(q,a)$ is $\mathbb{E}_{o\sim \pi_{\theta}(\cdot | q)} [r(o, a)]$, where $o$ is an LLM-generated response to $q$, and $r(\cdot,\cdot)$ is a predefined reward function that quantifies whether the response $o$ yields $a$. The objective of RL-based fine-tuning is to maximize the expected reward over the data distribution, i.e.,

$$ \begin{align*} \max_{\theta} J(\pi_{\theta}) \triangleq \mathbb{E}_{(q,a)\sim \mathcal{D}} \mathbb{E}_{o\sim \pi_{\theta}(\cdot|q)} [r(o, a)]. \end{align*} $$
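
For verifiable-reward settings, $r(\cdot,\cdot)$ is typically a simple rule-based check. Below is a toy sketch that assumes responses end with a LaTeX `\boxed{...}` answer and that exact string match is an acceptable test; real implementations use more robust answer extraction and equivalence checking.

```python
import re

def reward(response: str, answer: str) -> float:
    """Toy verifiable reward r(o, a): 1.0 if the response's final
    \\boxed{...} expression matches the ground-truth answer, else 0.0."""
    matches = re.findall(r"\\boxed\{([^}]*)\}", response)
    if not matches:
        return 0.0
    predicted = matches[-1].strip()
    return 1.0 if predicted == answer.strip() else 0.0

print(reward(r"... so the result is \boxed{42}.", "42"))  # 1.0
print(reward("no final answer given", "42"))              # 0.0
```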

A straightforward approach to maximizing $J(\pi_{\theta})$ is to gradually update the LLM's parameters $\theta$ along the policy gradient direction $\nabla_{\theta} J(\pi_{\theta})$. However, since $ \nabla_{\theta} \mathbb{E}_{o \sim \pi_{\theta}(\cdot | q)} r(o, a) $ is the gradient of an expectation whose sampling distribution itself depends on $ \pi_\theta $, $ \nabla_{\theta} J(\pi_{\theta}) $ cannot be estimated directly via standard Monte Carlo sampling. Fortunately, the RL community has developed two powerful policy gradient estimators, REINFORCE [9] and Importance Sampling [11]:

$$\small \begin{align*} \nabla_{\theta} \mathbb{E}_{o\sim \pi_{\theta}(\cdot|q)} r(o, a) = \begin{cases} \mathbb{E}_{o \sim \pi_{\theta}(\cdot|q)} \left[ \nabla_{\theta} \log \pi_{\theta}(o|q) \cdot r(o, a) \right]\ &\text{(REINFORCE)}, \\ \mathbb{E}_{o \sim \pi_{\theta'}(\cdot|q)} \left[ \nabla_{\theta} \left( \frac{\pi_{\theta}(o|q)}{\pi_{\theta'}(o|q)} \right) \cdot r(o, a) \right]\ &\text{(Importance Sampling)}, \end{cases} \end{align*} $$

where $\pi_{\theta'}$ is any parameter-frozen LLM policy. Hence, the policy gradient $\nabla_{\theta} J(\pi_\theta)$ can be effectively approximated using standard Monte Carlo sampling: for each data pair $(q,a)$, we independently generate $G$ responses to $q$, denoted by $\{o_i\}_{i=1}^G$, using the current LLM $\pi_\theta$ or the frozen LLM $\pi_{\theta'}$, and then approximate the policy gradient estimators by

$$\footnotesize \begin{align*} \nabla_{\theta} J(\pi_\theta) = \begin{cases} \mathbb{E}_{(q,a)\sim \mathcal{D},\{o_i\}_{i=1}^G\sim \pi_{\theta}(\cdot|q)} \left[ \frac{1}{G} \sum_{i=1}^G \nabla_{\theta} \log \pi_{\theta}(o_i|q) \cdot r(o_i, a) \right] &\text{(REINFORCE)}, \\ \mathbb{E}_{(q,a)\sim \mathcal{D},\{o_i\}_{i=1}^G\sim \pi_{\theta'}(\cdot|q)} \left[ \frac{1}{G} \sum_{i=1}^G \nabla_{\theta} \left( \frac{\pi_{\theta}(o_i|q)}{\pi_{\theta'}(o_i|q)} \right) \cdot r(o_i, a) \right] & \text{(Importance Sampling)}, \end{cases} \end{align*} $$

For each query $q$, the procedure of generating $G$ independent responses $\{o_i\}_{i=1}^G$ from $\pi_{\theta}(\cdot|q)$ is called the "rollout phase". During this phase, the LLM policy explores a large space of response samples of varying quality. Then $\theta$ is updated to increase the likelihood $ \pi_{\theta}(o_i|q) $ where $ r(o_i, a)$ is large, thereby improving the likelihood of generating responses with high rewards. Specifically, REINFORCE is an on-policy method that requires generating new rollouts using the latest LLM policy $ \pi_{\theta} $. In contrast, the importance sampling estimator can be implemented in an off-policy manner with improved sampling efficiency, as it can reuse past rollouts generated from $\pi_{\theta'}$ by storing the corresponding probability terms $ \pi_{\theta'}(o_i | q)$. A common choice is to implement $\pi_{\theta'}$ as $\pi_{\theta_{\mathrm{old}}}$, a past snapshot of the target LLM $\pi_\theta$, which is updated periodically.
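
In practice, the REINFORCE estimator above is implemented as a surrogate loss whose gradient matches the estimator. The sketch below assumes the per-token log-probabilities of each rollout are already available as a tensor; it is an illustration under these assumptions, not the survey's reference implementation.

```python
import torch

def reinforce_surrogate(logprobs: torch.Tensor, rewards: torch.Tensor,
                        mask: torch.Tensor) -> torch.Tensor:
    """Monte Carlo REINFORCE surrogate for one query q.

    logprobs: (G, T) per-token log-probabilities of each rollout under pi_theta.
    rewards:  (G,)   scalar rewards r(o_i, a), one per rollout.
    mask:     (G, T) 1 for response tokens, 0 for padding.

    Minimizing this loss follows the REINFORCE estimator above: its gradient is
    -(1/G) * sum_i grad log pi_theta(o_i|q) * r(o_i, a).
    """
    seq_logprob = (logprobs * mask).sum(dim=-1)   # log pi_theta(o_i | q), summed over tokens
    return -(seq_logprob * rewards).mean()        # average over the G rollouts

# Toy usage with stand-in tensors (a real setup recomputes logprobs from the model).
G, T = 4, 6
logprobs = torch.randn(G, T, requires_grad=True)  # placeholder per-token log-probs
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
mask = torch.ones(G, T)
loss = reinforce_surrogate(logprobs, rewards, mask)
loss.backward()                                   # gradients flow into logprobs
```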

In practice, the reward signals $\{r(o_i, a)\}_{i=1}^G$ are highly sparse, leading to high variance in the rollout phase and in the policy gradient estimate. To mitigate these issues, various techniques have been developed to stabilize the policy gradient estimators above. These techniques generally fall into three categories: 1) reducing sampling variance by reward normalization or actor-critic advantage estimation, 2) stabilizing parameter updates by clipping the importance sampling weight $ \pi_\theta(o_i|q) / \pi_{\theta_{\mathrm{old}}}(o_i|q)$, and 3) constraining policy shifts by penalizing the KL-divergence $\mathrm{KL}(\pi_{\theta} | \pi_{\mathrm{ref}})$ between the current LLM policy $\pi_{\theta}$ and a fixed reference LLM policy $\pi_{\mathrm{ref}}$.

RLVR

PPO

Since its introduction in [8], Proximal Policy Optimization (PPO) has become one of the most popular actor-critic RL algorithms for LLM policy optimization [7, 10]. In addition to the target LLM policy $\pi_\theta$, which serves as the actor model, PPO introduces a critic model $V_{\phi}$, another LLM trained to estimate the value of the responses generated by the actor $\pi_\theta$. Specifically, the PPO objective is

$$\small \begin{equation*} \begin{aligned} J_{\mathrm{PPO}}(\pi_\theta) \triangleq &\ \mathbb{E}_{(q, a) \sim \mathcal{D},\left\{o_i\right\}_{i=1}^G \sim \pi_{\theta_{\text{old}}}(\cdot \mid q)} \\ \Bigg[ & \frac{1}{G} \sum_{i=1}^G \frac{1}{\left| o_i \right|} \sum_{t=1}^{\left|o_i\right|} \left( \min \Big( r_{i, t}(\theta) \textcolor{red}{\hat{A}_{i, t}(\phi)}, \textcolor{red}{\operatorname{clip}}\left(r_{i, t}(\theta), 1 - \varepsilon, 1 + \varepsilon\right) \textcolor{red}{\hat{A}_{i, t}(\phi)} \Big) \right) \Bigg], \end{aligned} \end{equation*} $$

where $r_{i,t}(\theta)\triangleq \pi_{\theta}(o_{i,t}|q,o_{i, < t})/ \pi_{\theta_{\mathrm{old}}}(o_{i,t}|q,o_{i, < t})$ denotes the likelihood ratio between the current LLM policy $\pi_\theta$ and the past LLM policy $\pi_{\theta_{\mathrm{old}}}$ at the $t$-th token prediction step; $\hat{A}_{i, t}(\phi)$ denotes the Generalized Advantage Estimator (GAE) [4] computed from the values $V_{\phi}(q,o_{i, < t})$ predicted by the critic, which estimate the quality of each response generation state. The critic $V_{\phi}$ is trained jointly with $\pi_{\theta}$ to predict the value of the responses generated by $\pi_{\theta}$. In practice [7], GAE is observed to be a more robust response quality estimator than the raw reward $r(o_i, a)$, leading to more stable LLM policy optimization.
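
A minimal PyTorch sketch of the clipped surrogate above, assuming per-token log-probabilities and GAE advantages have already been computed (GAE itself and the critic update are omitted); the names and shapes are illustrative.

```python
import torch

def ppo_clipped_objective(logp_new: torch.Tensor, logp_old: torch.Tensor,
                          advantages: torch.Tensor, mask: torch.Tensor,
                          eps: float = 0.2) -> torch.Tensor:
    """Token-level PPO clipped surrogate (to be maximized).

    logp_new, logp_old: (G, T) per-token log-probs under pi_theta and pi_theta_old.
    advantages:         (G, T) GAE estimates A_hat_{i,t} from the critic V_phi.
    mask:               (G, T) 1 for valid response tokens, 0 for padding.
    """
    ratio = torch.exp(logp_new - logp_old)                        # r_{i,t}(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    per_token = torch.minimum(unclipped, clipped) * mask
    # 1/|o_i| average over tokens, then 1/G average over the rollout group.
    return (per_token.sum(dim=-1) / mask.sum(dim=-1)).mean()

# Toy usage: maximize the objective by minimizing its negative.
G, T = 4, 6
logp_new = torch.randn(G, T, requires_grad=True)
logp_old = logp_new.detach() + 0.1 * torch.randn(G, T)
advantages = torch.randn(G, T)
mask = torch.ones(G, T)
(-ppo_clipped_objective(logp_new, logp_old, advantages, mask)).backward()
```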

GRPO

Group Relative Policy Optimization (GRPO) was first proposed in [1] as an effective and efficient variant of PPO. In particular, GRPO discards the critic model and the GAE computation of PPO to reduce memory consumption and improve efficiency. To reduce reward sampling variance, GRPO normalizes the rewards within a group of $G$ rollouts. In addition to clipping the likelihood ratio terms, GRPO introduces a KL-divergence penalty to ensure that $\pi_\theta$ is not driven far away from the initial SFT LLM. Specifically, the GRPO objective is

$$\footnotesize \begin{equation*} \begin{aligned} J_{\mathrm{GRPO}}(\pi_\theta) \triangleq &\ \mathbb{E}_{(q, a) \sim \mathcal{D},\left\{o_i\right\}_{i=1}^G \sim \pi_{\theta_{\text{old}}}(\cdot \mid q)} \\ \Bigg[ & \frac{1}{G} \sum_{i=1}^G \frac{1}{\left| o_i \right|} \sum_{t=1}^{\left|o_i\right|} \left( \min \Big( r_{i, t}(\theta) \textcolor{red}{\hat{A}_{i, t}}, \operatorname{clip}\left(r_{i, t}(\theta), 1 - \varepsilon, 1 + \varepsilon\right) \textcolor{red}{\hat{A}_{i, t}} \Big) - \textcolor{red}{\beta\, \mathrm{KL}(\pi_{\theta}|\pi_{\mathrm{ref}})_{i,t}} \right) \Bigg], \end{aligned} \end{equation*} $$

where $\hat{A}_{i,t} \triangleq \left(r(o_i,a) - \mathrm{mean}(\{r(o_j,a)\}_{j=1}^G)\right)/\mathrm{std}(\{r(o_j,a)\}_{j=1}^G)$ denotes the group-relative advantage, and $\mathbf{r} \triangleq \{r(o_i,a)\}_{i=1}^G$ denotes the rewards of the response group corresponding to each sample $(q,a)$. GRPO also incorporates the K3 KL-divergence estimator [6]:

$$ \begin{align*} \mathrm{KL}(\pi_{\theta}|\pi_{\mathrm{ref}})_{i,t} \triangleq \frac{\pi_{\mathrm{ref}}(o_{i,t}|q,o_{i, < t})}{\pi_{\theta}(o_{i,t}|q,o_{i, < t})} - \log \frac{\pi_{\mathrm{ref}}(o_{i,t}|q,o_{i, < t})}{\pi_{\theta}(o_{i,t}|q,o_{i, < t})} - 1. \end{align*} $$

DeepSeek-R1 [1] shows that GRPO achieves stable large-scale LLM policy optimization that incentivizes the long CoT pattern in large-scale LLMs.
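
The two GRPO-specific ingredients, the group-relative advantage and the K3 KL estimator, are simple enough to sketch directly. The snippet below is an illustration with toy tensors, assuming per-token log-probabilities under $\pi_\theta$ and $\pi_{\mathrm{ref}}$ are available.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Group-relative advantage: normalize rewards within a group of G rollouts."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def k3_kl(logp_theta: torch.Tensor, logp_ref: torch.Tensor) -> torch.Tensor:
    """Per-token K3 estimator: pi_ref/pi_theta - log(pi_ref/pi_theta) - 1 (always >= 0)."""
    log_ratio = logp_ref - logp_theta
    return torch.exp(log_ratio) - log_ratio - 1.0

# Toy usage: a group of G = 4 rollouts with binary verifiable rewards.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
print(group_relative_advantages(rewards))               # shared by every token of rollout o_i
print(k3_kl(torch.tensor([-1.2]), torch.tensor([-1.0])))  # small, non-negative penalty
```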

Distillation

The following concepts are based on [3]. They are currently being written.

Gradient Accumulation

Definition 3 (Gradient Accumulation).  Below is an explanation of how to adjust the gradient accumulation steps when changing the number of GPUs. The goal is to keep the Global Batch Size constant. The formula for it is:

$$\footnotesize \text{Global Batch Size} = (\text{Number of GPUs}) \times (\text{Per-Device Batch Size}) \times (\text{Gradient Accumulation Steps}) $$

Derivation: We want to keep the Global Batch Size constant when the number of GPUs changes. Using the subscript _old for the original values and _new for the new values, the principle is:

$$ \text{Global Batch Size}_{old} = \text{Global Batch Size}_{new} $$

Therefore:

$$\footnotesize (\text{GPUs}_{old} \times \text{Batch Size}_{old} \times \text{Accumulation}_{old}) = (\text{GPUs}_{new} \times \text{Batch Size}_{new} \times \text{Accumulation}_{new}) $$

To find the new required number of gradient accumulation steps, the formula is rearranged as follows:

$$ \text{Accumulation}_{new} = \frac{(\text{GPUs}_{old} \times \text{Batch Size}_{old} \times \text{Accumulation}_{old})}{(\text{GPUs}_{new} \times \text{Batch Size}_{new})} $$

This shows how the gradient accumulation steps are affected by the number of GPUs.

Example 1 (Gradient Accumulation). Suppose the following values are given:

  • GPUs_old: 8
  • Batch Size_old: 16
  • Accumulation_old: 4
  • GPUs_new: 1
  • Batch Size_new: 16 (this is limited by the GPU’s memory)

Plugging these into the formula to find the new gradient accumulation steps:

$$ \text{Accumulation}_{new} = \frac{(8 \times 16 \times 4)}{(1 \times 16)} = \frac{512}{16} = 32 $$

This is why the gradient_accumulation_steps should be updated to 32 in the configuration file.
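
The same calculation as a small Python helper; the function name is hypothetical and not tied to any particular training framework.

```python
def new_accumulation_steps(gpus_old: int, batch_old: int, accum_old: int,
                           gpus_new: int, batch_new: int) -> int:
    """Return the gradient accumulation steps that keep the Global Batch Size constant."""
    global_batch = gpus_old * batch_old * accum_old
    steps, remainder = divmod(global_batch, gpus_new * batch_new)
    if remainder:
        raise ValueError("Global batch size is not divisible by the new per-step batch.")
    return steps

# Reproduces Example 1: 8 GPUs -> 1 GPU, per-device batch size 16.
print(new_accumulation_steps(8, 16, 4, 1, 16))  # 32
```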

LoRA Theory

Local development leverages LoRA (Low-Rank Adaptation) to reduce the number of trainable parameters, which makes local debugging and development practical.
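
To illustrate why LoRA shrinks the trainable parameter count, here is a minimal PyTorch sketch of a LoRA-augmented linear layer. The class name, rank, and scaling values are illustrative assumptions; in practice a library such as PEFT would be used.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight W plus a trainable low-rank update (alpha / r) * B @ A."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                                # only A and B are trained
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, r))   # zero init: no change at start
        self.scaling = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scaling * (x @ self.A.T @ self.B.T)

layer = LoRALinear(nn.Linear(64, 64))
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
print(trainable)  # 2 * 8 * 64 = 1024 trainable parameters vs 64*64 + 64 in the frozen base layer
```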

Citation

If you found this post useful for your work, please consider citing it as:

Razvan Florian Vasile. (July 2025). "LinAlgZero: Theory". Atom Blog. Retrieved from https://atomwalk12.github.io/posts/lingalgzero/theory/.

or

@misc{vasile2025linalgzero,
    title = "LinAlgZero: Theory",
    author = "Razvan Florian Vasile",
    note = "Personal blog",
    year = "2025",
    month = "July",
    url = "https://atomwalk12.github.io/posts/lingalgzero/theory/"
}

References

  1. [1] DeepSeek-AI (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. https://doi.org/10.48550/arXiv.2501.12948
  2. [2] Zhang, C., et al. (2025). 100 Days After DeepSeek-R1: A Survey on Replication Studies and More Directions for Reasoning Language Models. https://doi.org/10.48550/arXiv.2505.00551
  3. [3] Hinton, G., Vinyals, O., & Dean, J. (2015). Distilling the Knowledge in a Neural Network. https://doi.org/10.48550/arXiv.1503.02531
  4. [4] Schulman, J., Moritz, P., Levine, S., Jordan, M., & Abbeel, P. (2015). High-Dimensional Continuous Control Using Generalized Advantage Estimation. https://doi.org/10.48550/arXiv.1506.02438
  5. [5] Kimi Team (2025). Kimi k1.5: Scaling Reinforcement Learning with LLMs. https://doi.org/10.48550/arXiv.2501.12599
  6. [6] Schulman, J. (2020). Approximating KL Divergence. Retrieved from http://joschu.net/blog/kl-approx.html
  7. [7] Hu, J., et al. (2025). Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model. https://doi.org/10.48550/arXiv.2503.24290
  8. [8] Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. https://doi.org/10.48550/arXiv.1707.06347
  9. [9] Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3), 229–256. https://doi.org/10.1007/BF00992696
  10. [10] Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. Retrieved from https://www.semanticscholar.org/paper/Training-language-models-to-follow-instructions-Ouyang-Wu/d766bffc357127e0dc86dd69561d5aeb520d6f4c
  11. [11] Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). Retrieved from http://incompleteideas.net/book/the-book-2nd.html