Theory

Supervised Fine-tuning

Definition 1 (SFT Loss).  Given a dataset $\mathcal{D}_{\text{SFT}} \triangleq \left\{ (q_i, c_i) \right\}_{i=1}^{|\mathcal{D}_{\text{SFT}}|}$, where each sample $(q_i, c_i)$ consists of a question $q_i$ and a long CoT $c_i$ (which can be further decomposed into an intermediate rationale followed by a final answer), SFT updates the parameters of the policy model $\pi_{\theta}$ by minimizing the negative log-likelihood loss:

$$ \begin{aligned} \mathcal{L}_{\text{SFT}}(\theta) \triangleq - \mathbb{E}_{(q, c) \sim \mathcal{D}_{\text{SFT}}} \left[ \log \pi_{\theta}(c \mid q) \right], \end{aligned} $$

where $\pi_{\theta}(c \mid q)$ denotes the probability assigned by the policy to the CoT response $c$ conditioned on the question $q$. This objective encourages the model to imitate the supervised demonstrations by maximizing the likelihood of the reference completions.

For example, if each sample is equally likely to be selected, the expectation operator simply averages the log-likelihood across all training examples in the dataset:

$$ \mathbb{E}_{(q,c) \sim \mathcal{D}_{\text{SFT}}}[\log \pi_\theta(c | q)] = \frac{1}{|\mathcal{D}_{\text{SFT}}|} \sum_{i=1}^{|\mathcal{D}_{\text{SFT}}|} \log \pi_\theta(c_i | q_i) $$
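As a concrete illustration, here is a minimal PyTorch sketch of this loss. It assumes a Hugging Face-style causal LM whose output exposes `.logits`, and a `labels` tensor in which question and padding positions are masked with `-100`; these conventions are assumptions for illustration, not part of the definition above.

```python
import torch.nn.functional as F

def sft_loss(model, input_ids, labels):
    """Average negative log-likelihood of the CoT tokens given the question.

    input_ids: concatenated (question, CoT) token ids, shape (B, T).
    labels:    copy of input_ids with question/padding positions set to -100,
               so that only CoT tokens contribute to the loss.
    """
    logits = model(input_ids=input_ids).logits             # (B, T, V)
    # Shift so that token t is predicted from the tokens before it.
    shift_logits = logits[:, :-1, :].contiguous()
    shift_labels = labels[:, 1:].contiguous()
    return F.cross_entropy(
        shift_logits.view(-1, shift_logits.size(-1)),
        shift_labels.view(-1),
        ignore_index=-100,                                  # skip masked positions
    )
```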

LLM Policy Optimization

Recent studies have introduced a groundbreaking post-training paradigm that enhances LLMs' reasoning capabilities through RL-based training. In this framework, the LLM's answer generation process for each query is formulated as an answer sampling policy, and our objective is to optimize this LLM policy to maximize the expected reward of the generated responses. According to recent reports (DeepSeek-AI, 2025; Hu et al., 2025; Kimi Team, 2025), large-scale RL-based LLM policy optimization enables the base LLM to achieve a steady improvement in reasoning accuracy while also exhibiting the emergence of long-chain reasoning in its chain-of-thought.

Definition 2 (LLM Policy Optimization).  Suppose each reasoning data pair $(q,a)$ is sampled i.i.d. from an underlying distribution $\mathcal{D}$, where each $q$ is a query and $a$ is the corresponding ground-truth answer. Let $\pi_{\theta}(\cdot | \cdot)$ be the target LLM policy parameterized by $\theta$. The expected reward of the LLM on a sample $(q,a)$ is $\mathbb{E}_{o\sim \pi_{\theta}(\cdot | q)} [r(o, a)]$, where $o$ is an LLM-generated response to $q$, and $r(\cdot,\cdot)$ is a predefined reward function that quantifies whether the response $o$ yields the answer $a$. The objective of RL-based fine-tuning is to maximize the expected reward over the data distribution, i.e.,

$$ \begin{align*} \max_{\theta} J(\pi_{\theta}) \triangleq \mathbb{E}_{(q,a)\sim \mathcal{D}} \mathbb{E}_{o\sim \pi_{\theta}(\cdot|q)} [r(o, a)]. \end{align*} $$
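To make $r(\cdot,\cdot)$ concrete, a common choice for math-style tasks is a rule-based verifier that compares the final answer in the response against the ground truth. The sketch below is one such verifier; the `\boxed{...}` convention and the exact-match criterion are illustrative assumptions rather than a prescribed reward design.

```python
import re

def extract_final_answer(response: str):
    """Illustrative helper: return the content of the last \\boxed{...} span, if any."""
    matches = re.findall(r"\\boxed\{([^{}]*)\}", response)
    return matches[-1].strip() if matches else None

def reward(response: str, ground_truth: str) -> float:
    """Binary reward r(o, a): 1.0 if the extracted answer matches a, else 0.0."""
    predicted = extract_final_answer(response)
    return float(predicted is not None and predicted == ground_truth.strip())
```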

A straightforward approach to maximizing $J(\pi_{\theta})$ is to gradually update the LLM's parameters $\theta$ along the policy gradient direction $\nabla_{\theta} J(\pi_{\theta})$. However, since $ \nabla_{\theta} \mathbb{E}_{o \sim \pi_{\theta}(\cdot | q)} [r(o, a)] $ is the gradient of an expectation whose sampling distribution itself depends on $ \pi_\theta $, $ \nabla_{\theta} J(\pi_{\theta}) $ cannot be approximated directly via standard Monte Carlo sampling. Fortunately, the RL community has developed two powerful policy gradient estimators: REINFORCE (Williams, 1992) and Importance Sampling (Sutton & Barto, 2018):

$$\small \begin{align*} \nabla_{\theta} \mathbb{E}_{o\sim \pi_{\theta}(\cdot|q)} r(o, a) = \begin{cases} \mathbb{E}_{o \sim \pi_{\theta}(\cdot|q)} \left[ \nabla_{\theta} \log \pi_{\theta}(o|q) \cdot r(o, a) \right]\ &\text{(REINFORCE)}, \\ \mathbb{E}_{o \sim \pi_{\theta'}(\cdot|q)} \left[ \nabla_{\theta} \left( \frac{\pi_{\theta}(o|q)}{\pi_{\theta'}(o|q)} \right) \cdot r(o, a) \right]\ &\text{(Importance Sampling)}, \end{cases} \end{align*} $$

where $\pi_{\theta'}$ is any parameter-frozen LLM policy. Hence, the policy gradient $\nabla_{\theta} J(\pi_\theta)$ can be effectively approximated by standard Monte Carlo sampling: for each data pair $(q,a)$, we independently generate $G$ responses to $q$, denoted by $\{o_i\}_{i=1}^G$, using the current LLM $\pi_\theta$ or the frozen LLM $\pi_{\theta'}$, and then estimate the policy gradient as

$$\small \begin{align*} \nabla_{\theta} J(\pi_\theta) = \begin{cases} \mathbb{E}_{(q,a)\sim \mathcal{D},\{o_i\}_{i=1}^G\sim \pi_{\theta}(\cdot|q)} \left[ \frac{1}{G} \sum_{i=1}^G \nabla_{\theta} \log \pi_{\theta}(o_i|q) \cdot r(o_i, a) \right] &\text{(REINFORCE)}, \\ \mathbb{E}_{(q,a)\sim \mathcal{D},\{o_i\}_{i=1}^G\sim \pi_{\theta'}(\cdot|q)} \left[ \frac{1}{G} \sum_{i=1}^G \nabla_{\theta} \left( \frac{\pi_{\theta}(o_i|q)}{\pi_{\theta'}(o_i|q)} \right) \cdot r(o_i, a) \right] & \text{(Importance Sampling)}. \end{cases} \end{align*} $$

For each query $q$, the procedure of generating $G$ independent responses $\{o_{i}\}_{i=1}^G$ from $\pi_{\theta}(\cdot|q)$ is called the "rollout phase". During this phase, the LLM policy explores a large number of response samples of varying quality. Then $\theta$ is updated to increase the likelihood $ \pi_{\theta}(o_i|q) $ where $ r(o_i, a)$ is large, thereby improving the likelihood of generating responses with high rewards. Specifically, REINFORCE is an on-policy method that requires generating new rollouts using the latest LLM policy $ \pi_{\theta} $. In contrast, the importance sampling estimator can be implemented in an off-policy manner with improved sampling efficiency, as it can reuse past rollouts generated from $\pi_{\theta'}$ by storing the corresponding probability terms $ \pi_{\theta'}(o_i | q)$. A common choice is to implement $\pi_{\theta'}$ as $\pi_{\theta_{\mathrm{old}}}$, a past snapshot of the target LLM $\pi_\theta$ that is updated periodically.
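In code, both estimators are usually implemented as surrogate losses whose gradients coincide with the expressions above. A minimal PyTorch sketch, assuming the per-response sequence log-probabilities have already been computed, might look as follows:

```python
import torch

def reinforce_loss(logprobs: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """On-policy REINFORCE surrogate for a group of G rollouts.

    logprobs: log pi_theta(o_i | q), shape (G,), with gradients attached.
    rewards:  r(o_i, a), shape (G,), treated as constants.
    Minimizing this loss follows the REINFORCE gradient estimate.
    """
    return -(logprobs * rewards.detach()).mean()

def importance_sampling_loss(logprobs: torch.Tensor,
                             old_logprobs: torch.Tensor,
                             rewards: torch.Tensor) -> torch.Tensor:
    """Off-policy surrogate: rollouts were generated by a frozen pi_theta_old,
    whose log-probabilities were stored at rollout time (no gradient)."""
    ratio = torch.exp(logprobs - old_logprobs.detach())    # pi_theta / pi_theta_old
    return -(ratio * rewards.detach()).mean()
```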

In practice, the reward signals $\{r(o_i, a)\}_{i=1}^G$ are highly sparse, leading to high variance in the rollout phase and in the policy gradient estimates. To mitigate these issues, various techniques have been developed to stabilize the policy gradient estimators above. These techniques generally fall into three categories: 1) reducing sampling variance by reward normalization or actor-critic advantage estimation, 2) stabilizing parameter updates by clipping the importance sampling weight $ \pi_\theta(o_i|q) / \pi_{\theta_{\mathrm{old}}}(o_i|q)$, and 3) constraining policy shifts by penalizing the KL-divergence $\mathrm{KL}(\pi_{\theta} | \pi_{\mathrm{ref}})$ between the current LLM policy $\pi_{\theta}$ and a fixed reference LLM policy $\pi_{\mathrm{ref}}$.

PPO

Since its introduction by Schulman et al. (2017), Proximal Policy Optimization (PPO) has become one of the most popular actor-critic RL algorithms for LLM policy optimization (Ouyang et al., 2022; Hu et al., 2025). In addition to the target LLM policy $\pi_\theta$, which serves as the actor model, PPO introduces a critic model $V_{\phi}$, another LLM trained to estimate the value of the responses generated by the actor LLM $\pi_\theta$. Specifically, the PPO objective is

$$\small \begin{equation*} \begin{aligned} J_{\mathrm{PPO}}(\pi_\theta) \triangleq &\ \mathbb{E}_{(q, a) \sim \mathcal{D},\left\{o_i\right\}_{i=1}^G \sim \pi_{\theta_{\text{old}}}(\cdot \mid q)} \\ \Bigg[ & \frac{1}{G} \sum_{i=1}^G \frac{1}{\left| o_i \right|} \sum_{t=1}^{\left|o_i\right|} \left( \min \Big( r_{i, t}(\theta) \textcolor{red}{\hat{A}_{i, t}(\phi)}, \textcolor{red}{\operatorname{clip}}\left(r_{i, t}(\theta), 1 - \varepsilon, 1 + \varepsilon\right) \textcolor{red}{\hat{A}_{i, t}(\phi)} \Big) \right) \Bigg], \end{aligned} \end{equation*} $$

where $r_{i,t}(\theta)\triangleq \pi_{\theta}(o_{i,t}|q,o_{i, < t})/ \pi_{\theta_{\mathrm{old}}}(o_{i,t}|q,o_{i, < t})$ denotes the likelihood ratio between the current LLM policy $\pi_\theta$ and the past LLM policy $\pi_{\theta_{\mathrm{old}}}$ at the $t$-th token prediction step; $\hat{A}_{i, t}(\phi)$ denotes the Generalized Advantage Estimator (GAE) (Schulman et al., 2015) computed from the estimated value $V_{\phi}(o_{i,t}|q,o_{i, < t})$, which quantifies the quality of each response generation state. $ V_{\phi} $ is trained along with $\pi_{\theta}$ to predict the value of the responses generated by $\pi_{\theta}$. In practice (Hu et al., 2025), GAE is observed to be a more robust response quality estimator than the raw reward $r(o_i, a)$, leading to more stable LLM policy optimization.
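The clipped surrogate above translates almost line by line into code. The sketch below is a per-token PyTorch version, assuming the GAE advantages $\hat{A}_{i,t}(\phi)$ have already been computed from the critic $V_\phi$ (the critic and its value loss are not shown):

```python
import torch

def ppo_clip_loss(logprobs: torch.Tensor,
                  old_logprobs: torch.Tensor,
                  advantages: torch.Tensor,
                  response_mask: torch.Tensor,
                  eps: float = 0.2) -> torch.Tensor:
    """Clipped PPO surrogate, length-normalized per response, averaged over the group.

    All tensors have shape (G, T): per-token log-probabilities under the current
    and old policies, GAE advantages, and a 0/1 mask over response tokens.
    Minimizing this loss maximizes the clipped objective.
    """
    ratio = torch.exp(logprobs - old_logprobs)              # r_{i,t}(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    per_token = torch.min(unclipped, clipped)
    per_response = (per_token * response_mask).sum(-1) / response_mask.sum(-1)
    return -per_response.mean()
```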

GRPO

Group Relative Policy Optimization (GRPO) was first proposed in (DeepSeek-AI, 2025) as an effective and efficient variant of PPO. Specifically, GRPO discards the critic model and the GAE calculation in PPO to improve efficiency and reduce memory consumption. To reduce the reward sampling variance, GRPO normalizes the rewards within a group of $G$ rollouts. In addition to clipping the likelihood ratio terms, GRPO further introduces a KL-divergence penalty to ensure that $\pi_\theta$ is not driven far away from the initial SFT LLM. Specifically, the GRPO objective is

$$\footnotesize \begin{equation*} \begin{aligned} J_{\mathrm{GRPO}}(\pi_\theta) \triangleq &\ \mathbb{E}_{(q, a) \sim \mathcal{D},\left\{o_i\right\}_{i=1}^G \sim \pi_{\theta_{\text{old}}}(\cdot \mid q)} \\ \Bigg[ & \frac{1}{G} \sum_{i=1}^G \frac{1}{\left| o_i \right|} \sum_{t=1}^{\left|o_i\right|} \left( \min \Big( r_{i, t}(\theta) \textcolor{red}{\hat{A}_{i, t}}, \operatorname{clip}\left(r_{i, t}(\theta), 1 - \varepsilon, 1 + \varepsilon\right) \textcolor{red}{\hat{A}_{i, t}} \Big) - \textcolor{red}{\beta\, \mathrm{KL}(\pi_{\theta}|\pi_{\mathrm{ref}})_{i,t}} \right) \Bigg], \end{aligned} \end{equation*} $$

where $\hat{A}_{i,t} \triangleq (r(o_i,a) - \mathrm{mean}(\mathbf{r}))/\mathrm{std}(\mathbf{r})$ denotes the group-relative advantage, and $\mathbf{r} \triangleq \{r(o_j,a)\}_{j=1}^G$ denotes the rewards of the response group corresponding to each sample $(q,a)$. GRPO also incorporates the K3 KL-divergence estimator (Schulman, 2020):

$$ \begin{align*} \mathrm{KL}(\pi_{\theta}|\pi_{\mathrm{ref}})_{i,t} \triangleq \frac{\pi_{\mathrm{ref}}(o_{i,t}|q,o_{i, < t})}{\pi_{\theta}(o_{i,t}|q,o_{i, < t})} - \log \frac{\pi_{\mathrm{ref}}(o_{i,t}|q,o_{i, < t})}{\pi_{\theta}(o_{i,t}|q,o_{i, < t})} - 1. \end{align*} $$
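Concretely, the two red ingredients reduce to a few lines each: the group-relative advantage normalizes the scalar rewards within each rollout group, and the K3 estimator is evaluated per token from the current and reference log-probabilities. The sketch below illustrates both; the small epsilon added to the standard deviation is an assumption for numerical stability. Plugging these into the clipped surrogate above, together with the $-\beta\,\mathrm{KL}$ term, yields the GRPO loss.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    """Normalize the G scalar rewards of one rollout group: (r - mean) / std.

    rewards: shape (G,). The resulting scalar advantage of response i is
    broadcast to all of its tokens.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def k3_kl(logprobs: torch.Tensor, ref_logprobs: torch.Tensor) -> torch.Tensor:
    """Per-token K3 estimator of KL(pi_theta || pi_ref), shape (G, T).

    log_ratio = log(pi_ref / pi_theta); the estimator is exp(log_ratio) - log_ratio - 1.
    """
    log_ratio = ref_logprobs - logprobs
    return torch.exp(log_ratio) - log_ratio - 1.0
```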

DeepSeek-R1 (DeepSeek-AI, 2025) shows that GRPO achieves stable large-scale LLM policy optimization that incentivizes the long CoT pattern in large-scale LLMs.

Citation

If you found this post useful for your work, please consider citing it as:

Razvan Florian Vasile. (June 2025). "LinAlgZero: Theory". Atom Blog. Retrieved from https://atomwalk12.github.io/posts/lingalgzero/theory/.

or

@misc{vasile2025linalgzero,
    title = "LinAlgZero: Theory",
    author = "Razvan Florian Vasile",
    note = "Personal blog",
    year = "2025",
    month = "June",
    url = "https://atomwalk12.github.io/posts/lingalgzero/theory/"
}

References

  1. DeepSeek-AI (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. https://doi.org/10.48550/arXiv.2501.12948
  2. Schulman, J., Moritz, P., Levine, S., Jordan, M., & Abbeel, P. (2015). High-Dimensional Continuous Control Using Generalized Advantage Estimation. https://doi.org/10.48550/arXiv.1506.02438
  3. Kimi Team (2025). Kimi k1.5: Scaling Reinforcement Learning with LLMs. https://doi.org/10.48550/arXiv.2501.12599
  4. Schulman, J. (2020). Approximating KL Divergence. Retrieved from http://joschu.net/blog/kl-approx.html
  5. Hu, J., et al. (2025). Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model. https://doi.org/10.48550/ARXIV.2503.24290
  6. Schulman, J., Wolski, F., Dhariwal, P., Radford, A., & Klimov, O. (2017). Proximal Policy Optimization Algorithms. https://doi.org/10.48550/arXiv.1707.06347
  7. Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3), 229–256. https://doi.org/10.1007/BF00992696
  8. Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. Retrieved from https://www.semanticscholar.org/paper/Training-language-models-to-follow-instructions-Ouyang-Wu/d766bffc357127e0dc86dd69561d5aeb520d6f4c
  9. Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). Retrieved from http://incompleteideas.net/book/the-book-2nd.html