Theory
Supervised Fine-tuning
Definition 1 (SFT Loss). Given a dataset $\mathcal{D}_{\text{SFT}} \triangleq \left\{ (q_i, c_i) \right\}_{i=1}^{|\mathcal{D}_{\text{SFT}}|}$, where each sample $(q_i, c_i)$ consists of a question $q_i$ and a long CoT $c_i$ (itself a complex intermediate rationale followed by a final answer), SFT updates the parameters of the policy model $\pi_{\theta}$ by minimizing the negative log-likelihood loss:
$$ \begin{aligned} \mathcal{L}_{\text{SFT}}(\theta) \triangleq - \mathbb{E}_{(q, c) \sim \mathcal{D}_{\text{SFT}}} \left[ \log \pi_{\theta}(c \mid q) \right], \end{aligned} $$
where $\pi_{\theta}(c \mid q)$ denotes the probability assigned by the policy to the CoT response $c$ conditioned on the question $q$. This objective encourages the model to imitate the supervised demonstrations by maximizing the likelihood of the reference completions.
If every sample is equally likely to be selected, the expectation in this loss reduces to a simple average over all training examples in the dataset:$$ \mathbb{E}_{(q,c) \sim \mathcal{D}_{\text{SFT}}}[\log \pi_\theta(c | q)] = \frac{1}{|\mathcal{D}_{\text{SFT}}|} \sum_{i=1}^{|\mathcal{D}_{\text{SFT}}|} \log \pi_\theta(c_i | q_i) $$
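In code, the per-sample term $-\log \pi_\theta(c | q)$ is simply the sum of per-token negative log-probabilities over the completion tokens. Below is a minimal PyTorch sketch of the SFT loss; the tensor shapes and the `completion_mask` helper (marking which tokens belong to $c$ rather than $q$) are assumptions made for illustration, not any particular framework's API.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits: torch.Tensor, labels: torch.Tensor, completion_mask: torch.Tensor) -> torch.Tensor:
    """Negative log-likelihood of the CoT completion c given the question q.

    logits:          (batch, seq_len, vocab) next-token logits from pi_theta
    labels:          (batch, seq_len)        token ids of the concatenated (q, c) sequence
    completion_mask: (batch, seq_len)        1 for tokens of c, 0 for the prompt q (hypothetical helper)
    """
    # Shift so that the logits at position t predict the token at position t + 1.
    logp = F.log_softmax(logits[:, :-1], dim=-1)
    targets = labels[:, 1:]
    mask = completion_mask[:, 1:].float()

    # log pi_theta(c | q) = sum of per-token log-probs over the completion tokens.
    token_logp = logp.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    seq_logp = (token_logp * mask).sum(dim=-1)

    # Expectation over the dataset -> mean over the batch, negated for minimization.
    return -seq_logp.mean()

# Toy usage with random tensors standing in for real model outputs.
batch, seq_len, vocab = 2, 8, 50
logits = torch.randn(batch, seq_len, vocab)
labels = torch.randint(0, vocab, (batch, seq_len))
completion_mask = torch.tensor([[0, 0, 0, 1, 1, 1, 1, 1]] * batch)
print(sft_loss(logits, labels, completion_mask))
```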
LLM Policy Optimization
Recent studies have introduced a groundbreaking post-training paradigm that enhances LLMs' reasoning capabilities through RL-based training. In this framework, the LLM's answer generation process for each query is formulated as an answer sampling policy, and the objective is to optimize this LLM policy to maximize the expected reward of the generated responses. According to (DeepSeek-AI et al., 2025; Hu et al., 2025; Team et al., 2025), large-scale RL-based LLM policy optimization enables the base LLM to achieve a steady improvement in reasoning accuracy while also exhibiting the emergence of long chain-of-thought reasoning.
Definition 2 (LLM Policy Optimization). Suppose each reasoning data pair $(q,a)$ is i.i.d. sampled from an underlying distribution $\mathcal{D}$, where each $q$ is a query and $a$ is the corresponding ground-truth answer. Let $\pi_{\theta}(\cdot | \cdot)$ be the target LLM policy parameterized by $\theta$. The expected reward of the LLM on a sample $(q,a)$ is $\mathbb{E}_{o\sim \pi_{\theta}(\cdot | q)} [r(o, a)]$, where $o$ is an LLM-generated response to $q$, and $r(\cdot,\cdot)$ is a predefined reward function that quantifies whether the response $o$ arrives at the ground-truth answer $a$. The objective of RL-based fine-tuning is to maximize the expected reward over the data distribution, i.e.,
$$ \begin{align*} \max_{\theta} J(\pi_{\theta}) \triangleq \mathbb{E}_{(q,a)\sim \mathcal{D}} \mathbb{E}_{o\sim \pi_{\theta}(\cdot|q)} [r(o, a)]. \end{align*} $$
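For verifiable reasoning tasks, $r(o, a)$ is typically a rule-based check on the final answer. The sketch below shows one hypothetical choice, a binary exact-match reward on a trailing `Answer:` line; the format and matching rule are illustrative assumptions, and practical verifiers are usually more tolerant of formatting.

```python
def reward(response: str, answer: str) -> float:
    """Rule-based reward r(o, a): 1.0 if the final answer matches the ground truth, else 0.0.

    Assumes the response ends with a line of the form 'Answer: <value>'; both the
    format and the exact string match are illustrative choices, not a fixed standard.
    """
    last_line = response.strip().splitlines()[-1]
    predicted = last_line.removeprefix("Answer:").strip()
    return 1.0 if predicted == answer.strip() else 0.0

print(reward("Add the pivots: 40 + 2 = 42.\nAnswer: 42", "42"))  # 1.0
print(reward("Add the pivots: 40 + 1 = 41.\nAnswer: 41", "42"))  # 0.0
```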
A straightforward approach to maximize $J(\pi_{\theta})$ is to gradually move the LLM's parameters $\theta$ along the policy gradient direction $\nabla_{\theta} J(\pi_{\theta})$. However, since $ \nabla_{\theta} \mathbb{E}_{o \sim \pi_{\theta}(\cdot | q)} r(o, a) $ is the gradient of an expectation taken over a distribution that itself depends on $ \pi_\theta $, $ \nabla_{\theta} J(\pi_{\theta}) $ cannot be estimated directly via standard Monte Carlo sampling. Fortunately, the RL community has developed two powerful policy gradient estimators: REINFORCE (Williams, 1992) and Importance Sampling (Sutton & Barto, 2018):
$$\small \begin{align*} \nabla_{\theta} \mathbb{E}_{o\sim \pi_{\theta}(\cdot|q)} r(o, a) = \begin{cases} \mathbb{E}_{o \sim \pi_{\theta}(\cdot|q)} \left[ \nabla_{\theta} \log \pi_{\theta}(o|q) \cdot r(o, a) \right]\ &\text{(REINFORCE)}, \\ \mathbb{E}_{o \sim \pi_{\theta'}(\cdot|q)} \left[ \nabla_{\theta} \left( \frac{\pi_{\theta}(o|q)}{\pi_{\theta'}(o|q)} \right) \cdot r(o, a) \right]\ &\text{(Importance Sampling)}, \end{cases} \end{align*} $$
where $\pi_{\theta'}$ is any parameter-frozen LLM policy. Hence, the policy gradient $\nabla_{\theta} J(\pi_\theta)$ can be effectively approximated using standard Monte Carlo sampling: for each data pair $(q,a)$, we independently generate $G$ responses to $q$, denoted by $\{o_i\}_{i=1}^G$, using the current LLM $\pi_\theta$ or the frozen LLM $\pi_{\theta'}$, and then approximate the policy gradient estimators by
$$\small \begin{align*} \nabla_{\theta} J(\pi_\theta) = \begin{cases} \mathbb{E}_{(q,a)\sim \mathcal{D},\{o_i\}_{i=1}^G\sim \pi_{\theta}(\cdot|q)} \left[ \frac{1}{G} \sum_{i=1}^G \nabla_{\theta} \log \pi_{\theta}(o_i|q) \cdot r(o_i, a) \right] &\text{(REINFORCE)}, \\ \mathbb{E}_{(q,a)\sim \mathcal{D},\{o_i\}_{i=1}^G\sim \pi_{\theta'}(\cdot|q)} \left[ \frac{1}{G} \sum_{i=1}^G \nabla_{\theta} \left( \frac{\pi_{\theta}(o_i|q)}{\pi_{\theta'}(o_i|q)} \right) \cdot r(o_i, a) \right] & \text{(Importance Sampling)}, \end{cases} \end{align*} $$
For each query $q$, the procedure of generating $G$ independent responses $\{o_{i}\}_{i=1}^G$ from $\pi_{\theta}(\cdot|q)$ is called the "rollout phase". During this phase, the LLM policy explores a large set of response samples of varying quality. Then $\theta$ is updated to increase the likelihood $ \pi_{\theta}(o_i|q) $ of responses for which $ r(o_i, a)$ is large, thereby improving the likelihood of generating high-reward responses. Specifically, REINFORCE is an on-policy method that requires generating new rollouts using the latest LLM policy $ \pi_{\theta} $. In contrast, the importance sampling estimator can be implemented in an off-policy manner with improved sampling efficiency, as it can reuse past rollouts generated from $\pi_{\theta'}$ by storing the corresponding probability terms $ \pi_{\theta'}(o_i | q)$. A common choice is to implement $\pi_{\theta'}$ as $\pi_{\theta_{\mathrm{old}}}$, a past snapshot of the target LLM $\pi_\theta$ that is updated periodically.
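In an autograd framework, both estimators are commonly implemented as surrogate losses whose gradients coincide with the expressions above. The minimal sketch below operates on per-rollout sequence log-probabilities; the shapes, variable names, and toy numbers are assumptions made for illustration.

```python
import torch

def reinforce_loss(logp: torch.Tensor, rewards: torch.Tensor) -> torch.Tensor:
    """Surrogate loss whose gradient equals the REINFORCE estimator.

    logp:    (G,) sequence log-probs log pi_theta(o_i | q) for G on-policy rollouts
    rewards: (G,) scalar rewards r(o_i, a)
    """
    return -(logp * rewards).mean()

def importance_sampling_loss(logp: torch.Tensor, logp_old: torch.Tensor,
                             rewards: torch.Tensor) -> torch.Tensor:
    """Surrogate loss whose gradient equals the importance sampling estimator.

    logp_old: (G,) log pi_theta'(o_i | q) recorded when the rollouts were generated.
    """
    ratio = torch.exp(logp - logp_old.detach())  # pi_theta / pi_theta', no gradient to theta'
    return -(ratio * rewards).mean()

# Toy check with G = 4 rollouts: both surrogates backpropagate into the policy log-probs.
logp = torch.randn(4, requires_grad=True)        # stands in for summed token log-probs
logp_old = logp.detach() + 0.1 * torch.randn(4)
rewards = torch.tensor([1.0, 0.0, 1.0, 0.0])
reinforce_loss(logp, rewards).backward()
print(logp.grad)  # equals -r_i / G per rollout, i.e. the Monte Carlo REINFORCE direction
```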
In practice, the reward signals $\{r(o_i, a)\}_{i=1}^G$ are highly sparse, leading to high variance in the rollout phase and in the policy gradient estimates. To mitigate these issues, various techniques have been developed to stabilize the LLM policy gradient estimators above. These techniques generally fall into three categories: 1) reducing sampling variance via reward normalization or actor-critic advantage estimation, 2) stabilizing parameter updates by clipping the importance sampling weight $ \pi_\theta(o_i|q) / \pi_{\theta_{\mathrm{old}}}(o_i|q)$, and 3) constraining policy shifts by penalizing the KL-divergence $\mathrm{KL}(\pi_{\theta} | \pi_{\mathrm{ref}})$ between the current LLM policy $\pi_{\theta}$ and a fixed reference LLM policy $\pi_{\mathrm{ref}}$.
PPO
Since its introduction in (Schulman et al., 2017), Proximal Policy Optimization (PPO) has become one of the most popular actor-critic RL algorithms for LLM policy optimization (Ouyang et al., 2022; Hu et al., 2025). In addition to the target LLM policy $\pi_\theta$, which serves as the actor model, PPO introduces a critic model $V_{\phi}$, another LLM trained to estimate the value of the responses generated by the actor LLM $\pi_\theta$. Specifically, the PPO objective is
$$\small \begin{equation*} \begin{aligned} J_{\mathrm{PPO}}(\pi_\theta) \triangleq &\ \mathbb{E}_{(q, a) \sim \mathcal{D},\left\{o_i\right\}_{i=1}^G \sim \pi_{\theta_{\text{old}}}(\cdot \mid q)} \\ \Bigg[ & \frac{1}{G} \sum_{i=1}^G \frac{1}{\left| o_i \right|} \sum_{t=1}^{\left|o_i\right|} \left( \min \Big( r_{i, t}(\theta) \textcolor{red}{\hat{A}_{i, t}(\phi)}, \textcolor{red}{\operatorname{clip}}\left(r_{i, t}(\theta), 1 - \varepsilon, 1 + \varepsilon\right) \textcolor{red}{\hat{A}_{i, t}(\phi)} \Big) \right) \Bigg], \end{aligned} \end{equation*} $$
where $r_{i,t}(\theta)\triangleq \pi_{\theta}(o_{i,t}|q,o_{i, < t})/ \pi_{\theta_{\mathrm{old}}}(o_{i,t}|q,o_{i, < t})$ denotes the likelihood ratio between the current LLM policy $\pi_\theta$ and the past LLM policy $\pi_{\theta_{\mathrm{old}}}$ at the $t$-th token prediction step, and $\hat{A}_{i, t}(\phi)$ denotes the Generalized Advantage Estimator (GAE) (Schulman et al., 2018) computed from the estimated values $V_{\phi}(o_{i,t}|q,o_{i, < t})$, which quantify the quality of each response generation state. $ V_{\phi} $ is trained alongside $\pi_{\theta}$ to predict the value of the responses generated by $\pi_{\theta}$. In practice (Hu et al., 2025), GAE is observed to be a more robust response quality estimator than the raw reward $r(o_i, a)$, leading to more stable LLM policy optimization.
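To make the two PPO-specific ingredients concrete, here is a minimal sketch of the GAE recursion over per-token values and the clipped surrogate for a single response. The parameters $\gamma$, $\lambda$, the clip range $\varepsilon$, and the toy reward placement are illustrative defaults, not values prescribed by the cited papers.

```python
import torch

def gae(rewards: torch.Tensor, values: torch.Tensor,
        gamma: float = 1.0, lam: float = 0.95) -> torch.Tensor:
    """Generalized Advantage Estimation for one response of T token steps.

    rewards: (T,)     per-token rewards (here zero everywhere except the final token)
    values:  (T + 1,) critic values V_phi per state, with a terminal bootstrap value appended
    """
    T = rewards.shape[0]
    adv = torch.zeros(T)
    running = 0.0
    for t in reversed(range(T)):
        delta = rewards[t] + gamma * values[t + 1] - values[t]   # TD residual
        running = delta + gamma * lam * running                  # discounted sum of residuals
        adv[t] = running
    return adv

def ppo_clip_objective(logp: torch.Tensor, logp_old: torch.Tensor,
                       advantages: torch.Tensor, eps: float = 0.2) -> torch.Tensor:
    """Token-level clipped surrogate for a single response (a quantity to maximize)."""
    ratio = torch.exp(logp - logp_old)                        # r_{i,t}(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    return torch.minimum(unclipped, clipped).mean()           # 1/|o_i| sum_t min(...)

# Toy usage: a 5-token response whose only reward arrives at the final token.
rewards = torch.tensor([0.0, 0.0, 0.0, 0.0, 1.0])
values = torch.tensor([0.1, 0.2, 0.3, 0.4, 0.5, 0.0])   # hypothetical critic outputs + bootstrap
advantages = gae(rewards, values)
logp_old = torch.randn(5)
logp = logp_old + 0.05 * torch.randn(5)
print(ppo_clip_objective(logp, logp_old, advantages))
```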
GRPO
Group Relative Policy Optimization (GRPO) was proposed in (DeepSeek-AI et al., 2025) as an effective and efficient variant of PPO. Specifically, GRPO discards PPO's critic model and GAE computation to reduce memory consumption and improve efficiency. To reduce the reward sampling variance, GRPO instead normalizes the rewards within a group of $G$ rollouts. In addition to clipping the likelihood ratio terms, GRPO further introduces a KL-divergence penalty to ensure that $\pi_\theta$ is not driven far away from the initial SFT LLM. Specifically, the GRPO objective is
$$\footnotesize \begin{equation*} \begin{aligned} J_{\mathrm{GRPO}}(\pi_\theta) \triangleq &\ \mathbb{E}_{(q, a) \sim \mathcal{D},\left\{o_i\right\}_{i=1}^G \sim \pi_{\theta_{\text{old}}}(\cdot \mid q)} \\ \Bigg[ & \frac{1}{G} \sum_{i=1}^G \frac{1}{\left| o_i \right|} \sum_{t=1}^{\left|o_i\right|} \left( \min \Big( r_{i, t}(\theta) \textcolor{red}{\hat{A}_{i, t}}, \operatorname{clip}\left(r_{i, t}(\theta), 1 - \varepsilon, 1 + \varepsilon\right) \textcolor{red}{\hat{A}_{i, t}} \Big) - \textcolor{red}{\beta\, \mathrm{KL}(\pi_{\theta}|\pi_{\mathrm{ref}})_{i,t}} \right) \Bigg], \end{aligned} \end{equation*} $$
where $\hat{A}_{i,t} \triangleq \left(r(o_i,a) - \mathrm{mean}(\mathbf{r})\right)/\mathrm{std}(\mathbf{r})$ denotes the group relative advantage, with $\mathbf{r} \triangleq \{r(o_j,a)\}_{j=1}^G$ the rewards of the response group corresponding to each sample $(q,a)$. GRPO also incorporates the K3 KL-divergence estimator (Schulman, 2020):
$$ \begin{align*} \mathrm{KL}(\pi_{\theta}|\pi_{\mathrm{ref}})_{i,t} \triangleq \frac{\pi_{\mathrm{ref}}(o_{i,t}|q,o_{i, < t})}{\pi_{\theta}(o_{i,t}|q,o_{i, < t})} - \log \frac{\pi_{\mathrm{ref}}(o_{i,t}|q,o_{i, < t})}{\pi_{\theta}(o_{i,t}|q,o_{i, < t})} - 1. \end{align*} $$
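Both GRPO-specific quantities amount to a few lines of code: the group-normalized advantage shared by every token of a rollout, and the per-token K3 estimator. The sketch below assumes per-token log-probabilities have already been gathered for the sampled tokens; the small `eps` guarding against a zero standard deviation is a common implementation detail, not part of the definition above.

```python
import torch

def group_relative_advantages(rewards: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """GRPO's critic-free advantage: normalize rewards within a group of G rollouts.

    eps guards against a zero std when every rollout receives the same reward;
    it is a common implementation choice rather than part of the definition.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def k3_kl(logp: torch.Tensor, logp_ref: torch.Tensor) -> torch.Tensor:
    """Per-token K3 estimator of KL(pi_theta | pi_ref): ratio - log(ratio) - 1, always >= 0."""
    log_ratio = logp_ref - logp            # log [ pi_ref(o_t | .) / pi_theta(o_t | .) ]
    return torch.exp(log_ratio) - log_ratio - 1.0

# Toy usage: G = 4 rollouts with binary rewards, and 6 sampled tokens of one rollout.
rewards = torch.tensor([1.0, 0.0, 0.0, 1.0])
print(group_relative_advantages(rewards))     # positive for rewarded rollouts, negative otherwise

logp = torch.log(torch.rand(6))               # per-token log-probs under pi_theta
logp_ref = torch.log(torch.rand(6))           # per-token log-probs under pi_ref
print(k3_kl(logp, logp_ref))                  # non-negative, zero where the two policies agree
```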
DeepSeek-R1 (DeepSeek-AI et al., 2025) shows that GRPO enables stable large-scale LLM policy optimization and incentivizes the long CoT pattern in large LLMs.
Citation
If you found this post useful for your work, please consider citing it as:
Razvan Florian Vasile. (June 2025). "LinAlgZero: Theory". Atom Blog. Retrieved from https://atomwalk12.github.io/posts/lingalgzero/theory/.
or
@misc{vasile2025linalgzero,
title = "LinAlgZero: Theory",
author = "Razvan Florian Vasile",
note = "Personal blog",
year = "2025",
month = "June",
url = "https://atomwalk12.github.io/posts/lingalgzero/theory/"
}
References
- [1] DeepSeek-AI et al. (2025). DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning. https://doi.org/10.48550/arXiv.2501.12948
- [2] Schulman, J., et al. (2018). High-Dimensional Continuous Control Using Generalized Advantage Estimation. https://doi.org/10.48550/arXiv.1506.02438
- [3] Team, Kimi, et al. (2025). Kimi k1.5: Scaling Reinforcement Learning with LLMs. https://doi.org/10.48550/arXiv.2501.12599
- [4] Schulman, J. (2020). Approximating KL Divergence. Retrieved from http://joschu.net/blog/kl-approx.html
- [5] Hu, J., et al. (2025). Open-Reasoner-Zero: An Open Source Approach to Scaling Up Reinforcement Learning on the Base Model. https://doi.org/10.48550/arXiv.2503.24290
- [6] Schulman, J., et al. (2017). Proximal Policy Optimization Algorithms. https://doi.org/10.48550/arXiv.1707.06347
- [7] Williams, R. J. (1992). Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3), 229–256. https://doi.org/10.1007/BF00992696
- [8] Ouyang, L., et al. (2022). Training language models to follow instructions with human feedback. Retrieved from https://www.semanticscholar.org/paper/Training-language-models-to-follow-instructions-Ouyang-Wu/d766bffc357127e0dc86dd69561d5aeb520d6f4c
- [9] Sutton, R. S., & Barto, A. G. (2018). Reinforcement Learning: An Introduction (2nd ed.). Retrieved from http://incompleteideas.net/book/the-book-2nd.html