RL

CPI

Recall the value function, the state-action value function, and the advantage function:

\[V_{\pi}(s_t)=\mathbb{E}_{a_t, s_{t+1},a_{t+1},\cdots}[\sum_{l=0}^{\infty} \gamma^lr(s_{t+l})]\]
\[Q_{\pi}(s_t,a_t)=\mathbb{E}_{s_{t+1},a_{t+1},\cdots}[\sum_{l=0}^{\infty} \gamma^lr(s_{t+l})]\]
\[A_{\pi}(s,a)=Q_{\pi}(s,a)-V_{\pi}(s)\]

where

\[a_t\sim \pi(a_t|s_t),\quad s_{t+1}\sim P(s_{t+1}|s_t,a_t),\quad t\geq 0\]

Writing \(\eta(\pi)\) for the expected discounted return and \(\rho_\pi(s)=\sum_{t=0}^{\infty}\gamma^t P(s_t=s)\) for the discounted state visitation frequencies, the local approximation to \(\eta(\tilde\pi)\) replaces the visitation frequencies of \(\tilde\pi\) with those of \(\pi\):

\[L_{\pi}(\tilde\pi)=\eta(\pi)+\sum_s\rho_\pi(s)\sum_a\tilde\pi(a|s)A_{\pi}(s,a)\]
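To make these definitions concrete, here is a minimal tabular sketch (a toy setup assumed for illustration, not taken from any of the referenced papers): it evaluates \(V_\pi\), \(Q_\pi\), \(A_\pi\), \(\rho_\pi\), \(\eta(\pi)\), and the surrogate \(L_\pi(\tilde\pi)\) exactly by linear algebra, assuming a state-only reward r(s) as in the formulas above. The function names and array layout are my own conventions.

```python
import numpy as np

def policy_quantities(P, r, pi, mu, gamma):
    """P: (S, A, S) transition probabilities, r: (S,) state-only reward,
    pi: (S, A) policy, mu: (S,) initial state distribution."""
    S = P.shape[0]
    P_pi = np.einsum("sa,sat->st", pi, P)                  # state->state kernel under pi
    V = np.linalg.solve(np.eye(S) - gamma * P_pi, r)       # V = r + gamma * P_pi @ V
    Q = r[:, None] + gamma * np.einsum("sat,t->sa", P, V)  # Q_pi(s, a)
    A = Q - V[:, None]                                     # advantage A_pi(s, a)
    rho = np.linalg.solve(np.eye(S) - gamma * P_pi.T, mu)  # discounted visitation freq.
    eta = mu @ V                                           # expected discounted return
    return V, Q, A, rho, eta

def surrogate_L(P, r, pi, pi_tilde, mu, gamma):
    """L_pi(pi_tilde) = eta(pi) + sum_s rho_pi(s) sum_a pi_tilde(a|s) A_pi(s,a)."""
    _, _, A, rho, eta = policy_quantities(P, r, pi, mu, gamma)
    return eta + np.sum(rho[:, None] * pi_tilde * A)
```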

CPI (conservative policy iteration) constructs the new policy as a mixture of the old policy and a candidate policy \(\pi'\):

\[\pi_{\text{new}}(a|s)=(1-\alpha)\pi_{\text{old}}(a|s)+\alpha\pi'(a|s)\]

for which the following lower bound holds:

\[\eta(\pi_{\text{new}})\geq L_{\pi_{\text{old}}}(\pi_{\text{new}})-\frac{2\epsilon\gamma}{(1-\gamma)^2}\alpha^2\]

where

\[\epsilon = \max_{s}|\mathbb{E}_{a\sim \pi'(a|s)}[A_{\pi}(s,a)]|\]
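Continuing the same toy-MDP sketch (this reuses `policy_quantities` defined above; `alpha` and `pi_prime` are arbitrary illustrative inputs, not values prescribed by the source), the mixture update, \(\epsilon\), and the resulting lower bound can be checked numerically:

```python
def cpi_step(P, r, pi_old, pi_prime, mu, gamma, alpha):
    """Mixture update pi_new = (1 - alpha) * pi_old + alpha * pi_prime,
    together with the guaranteed lower bound on eta(pi_new)."""
    pi_new = (1 - alpha) * pi_old + alpha * pi_prime

    _, _, A, rho, eta_old = policy_quantities(P, r, pi_old, mu, gamma)
    # epsilon = max_s | E_{a ~ pi'(.|s)} [ A_{pi_old}(s, a) ] |
    eps = np.max(np.abs(np.sum(pi_prime * A, axis=1)))
    # Surrogate L_{pi_old}(pi_new) minus the penalty term gives the lower bound.
    L_new = eta_old + np.sum(rho[:, None] * pi_new * A)
    lower_bound = L_new - 2 * eps * gamma / (1 - gamma) ** 2 * alpha ** 2

    eta_new = policy_quantities(P, r, pi_new, mu, gamma)[-1]  # true return of pi_new
    assert eta_new >= lower_bound - 1e-9                      # the bound should hold
    return pi_new, lower_bound, eta_new
```

Because the penalty \(\frac{2\epsilon\gamma}{(1-\gamma)^2}\alpha^2\) is quadratic in \(\alpha\) while the surrogate gain is linear in \(\alpha\), a small enough mixture coefficient guarantees improvement whenever \(\pi'\) has positive expected advantage, which is the conservative aspect of CPI.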

See the blog post for details.

TRPO

PPO

GRPO

DeepSeek-R1-Zero

Pure reinforcement learning without supervision.

Training with reinforcement learning helps the model learn chains of thought (CoT = Chain of Thought).

At the same time, supervised data is hard to obtain, so relying on unsupervised data is preferable.

Reinforcement learning algorithm


References:

  1. Approximately Optimal Approximate Reinforcement Learning
  2. Proximal Policy Optimization Algorithms
  3. Trust Region Policy Optimization
  4. DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
  5. DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models