PPO

[[On-Policy & Off-Policy]] → near on-policy

  • sample reuse

Key innovations → [[Importance Sampling]] and the clip mechanism

  • During policy optimization, the policy that interacts with the environment (the old policy) differs from the policy being updated (the new policy). So we need → the ability to estimate one distribution from samples drawn from another.
  • Use importance sampling to estimate → the policy gradient
    • The mismatch between the sampling distribution and the current policy distribution causes → bias in the gradient estimate
      • Correct the bias with the {{c1 ratio factor}} to estimate the policy gradient more accurately
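The correction above can be sketched numerically: samples drawn from a proposal distribution q estimate an expectation under a target distribution p by weighting each sample with the density ratio p/q. The Gaussians below are toy choices for illustration only:

```python
import math
import random

random.seed(0)

# Toy target p = N(1, 1) and proposal q = N(0, 1); we estimate E_p[x]
# using only samples from q, reweighting each by the ratio p(x)/q(x).
def normal_pdf(x, mu):
    return math.exp(-0.5 * (x - mu) ** 2) / math.sqrt(2 * math.pi)

samples = [random.gauss(0.0, 1.0) for _ in range(200_000)]  # drawn from q
weights = [normal_pdf(x, 1.0) / normal_pdf(x, 0.0) for x in samples]
estimate = sum(w * x for w, x in zip(weights, samples)) / len(samples)

print(estimate)  # should land close to E_p[x] = 1
```

The estimator is unbiased, but its variance grows quickly as p and q drift apart — exactly the problem the KL constraint and clipping address later.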

[[Policy Gradient]] $g=\nabla J(\theta)=E_{a \sim \pi_\theta}\left[Q_{\pi_\theta}\left(s_t, a_t\right) \nabla \log \pi_\theta\right]$

  • Let policy $\pi'$ do the sampling instead, converting to → [[Off-Policy]]
    • Gradient → $g=\nabla J(\theta)=E_{a \sim \pi_{\theta'}}\left[\frac{\pi_\theta}{\pi_{\theta'}} Q_{\pi_{\theta'}}\left(s_t, a_t\right) \nabla \log \pi_\theta\right]$
  • PPO fixes the shortcomings of importance sampling once it is introduced into the PG
    • Method → measure how much better each step's return is than the expected return, i.e. the advantage
    • Gradient → $g=E_{a \sim \pi_\theta}\left[A_{\pi_\theta}\left(s_t, a_t\right) \nabla \log \pi_\theta\right]$
  • Gradient after combining importance sampling and the advantage function #card
    • $g=E_{a \sim \pi_{\theta'}}\left[\frac{\pi_\theta}{\pi_{\theta'}} A_{\pi_{\theta'}}\left(s_t, a_t\right) \nabla \log \pi_\theta\right]$
      • Since $\pi_\theta \nabla \log \pi_\theta=\nabla \pi_\theta$ → $g=E_{a \sim \pi_{\theta'}}\left[\frac{\nabla \pi_\theta}{\pi_{\theta'}} A_{\pi_{\theta'}}\left(s_t, a_t\right)\right]$
  • The corresponding PPO objective function → $J(\theta)'=E_{a \sim \pi_{\theta'}}\left[\frac{\pi_\theta}{\pi_{\theta'}} A_{\pi_{\theta'}}\left(s_t, a_t\right)\right]$
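A minimal sketch of evaluating this surrogate objective over a batch of samples; the per-sample probabilities and advantage estimates below are hypothetical numbers, not outputs of any real policy:

```python
# Hypothetical per-sample probabilities pi_theta(a|s) under the new policy,
# the old (sampling) policy pi_theta', and advantage estimates A_{pi_theta'}.
pi_new = [0.30, 0.50, 0.10, 0.60]
pi_old = [0.25, 0.55, 0.20, 0.50]
advantages = [1.2, -0.4, 0.8, 0.3]

# J(theta)' ~= sample mean of (pi_new / pi_old) * A
surrogate = sum(
    (p_new / p_old) * adv
    for p_new, p_old, adv in zip(pi_new, pi_old, advantages)
) / len(pi_new)
print(surrogate)
```

Maximizing this quantity with respect to the new policy's parameters (while the old policy's probabilities stay fixed) is what a single PPO update step does, before any KL or clip safeguard is added.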

To overcome {{c2 the problem that the sampling distribution drifts too far from the original distribution}}, PPO introduces a {{c1 KL divergence}} constraint.

  • The formula becomes → $J(\theta)'=E_{a \sim \pi_{\theta'}}\left[\frac{\pi_\theta}{\pi_{\theta'}} A_{\pi_{\theta'}}\left(s_t, a_t\right)\right]-\beta\, KL\left(\pi_{\theta'}, \pi_\theta\right)$, with the KL distance acting as a regularizer
  • [[TRPO]] (Trust Region Policy Optimization) instead requires $KL\left(\theta, \theta'\right)<\delta$
  • In practice the expectation is estimated from samples and the KL computation is dropped; the corresponding objective #card
    • $J(\theta)' \approx \sum_{(s, a)} \min \left(\frac{\pi_\theta}{\pi_{\theta'}} A_{\pi_{\theta'}}, \operatorname{clip}\left(\frac{\pi_\theta}{\pi_{\theta'}}, 1-\varepsilon, 1+\varepsilon\right) A_{\pi_{\theta'}}\right)$
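The KL-regularized form can be sketched for a categorical policy at a single state; all probabilities, advantages, and the coefficient β below are illustrative values:

```python
import math

# Action probabilities of the old (sampling) and new policy at one state.
pi_old = [0.5, 0.3, 0.2]        # pi_theta'
pi_new = [0.4, 0.4, 0.2]        # pi_theta
advantages = [1.0, -0.5, 0.2]   # A_{pi_theta'}(s, a) per action (illustrative)
beta = 0.1                      # KL penalty coefficient

# Expectation under pi_theta' of (pi_theta / pi_theta') * A;
# the ratio cancels the sampling distribution, leaving sum(pi_new * A).
surrogate = sum(q * (p / q) * a for p, q, a in zip(pi_new, pi_old, advantages))

# KL(pi_theta' || pi_theta)
kl = sum(q * math.log(q / p) for q, p in zip(pi_old, pi_new))

objective = surrogate - beta * kl
print(objective)
```

The penalty term shrinks the objective as the two distributions separate, discouraging updates that move π_θ far from the sampling policy.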

What problems do the KL divergence and the clip mechanism each solve? #card

  • KL divergence constrains the gap between the new and old distributions
  • clip constrains the gap between successive policies

PPO Loss → $L^{C L I P}(\theta)=\hat{\mathbb{E}}_t\left[\min \left(r_t(\theta) \hat{A}_t, \operatorname{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right) \hat{A}_t\right)\right]$

  • Ratio function formula → $r_t(\theta)=\frac{\pi_\theta\left(a_t \mid s_t\right)}{\pi_{\theta_{\text {old }}}\left(a_t \mid s_t\right)}$
    • ratio function denotes the probability ratio between {{c1 the current and old policy}}
      • If $r_t(\theta)>1$, → the action $a_t$ at state $s_t$ is more likely in the current policy than the old policy.
      • If $r_t(\theta)$ is between 0 and 1 → the action is less likely for the current policy than for the old one.
  • The unclipped part → $L^{CPI}(\theta)=\hat{\mathbb{E}}_t\left[\frac{\pi_\theta\left(a_t \mid s_t\right)}{\pi_{\theta_{\text {old }}}\left(a_t \mid s_t\right)} \hat{A}_t\right]=\hat{\mathbb{E}}_t\left[r_t(\theta) \hat{A}_t\right]$
  • The clipped Part → $\operatorname{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right) \hat{A}_t$
    • The paper limits the range to 0.8–1.2 (i.e. $\epsilon=0.2$)
  • $\hat{A}_t$ is the {{c1 advantage estimator}}; the original PPO paper uses {{c1 [[generalized advantage estimation]]}}
  • we take the minimum of the clipped and non-clipped objective, so the final objective is a {{c1 lower bound (pessimistic bound)}} of the unclipped objective.
    • Taking the minimum of the clipped and non-clipped objective means we’ll select either the clipped or the non-clipped objective based on {{c1 the ratio and advantage situation.}}
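The pieces above combine into the clipped surrogate loss. A sketch in plain Python, computing the ratio from log-probabilities (r = exp(log π_new − log π_old)) as is done in practice for numerical stability; the inputs are hypothetical:

```python
import math

def ppo_clip_objective(logp_new, logp_old, advantages, eps=0.2):
    """Clipped surrogate objective L^CLIP, averaged over samples."""
    total = 0.0
    for lp_n, lp_o, adv in zip(logp_new, logp_old, advantages):
        ratio = math.exp(lp_n - lp_o)                # r_t(theta)
        clipped = min(max(ratio, 1 - eps), 1 + eps)  # clip to [1-eps, 1+eps]
        total += min(ratio * adv, clipped * adv)     # pessimistic lower bound
    return total / len(advantages)

# Hypothetical log-probabilities and advantage estimates for three samples.
logp_new = [-1.0, -0.5, -2.0]
logp_old = [-1.2, -0.4, -1.5]
adv = [1.0, -0.8, 0.5]
print(ppo_clip_objective(logp_new, logp_old, adv))
```

In a training loop this quantity is maximized (or its negation minimized by gradient descent), with the old log-probabilities treated as constants.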

Avoid excessively large policy updates between two successive updates

  • TRPO (Trust Region Policy Optimization) uses {{c1 KL divergence}} constraints outside the objective function to constrain the policy update. But this method is complicated to implement and takes more computation time.
  • PPO {{c1 clips the probability ratio directly}} in the objective function via its clipped surrogate objective.


  • $1-\varepsilon \leq \operatorname{clip}\left(\frac{\pi_\theta}{\pi_{\theta'}}, 1-\varepsilon, 1+\varepsilon\right) \leq 1+\varepsilon$
    • $A_{\pi_{\theta'}}>0$ → the current policy performs well, so we want to increase $\pi_\theta$
      • $\min \left(\frac{\pi_\theta}{\pi_{\theta'}} A_{\pi_{\theta'}}, \operatorname{clip}\left(\frac{\pi_\theta}{\pi_{\theta'}}, 1-\varepsilon, 1+\varepsilon\right) A_{\pi_{\theta'}}\right)$ gives $\frac{\pi_\theta}{\pi_{\theta'}} \leq 1+\varepsilon$
        • Clipping caps the parameter update to prevent → the new and old distributions from drifting too far apart, which would introduce error
    • $A_{\pi_{\theta'}}<0$ → the current policy performs poorly
      • $\min \left(\frac{\pi_\theta}{\pi_{\theta'}} A_{\pi_{\theta'}}, \operatorname{clip}\left(\frac{\pi_\theta}{\pi_{\theta'}}, 1-\varepsilon, 1+\varepsilon\right) A_{\pi_{\theta'}}\right)$ gives $\frac{\pi_\theta}{\pi_{\theta'}} \geq 1-\varepsilon$
        • The clip lower bound prevents → changing $\pi_\theta$ too drastically
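Both cases can be checked numerically (ε = 0.2 here): with A > 0 the min caps the effective ratio at 1 + ε; with A < 0 the min selects the clipped, more pessimistic term, whose gradient with respect to the ratio vanishes, so the ratio is never pushed below 1 − ε:

```python
def clipped_term(ratio, adv, eps=0.2):
    # One term of the PPO clipped objective for a single sample.
    clipped = min(max(ratio, 1 - eps), 1 + eps)
    return min(ratio * adv, clipped * adv)

# A > 0: a ratio of 1.5 is capped, so the term equals (1 + eps) * A.
print(clipped_term(1.5, 2.0))   # -> 2.4

# A < 0: a ratio of 0.5 is floored, so the term equals (1 - eps) * A,
# the smaller (more pessimistic) of the two candidates.
print(clipped_term(0.5, -2.0))  # -> -1.6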

[[CLIP]] controls → the magnitude of policy updates

  • Restricts $r_t(\theta)$ to the range → $[1-\epsilon, 1+\epsilon]$
  • Effect → helps keep the algorithm stable and avoids introducing overly large policy changes in a single update.

[[PPO]] algorithm

  • The coefficient $\beta$ must be adjusted dynamically during iteration. Introduce $KL_{max}$ and $KL_{min}$: if $KL > KL_{max}$, the penalty is not doing its job, so increase $\beta$.
    • If the current KL distance exceeds the maximum, the sampling distribution has drifted even further from the updated distribution, meaning the constraint is not strong enough, so increase $\beta$;
    • conversely, if the current KL distance falls below the minimum, shrink $\beta$.
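This adjustment rule can be sketched as follows; the thresholds and the doubling/halving factor are illustrative choices, not values from the paper:

```python
def update_beta(beta, kl, kl_max, kl_min, factor=2.0):
    """Adapt the KL-penalty coefficient between iterations (illustrative rule)."""
    if kl > kl_max:       # penalty too weak: distributions drifted apart
        return beta * factor
    if kl < kl_min:       # penalty too strong: allow larger updates
        return beta / factor
    return beta           # within the target band: leave beta unchanged

beta = 1.0
print(update_beta(beta, kl=0.05, kl_max=0.02, kl_min=0.005))   # -> 2.0
print(update_beta(beta, kl=0.001, kl_max=0.02, kl_min=0.005))  # -> 0.5
```

The PPO paper's adaptive-KL variant uses a similar scheme with a single target KL and multiply/divide updates; the two-threshold band above mirrors the rule described in this note.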

Author: Ryen Xiang

Published: 2026-02-17

Updated: 2026-02-17