PPO
[[On-Policy & Off-Policy]] :-> nearly on-policy
- sample reuse
Key innovations :-> [[importance sampling]] and the clip mechanism
During policy optimization, the policy that interacts with the environment (the old policy) differs from the policy being updated (the new policy). We therefore need :-> to be able to sample from one distribution and still estimate expectations under another.
Importance sampling is used to estimate :-> the policy gradient
- The mismatch between the distribution of the sampled trajectories and the current policy's distribution causes :-> bias in the gradient estimate
- A ratio factor corrects this bias, giving a more accurate estimate of the policy gradient
[[Policy Gradient]] $g=\nabla J(\theta)=E_{a \sim \pi(\theta)}\left[Q_{\pi(\theta)}\left(s_t, a_t\right) \nabla \log \pi_\theta\right]$
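A minimal score-function sketch of this estimator in PyTorch (single state, three discrete actions; the `q_hat` values are illustrative stand-ins for $Q_{\pi(\theta)}(s_t, a_t)$):

```python
import torch
from torch.distributions import Categorical

# Toy single-state policy: the logits are the parameters theta.
theta = torch.zeros(3, requires_grad=True)
q_hat = torch.tensor([1.0, 0.0, -1.0])   # stand-in estimates of Q_pi(s_t, a_t) per action

dist = Categorical(logits=theta)
actions = dist.sample((1000,))           # a_t ~ pi_theta (on-policy sampling)

# Monte-Carlo estimate of E_{a ~ pi_theta}[ Q(s_t, a_t) * grad log pi_theta(a_t | s_t) ]:
# differentiate the average of Q * log pi(a|s) with respect to theta.
surrogate = (q_hat[actions] * dist.log_prob(actions)).mean()
surrogate.backward()
print(theta.grad)                        # stochastic estimate of the gradient g
```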
Letting a surrogate policy $\pi^{\prime}$ do the sampling turns this into :-> [[Off-Policy]]
- Gradient :-> $g=\nabla J(\theta)=E_{a \sim \pi(\theta)^{\prime}}\left[\frac{\pi(\theta)}{\pi(\theta)^{\prime}} Q_{\pi(\theta)^{\prime}}\left(s_t, a_t\right) \nabla \log \pi_\theta\right]$
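A toy numerical check (hypothetical three-action policies, single state) that sampling from the old policy $\pi(\theta)^{\prime}$ and weighting by $\frac{\pi(\theta)}{\pi(\theta)^{\prime}}$ recovers an expectation under the new policy $\pi(\theta)$:

```python
import numpy as np

rng = np.random.default_rng(0)

pi_new = np.array([0.7, 0.2, 0.1])      # target policy pi(theta), to be evaluated
pi_old = np.array([0.4, 0.4, 0.2])      # behaviour policy pi(theta)', does the sampling
q_values = np.array([1.0, 0.0, -1.0])   # per-action value estimates Q(s, a)

# Sample actions only from the OLD policy.
actions = rng.choice(3, size=100_000, p=pi_old)

# Importance weights pi(theta)(a) / pi(theta)'(a) correct the distribution mismatch,
# so the weighted sample mean estimates E_{a ~ pi(theta)}[Q(s, a)].
weights = pi_new[actions] / pi_old[actions]
is_estimate = np.mean(weights * q_values[actions])

exact = np.sum(pi_new * q_values)       # ground truth under the new policy
print(is_estimate, exact)               # the two values should be close (~0.6)
```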
PPO fixes the shortcomings of introducing importance sampling into the policy gradient
- Method :-> measure how much better each step's return is than the expected return, i.e. the advantage
- Gradient :-> $g=E_{a \sim \pi(\theta)}\left[A_{\pi(\theta)}\left(s_t, a_t\right) \nabla \log \pi_\theta\right]$
Gradient after combining importance sampling and the advantage function #card
$g=E_{a \sim \pi(\theta)^{\prime}}\left[\frac{\pi(\theta)}{\pi(\theta)^{\prime}} A_{\pi(\theta)^{\prime}}\left(s_t, a_t\right) \nabla \log \pi_\theta\right]$
- $\pi_\theta \nabla \log \pi_\theta=\nabla \pi_\theta$ :-> $g=E_{a \sim \pi(\theta)^{\prime}}\left[\frac{\nabla \pi(\theta)}{\pi(\theta)^{\prime}} A_{\pi(\theta)^{\prime}}\left(s_t, a_t\right)\right]$
The corresponding PPO objective function :-> $J(\theta)^{\prime}=E_{a \sim \pi(\theta)^{\prime}}\left[\frac{\pi(\theta)}{\pi(\theta)^{\prime}} A_{\pi(\theta)^{\prime}}\left(s_t, a_t\right)\right]$
To keep the sampling distribution from drifting too far from the original distribution, PPO adds a KL-divergence constraint. The objective becomes :-> $J(\theta)^{\prime}=E_{a \sim \pi(\theta)^{\prime}}\left[\frac{\pi(\theta)}{\pi(\theta)^{\prime}} A_{\pi(\theta)^{\prime}}\left(s_t, a_t\right)\right]-\beta \, KL\left(\pi(\theta)^{\prime}, \pi(\theta)\right)$, with the KL divergence acting as a regularizer
[[TRPO]] (Trust Region Policy Optimization) requires $KL\left(\theta, \theta^{\prime}\right)<\delta$
In practice, the expectation is estimated from samples and the KL term is omitted; the corresponding objective function is #card
- $J(\theta)^{\prime} \approx \sum_{(s, a)} \min \left(\frac{\pi(\theta)}{\pi(\theta)^{\prime}} A_{\pi(\theta)^{\prime}}, \operatorname{clip}\left(\frac{\pi(\theta)}{\pi(\theta)^{\prime}}, 1-\varepsilon, 1+\varepsilon\right) A_{\pi(\theta)^{\prime}}\right)$
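A minimal PyTorch sketch of this sampled objective (function name and argument layout are assumptions; `logp_new`/`logp_old` are log-probabilities of the taken actions under the current policy and a frozen copy of the old one):

```python
import torch

def ppo_clip_objective(logp_new: torch.Tensor,
                       logp_old: torch.Tensor,
                       advantages: torch.Tensor,
                       eps: float = 0.2) -> torch.Tensor:
    """Sampled clipped surrogate objective J(theta)', to be maximized."""
    # ratio = pi(theta)(a|s) / pi(theta)'(a|s), computed from log-probs for stability
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # element-wise minimum gives the pessimistic bound; average over sampled (s, a)
    return torch.min(unclipped, clipped).mean()

# typical use: loss = -ppo_clip_objective(logp_new, logp_old.detach(), adv); loss.backward()
```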
What problems do the KL divergence and the clip mechanism each solve? #card
The KL divergence handles the gap between the new and the old distribution
Clipping handles the gap between the policy before and after each update
PPO Loss :-> $L^{CLIP}(\theta)=\hat{\mathbb{E}}_t\left[\min \left(r_t(\theta) \hat{A}_t, \operatorname{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right) \hat{A}_t\right)\right]$
Ratio function formula :-> $r_t(\theta)=\frac{\pi_\theta\left(a_t \mid s_t\right)}{\pi_{\theta_{\text {old }}}\left(a_t \mid s_t\right)}$
- ratio function denotes the probability ratio between the current and old policy
- If $r_t(\theta)>1$, :-> the action $a_t$ at state $s_t$ is more likely in the current policy than the old policy.
- If $r_t(\theta)$ is between 0 and 1 :-> the action is less likely for the current policy than for the old one.
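In implementations the ratio is usually computed in log space; a small sketch with hypothetical logits for a single state:

```python
import torch
from torch.distributions import Categorical

# Hypothetical logits for one state from the current and the frozen old policy networks.
logits_new = torch.tensor([2.0, 0.5, -1.0])
logits_old = torch.tensor([1.0, 1.0, -0.5])
action = torch.tensor(0)                 # the action a_t that was actually taken

logp_new = Categorical(logits=logits_new).log_prob(action)
logp_old = Categorical(logits=logits_old).log_prob(action)

# r_t(theta) > 1: a_t is now more likely; 0 < r_t(theta) < 1: a_t is now less likely.
ratio = torch.exp(logp_new - logp_old)
print(ratio.item())
```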
The unclipped part :-> $L^{CPI}(\theta)=\hat{\mathbb{E}}_t\left[\frac{\pi_\theta\left(a_t \mid s_t\right)}{\pi_{\theta_{\text {old }}}\left(a_t \mid s_t\right)} \hat{A}_t\right]=\hat{\mathbb{E}}_t\left[r_t(\theta) \hat{A}_t\right]$
The clipped Part :-> $\operatorname{clip}\left(r_t(\theta), 1-\epsilon, 1+\epsilon\right) \hat{A}_t$
- The paper restricts the ratio to the range 0.8 to 1.2, i.e. $\epsilon=0.2$
$\hat{A}_t$ is the advantage estimator; the original PPO paper uses [[generalized advantage estimation]]
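A compact sketch of GAE over a single non-terminating trajectory (a real rollout buffer also needs terminal-state masking; the `gamma`/`lam` defaults are the commonly used values):

```python
import numpy as np

def gae(rewards, values, gamma=0.99, lam=0.95):
    """Generalized advantage estimation for one trajectory.

    `values` holds V(s_0), ..., V(s_T), i.e. one more entry than `rewards`.
    """
    T = len(rewards)
    advantages = np.zeros(T)
    next_adv = 0.0
    for t in reversed(range(T)):
        # TD residual: delta_t = r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[t] + gamma * values[t + 1] - values[t]
        # A_t = delta_t + gamma * lambda * A_{t+1}
        next_adv = delta + gamma * lam * next_adv
        advantages[t] = next_adv
    return advantages
```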
we take the minimum of the clipped and non-clipped objective, so the final objective is a lower bound (pessimistic bound) of the unclipped objective.
- Taking the minimum of the clipped and non-clipped objective means we’ll select either the clipped or the non-clipped objective based on the ratio and advantage situation.
This limits overly large policy changes between two successive updates.
TRPO (Trust Region Policy Optimization) uses a KL-divergence constraint outside the objective function to constrain the policy update, but this approach is complicated to implement and takes more computation time.
PPO clips the probability ratio directly in the objective function with its clipped surrogate objective.
$J(\theta)^{\prime} \approx \sum_{(s, a)} \min \left(\frac{\pi(\theta)}{\pi(\theta)^{\prime}} A_{\pi(\theta)^{\prime}}, \operatorname{clip}\left(\frac{\pi(\theta)}{\pi(\theta)^{\prime}}, 1-\varepsilon, 1+\varepsilon\right) A_{\pi(\theta)^{\prime}}\right)$, where $1-\varepsilon \leq \operatorname{clip}\left(\frac{\pi(\theta)}{\pi(\theta)^{\prime}}, 1-\varepsilon, 1+\varepsilon\right) \leq 1+\varepsilon$
- $A_{\pi(\theta)^{\prime}}>0$ :-> the current policy's action performs well, so $\pi(\theta)$ should be increased
$\min \left(\frac{\pi(\theta)}{\pi(\theta)^{\prime}} A_{\pi(\theta)^{\prime}}, \operatorname{clip}\left(\frac{\pi(\theta)}{\pi(\theta)^{\prime}}, 1-\varepsilon, 1+\varepsilon\right) A_{\pi(\theta)^{\prime}}\right)$ gives $\frac{\pi(\theta)}{\pi(\theta)^{\prime}} \leq 1+\varepsilon$
- The upper and lower bounds that clipping places on the parameter update prevent :-> the new and old distributions from drifting too far apart and introducing error
- $A_{\pi(\theta)^{\prime}}<0$ :-> the current policy's action performs poorly, so $\pi(\theta)$ should be decreased
$\min \left(\frac{\pi(\theta)}{\pi(\theta)^{\prime}} A_{\pi(\theta)^{\prime}}, \operatorname{clip}\left(\frac{\pi(\theta)}{\pi(\theta)^{\prime}}, 1-\varepsilon, 1+\varepsilon\right) A_{\pi(\theta)^{\prime}}\right)$ gives $\frac{\pi(\theta)}{\pi(\theta)^{\prime}} \geq 1-\varepsilon$ (both cases are checked numerically below)
- The clip lower bound prevents :-> having to change $\pi(\theta)$ drastically
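A tiny numeric check of the two cases above, using the paper's $\varepsilon=0.2$ (the ratios and advantages are made up for illustration):

```python
eps = 0.2

def clipped_term(ratio: float, adv: float) -> float:
    clipped = min(max(ratio, 1 - eps), 1 + eps) * adv   # clip(r, 1-eps, 1+eps) * A
    return min(ratio * adv, clipped)

# A > 0: the term saturates at (1 + eps) * A, so pushing the ratio above 1.2 gains nothing.
print(clipped_term(1.5, adv=+1.0))   # 1.2 rather than 1.5

# A < 0: the term is flat at (1 - eps) * A once the ratio drops below 0.8,
# so there is no incentive to shrink pi(theta) any further.
print(clipped_term(0.5, adv=-1.0))   # -0.8 rather than -0.5
```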
[[CLIP]] controls :-> the magnitude of policy updates
- Restricts the update range of $r_t(\theta)$ to :-> $[1-\epsilon, 1+\epsilon]$
- Effect :-> helps keep the algorithm stable and avoids introducing overly large policy changes in a single update.

[[PPO]] algorithm
The coefficient $\beta$ needs to be adjusted dynamically during training, by introducing thresholds $KL_{max}$ and $KL_{min}$.
If the current KL divergence exceeds $KL_{max}$, the sampling distribution and the updated distribution have drifted further apart, i.e. the penalty is not strong enough, so $\beta$ is increased;
conversely, if the current KL divergence falls below $KL_{min}$, $\beta$ is decreased.
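A sketch of this adaptive rule (the `kl_min`/`kl_max` thresholds mirror the text above; doubling and halving follow the adaptive-KL variant in the PPO paper, but the exact factors are a tunable choice):

```python
def update_kl_coef(beta: float, kl: float, kl_min: float, kl_max: float) -> float:
    """Dynamically adjust the KL-penalty coefficient beta between iterations."""
    if kl > kl_max:       # penalty too weak: the distributions drifted apart -> strengthen it
        beta *= 2.0
    elif kl < kl_min:     # penalty stronger than needed -> relax it
        beta /= 2.0
    return beta
```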

[[Ref]]