2026-02-17 · Quick notes

Differences between Post Norm and Pre Norm

[[Pre Norm]] ↔ $x_{n+1}=x_{n}+f\left(\operatorname{norm}\left(x_{n}\right)\right)$
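- Because of the norm, the variance of the second term $f(\operatorname{norm}(x_{n}))$ does not change with depth, while the variance of x accumulates along the trunk layer by layer. In deep layers a single layer has little effect on the trunk, and different layers look statistically similar.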
- $x_{n+2}=x_{n+1}+f\left(\operatorname{norm}\left(x_{n+1}\right)\right)=x_{n}+f\left(\operatorname{norm}\left(x_{n}\right)\right)+f\left(\operatorname{norm}\left(x_{n+1}\right)\right) \approx x_{n}+2 f\left(\operatorname{norm}\left(x_{n}\right)\right)$
- A deep model trained this way behaves more like a wider model than a deeper one, which makes it relatively easy to train.
[[Post Norm]] ↔ $x_{n+1}=\operatorname{norm}\left(x_{n}+f\left(x_{n}\right)\right)$
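- The trunk variance stays constant, so every layer has a large effect on x. There is no identity path from the first layer to the last, so gradients are hard to control and convergence is harder, but a model that does train tends to perform better.
- Emphasizes the residual branch
- Training [[BERT]] requires {{c1 warmup}}
  - The expected gradient at the output layer is very large, which makes training unstable
  - Needed with both [[Adam]] and [[SGD]]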
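
The variance behaviour of the two update rules can be checked numerically. The following is a minimal sketch, not a real Transformer: it assumes NumPy, uses a layer norm without learned scale or shift, and stands in for f with one random linear map per layer (the helper names `layer_norm`, `make_layers`, `run_pre_norm`, and `run_post_norm` are hypothetical, chosen only for this illustration).

```python
import numpy as np

def layer_norm(x, eps=1e-6):
    """Normalize each row to zero mean and unit variance (no learned parameters)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def make_layers(depth, dim, rng):
    """Assumed stand-in for f: one random linear map per layer."""
    return [rng.normal(0.0, 1.0 / np.sqrt(dim), size=(dim, dim)) for _ in range(depth)]

def run_pre_norm(x, layers):
    """x_{n+1} = x_n + f(norm(x_n)); track trunk variance and relative update size."""
    variances, rel_updates = [], []
    for w in layers:
        update = layer_norm(x) @ w
        rel_updates.append(float(np.linalg.norm(update) / np.linalg.norm(x)))
        x = x + update
        variances.append(float(x.var()))
    return variances, rel_updates

def run_post_norm(x, layers):
    """x_{n+1} = norm(x_n + f(x_n)); track trunk variance."""
    variances = []
    for w in layers:
        x = layer_norm(x + x @ w)
        variances.append(float(x.var()))
    return variances

rng = np.random.default_rng(0)
dim, depth = 256, 32
x0 = rng.normal(size=(64, dim))
layers = make_layers(depth, dim, rng)

pre_var, pre_rel = run_pre_norm(x0, layers)
post_var = run_post_norm(x0, layers)

print("Pre Norm  trunk variance, first vs last layer:", pre_var[0], pre_var[-1])
print("Pre Norm  relative update, first vs last layer:", pre_rel[0], pre_rel[-1])
print("Post Norm trunk variance, first vs last layer:", post_var[0], post_var[-1])
```

Under these assumptions the Pre Norm trunk variance keeps growing with depth while the relative size of each layer's update shrinks, and the Post Norm trunk variance stays close to 1 at every layer. The sketch only covers forward-pass statistics. It says nothing about gradients or warmup.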

What pre and post specifically mean #card

[[DeepNet]]

Ref

Ryen Xiang
