RealFormer
Layer Normalization
On Layer Normalization in the Transformer Architecture
- Post-LN
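A minimal NumPy sketch (my own, not from the paper) of the Post-LN vs Pre-LN sublayer ordering that the paper above contrasts; `sublayer` is a hypothetical stand-in for attention/FFN:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each row to zero mean, unit variance
    return (x - x.mean(-1, keepdims=True)) / (x.std(-1, keepdims=True) + eps)

def sublayer(x):
    return 0.5 * x  # toy stand-in for an attention or FFN sublayer

def post_ln(x):
    # Post-LN (original Transformer): normalize AFTER the residual add
    return layer_norm(x + sublayer(x))

def pre_ln(x):
    # Pre-LN: normalize BEFORE the sublayer; the residual path stays identity
    return x + sublayer(layer_norm(x))
```

In Post-LN the normalization sits on the residual path; in Pre-LN the identity path is untouched, which is the ordering the paper argues trains more stably.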
{{embed [[2020/09/22]] sticking points
}}
RealFormer: moving the residual over to the Attention matrix - 科学空间|Scientific Spaces
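A rough sketch (my own, under my reading of RealFormer) of residual attention: the previous layer's raw attention scores are added to the current layer's logits before the softmax:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def residual_attention(q, k, v, prev_scores=None):
    # Standard scaled dot-product logits
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    if prev_scores is not None:
        # RealFormer-style residual: carried on the attention logits,
        # not on the hidden states
        scores = scores + prev_scores
    return softmax(scores) @ v, scores
```

Each layer returns its raw `scores` so the next layer can add them in, which is the "moving the residual onto the Attention matrix" idea in the title.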
Which Training Methods for GANs do actually Converge?
- Residuals accumulate at every step, so the variance grows large; change $$x+f(x)$$ into $$x+\alpha f(x)$$ with a small $$\alpha$$ to damp each layer's contribution
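A quick numerical check of the variance point above (my own toy setup, not from any of the papers): each residual branch is simulated as unit-variance noise, so after $$n$$ steps the plain sum has variance about $$1+n$$, while scaling by $$\alpha$$ shrinks each branch's contribution to $$\alpha^2$$:

```python
import numpy as np

rng = np.random.default_rng(0)
x_plain = rng.standard_normal(10000)
x_scaled = x_plain.copy()
alpha = 0.1  # assumed small residual scale

for _ in range(10):
    # stand-in for a sublayer output with unit variance
    x_plain = x_plain + rng.standard_normal(10000)
    # scaled residual x + alpha * f(x) keeps accumulated variance bounded
    x_scaled = x_scaled + alpha * rng.standard_normal(10000)

print(x_plain.var(), x_scaled.var())  # roughly 11 vs roughly 1.1
```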