RNN

$$h_{t}=\tanh \left(W x_{t}+U h_{t-1}+b\right)$$
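As a minimal sketch of the recurrence above, the following assumes the scalar case (one hidden unit, so `w`, `u`, `b` are plain floats). The function name `rnn_forward` and the default `h0=0.0` are illustrative choices, not part of the original note.

```python
import math


def rnn_forward(xs, w, u, b, h0=0.0):
    """Run a scalar RNN h_t = tanh(w*x_t + u*h_{t-1} + b) and return every h_t."""
    h = h0
    states = []
    for x in xs:
        # Each new state depends on the current input and the previous state.
        h = math.tanh(w * x + u * h + b)
        states.append(h)
    return states


if __name__ == "__main__":
    print(rnn_forward([1.0, -2.0, 0.5], w=0.8, u=1.2, b=0.1))
```

Because of the tanh, every returned state lies strictly inside (-1, 1) for moderate inputs, whatever the values of `w`, `u`, and `b`.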

  • $$\frac{d h_{t}}{d \theta}=\frac{\partial h_{t}}{\partial h_{t-1}} \frac{d h_{t-1}}{d \theta}+\frac{\partial h_{t}}{\partial \theta}$$

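    • $$|\frac{\partial h_{t}}{\partial h_{t-1}}| > 1$$ leads to :<-> exploding gradients
      • How to fix it :-> gradient clipping
    • $$|\frac{\partial h_{t}}{\partial h_{t-1}}| < 1$$ leads to :<-> vanishing gradients
      $$\frac{\partial h_{t}}{\partial h_{t-1}}=\left(1-h_{t}^{2}\right) U$$ (scalar form; for vector states the factor is $$\operatorname{diag}\left(1-h_{t}^{2}\right) U$$)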
  • Read this together with the curve of $$h_{t}=\tanh \left(W x_{t}+U h_{t-1}+b\right)$$

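    • Why does the hidden state use [[Tanh]] rather than [[ReLU]] as its activation function?
      • Why Tanh :-> $$\frac{\partial h_{t}}{\partial h_{t-1}}$$ is bounded, which reduces the risk of exploding gradients.
      • Why not ReLU :-> its positive half has no upper bound
        • Initializing U near the identity matrix, combined with gradient clipping, can also give good results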
  • If U is very large, $$|h_{t}|$$ gets close to 1, so $$\frac{\partial h_{t}}{\partial h_{t-1}}$$ actually becomes small
    [[RNN/Backward]]
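The points above can be checked numerically. The sketch below assumes the scalar case, where the chain rule gives $$\frac{\partial h_{T}}{\partial h_{0}}$$ as the product of the per-step factors $$\left(1-h_{t}^{2}\right) U$$. The helper names `jacobian_product` and `clip_by_norm`, and the default input and initial state, are hypothetical choices made for illustration. The clipping shown is norm-based rescaling, which is one common variant of gradient clipping.

```python
import math


def jacobian_product(u, steps, x=0.0, w=1.0, b=0.0, h0=0.5):
    """Return d h_T / d h_0 for a scalar RNN h_t = tanh(w*x + u*h_{t-1} + b).

    The result is the product over time of the factors (1 - h_t**2) * u.
    """
    h = h0
    product = 1.0
    for _ in range(steps):
        h = math.tanh(w * x + u * h + b)
        product *= (1.0 - h * h) * u
    return product


def clip_by_norm(grads, max_norm):
    """Rescale a list of gradient values so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm or norm == 0.0:
        return list(grads)
    scale = max_norm / norm
    return [g * scale for g in grads]


if __name__ == "__main__":
    # Small u: every factor is below 1, so the product vanishes.
    print("u=0.5:", jacobian_product(0.5, steps=20))
    # Large u: tanh saturates, (1 - h_t**2) collapses, and the product still vanishes.
    print("u=5.0:", jacobian_product(5.0, steps=20))
    # u > 1 with the state stuck at 0 (linear region): the product grows as u**steps.
    print("u=1.5:", jacobian_product(1.5, steps=20, h0=0.0))
    # Clipping rescales an oversized gradient but keeps its direction.
    print(clip_by_norm([3.0, 4.0], max_norm=1.0))
```

In this toy setting the gradient explodes only while the state stays near the linear region of tanh. Once the state saturates, the factor $$1-h_{t}^{2}$$ dominates and the product shrinks even for large U.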

[[Ref]]

Author

Ryen Xiang

Published

2024-10-05

Updated

2024-10-05
