OneRec Decoder

image.png

  • point-wise generation paradigm 点式生成范式
  • 解码器输入 #card
    • learnable beginning-of-sequence token with the video’s semantic identifiers
    • Sm={s[BOS],sm1,sm2,,smLt}\mathcal{S}_m=\left\{s_{[\mathrm{BOS}]}, s_m^1, s_m^2, \cdots, s_m^{L_t}\right\}
    • dm(0)=\mathbf{d}_m^{(0)}= Emb_lookup (Sm)\left(\mathcal{S}_m\right)
  • 通过 Ldec 层 Transformer layer 处理序列 #card
    image.png
  • Mixture of Experts (MoE) feed-forward network #card
    • top-k 路由策略
    • MoE(x)=j=1kGatej(x)Expertj(x)\operatorname{MoE}(\mathbf{x})=\sum_{j=1}^k \operatorname{Gate}_j(\mathbf{x}) \cdot \operatorname{Expert}_j(\mathbf{x}),
    • loss-free load balancing strategy
  • 训练目标 #card
    • NTP 交叉熵损失
    • cross-entropy loss for next-token prediction on the semantic identifiers of target video m
    • LNTP=j=1Lt1logP(smj+1[s[BOS],sm1,sm2,,smj])\mathcal{L}_{\mathrm{NTP}}=-\sum_{j=1}^{L_t-1} \log P\left(s_m^{j+1} \mid\left[s_{[\mathrm{BOS}]}, s_m^1, s_m^2, \cdots, s_m^j\right]\right)

网络回响

作者

Ryen Xiang

发布于

2026-02-17

更新于

2026-02-17

许可协议


评论