3.12 Model Dictionary Compression

QR hashing (Shi et al., 2019) offers a solution by decomposing large embedding matrices into smaller ones using quotient and remainder techniques while preserving a unique embedding for every ID. #card

  • Shrinks the embedding vocabulary and, together with a collision-resistant hash function such as MurmurHash, removes the need to maintain a vocabulary (a minimal sketch follows below)
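
A minimal sketch of the quotient-remainder idea, assuming a hashed ID space, two small embedding tables, and element-wise multiplication as the combiner (one of the compositions discussed by Shi et al.); all sizes below are illustrative.

```python
import numpy as np

# Minimal sketch of QR (quotient-remainder) embedding lookup, assuming a hashed
# ID space of size num_ids and two small tables of sizes ceil(num_ids/m) and m.
class QREmbedding:
    def __init__(self, num_ids, num_buckets, dim, seed=0):
        rng = np.random.default_rng(seed)
        self.m = num_buckets
        self.quotient_table = rng.normal(size=((num_ids + num_buckets - 1) // num_buckets, dim))
        self.remainder_table = rng.normal(size=(num_buckets, dim))

    def lookup(self, ids):
        ids = np.asarray(ids)
        q = ids // self.m          # quotient index
        r = ids % self.m           # remainder index
        # Combining the two sub-embeddings (element-wise product here) keeps the
        # composed embedding unique per ID with far fewer parameters.
        return self.quotient_table[q] * self.remainder_table[r]

emb = QREmbedding(num_ids=1_000_000, num_buckets=1000, dim=16)
print(emb.lookup([3, 1001, 999_999]).shape)  # (3, 16)
```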

Example of the non-static vocabulary hashing paradigm #card
image.png


FM

Automatic feature crossing; addresses feature sparsity (a sketch of the second-order term follows below).
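
A minimal numpy sketch of FM's second-order interaction term using the O(nk) reformulation; the shapes and toy data are illustrative.

```python
import numpy as np

# Minimal sketch of the FM second-order interaction term, assuming an
# embedding matrix V of shape (n_features, k) and an input value vector x.
def fm_second_order(x, V):
    # O(nk) trick: sum_{i<j} <v_i, v_j> x_i x_j
    #            = 0.5 * sum_f [ (sum_i v_{i,f} x_i)^2 - sum_i (v_{i,f} x_i)^2 ]
    xv = x[:, None] * V                       # (n, k)
    square_of_sum = xv.sum(axis=0) ** 2       # (k,)
    sum_of_square = (xv ** 2).sum(axis=0)     # (k,)
    return 0.5 * (square_of_sum - sum_of_square).sum()

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=10).astype(float)   # toy sparse binary features
V = rng.normal(scale=0.1, size=(10, 4))
print(fm_second_order(x, V))
```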

How FM compares with other models

  • Can emulate an [[SVM]] with a second-order polynomial kernel, MF, and SVD++

    • An [[SVM]] needs the kernel matrix for training and prediction, and computing it is O(N^2)

    • MF is less extensible than FM: it only covers the user and item feature groups

Comparison with [[SVM]]

  • In an SVM with a second-order polynomial kernel, the pairwise interaction weights w_ij are independent of each other

    • FM has O(nk) parameters versus the SVM's O(n^2), so it suits large-scale sparse features better and generalizes better
  • Kernel methods must compute the kernel matrix

  • Samples :-> FM training and prediction are independent of the individual training samples, while an SVM's prediction depends on its support vectors

  • FM is optimized in its primal form; a non-linear SVM is usually optimized in the dual form
    Do the interaction terms need to be multiplied by the feature values?

  • Giving each x_i, x_j pair its own independent weight (instead of factorizing it) generalizes poorly

How can index embeddings be incorporated into FM?

What are the differences between FM and SVM?

  • Feature perspective :-> in an SVM with a second-order polynomial kernel, the pairwise interaction weights w_ij are independent of each other
    , ((6302f9ee-11be-4f7a-9cf9-26400d6d4601))
    Why optimize FM with FTRL? #card

  • FTRL is an SGD-family algorithm; it can be tuned offline, which reduces online risk

  • For sparse features, per-coordinate adaptive learning rates work best (feature i gets its own learning rate at iteration t)

  • Different features learn at different speeds, and convergence is fast

[[Ref]]


@Embedding-based Retrieval in Facebook Search

image.png

  • Besides the main text features, user and document side information (location, social connections) is added to strengthen query-doc matching.
  • Training objective of the model #card
    • The distance between the two tower output vectors: positive pairs should be as close as possible (high similarity score) and negative pairs as far apart as possible (low similarity score); see the sketch after this card.

    • [[Triplet Loss]]
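
A minimal numpy sketch of a triplet objective for a two-tower model, assuming L2-normalized query/doc embeddings and an illustrative margin hyper-parameter.

```python
import numpy as np

# Minimal sketch of the triplet loss over (query, positive doc, negative doc) triples.
def triplet_loss(q, d_pos, d_neg, margin=0.1):
    pos_dist = np.linalg.norm(q - d_pos, axis=1)
    neg_dist = np.linalg.norm(q - d_neg, axis=1)
    # Push positives closer than negatives by at least `margin`.
    return np.maximum(0.0, pos_dist - neg_dist + margin).mean()

rng = np.random.default_rng(0)
q, d_pos, d_neg = (rng.normal(size=(8, 16)) for _ in range(3))
q, d_pos, d_neg = (v / np.linalg.norm(v, axis=1, keepdims=True) for v in (q, d_pos, d_neg))
print(triplet_loss(q, d_pos, d_neg))
```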

Sample construction for the baseline model is also simple: clicked query-doc pairs are the positives, and there are two options for negatives: #card

  • Random negative sampling: for each query, randomly sample a proportional number of negatives from the doc pool.

  • Impressed-but-unclicked samples: for each query, randomly take documents that were shown in the session but not clicked as negatives.

  • The paper's experiments show the former clearly beats the latter: the latter makes the training samples inconsistent with the distribution seen at prediction time, i.e., it introduces severe sample selection bias.

Issues with embedding-based retrieval

  • The cost of offline training over the candidate set and of online serving

  • The matching problem

[[新召回往往会存在后链路低估的问题,如何克服这个问题带来增量?]] #card

  • Feed the embeddings produced by retrieval into the ranking stage as features, either directly or as various query-doc embedding similarities; extensive experiments show cosine similarity works well.

  • To address the low precision of embedding retrieval, send the retrieved results for human labeling and retrain on the labeled data. This is brute-force and rather inefficient.

Ref


@LiRank: Industrial Large Scale Ranking Models at LinkedIn

Thoughts

  • Mostly engineering practice, touching essentially every corner of a search/ads/recommendation system. Without having hit the same problems it reads like a cipher; treat it as a handbook to consult.

Large Ranking Models

Training scalability

  • [[4.1 4D Model Parallelism]] removes the performance bottleneck of synchronizing embedding-table gradients during training

  • [[4.2 Avro Tensor Dataset Loader]] removes the I/O bottleneck during training

  • [[4.3 Offload Last-mile Transformation to Asynchronous Data Pipeline]] streamlines the training pipeline

  • [[4.4 Prefetch Dataset to GPU]] prefetches data to the GPU

Experiments

  • [[5.1. Incremental Learning]]

    • Incremental-learning results on two scenarios
  • [[5.2 Feed Ranking]]

    • A replay metric is used to evaluate the effect of the three strategies
  • [[5.3 Jobs Recommendations]]

  • 5.4 Ads CTR

    • Results #card
      • Baseline: GDMix model

image.png

6 Deployment Lessons

  • 6.1 Scaling up Feed Training Data Generation

    • Not entirely clear to me; it appears to be a performance optimization that makes training on 100% of sessions feasible
  • 6.2 Model Convergence

    • [[DCNv2]] does not converge at the start of training #card
      • Batch normalization is applied to the numerical input features; at the current number of training steps the model underfits, but increasing the steps slows down experimentation.

      • Adding warm-up steps stabilizes training and allows a 3x larger learning rate


A Survey of Transformers

[[PTM]] pre-trained model

In 2017 Google published "Attention is all you need", introducing the Transformer architecture, on which a large body of follow-up research and applications has been built. Variants that improve on the original Transformer are called "X-formers".

X-formers improve along three directions:

  • Model Efficiency

    • The compute and parameter (memory) cost of self-attention

      • sparse attention: lightweight attention schemes

      • divide-and-conquer methods

  • Model Generalization

    • The architecture is flexible and imposes few structural biases on the data

    • Training requires large amounts of data

    • structural bias or regularization, pre-training on large-scale unlabeled data

  • Model Adaptation

    • Applying the Transformer to concrete downstream tasks.

Background

Ways the model is used

  • Encoder-Decoder

  • Encoder only

    • classification or sequence labeling
  • Decoder only

    • sequence generation

      • language modeling

Improvements over the original Transformer fall into three categories: architecture modification, pre-training, and applications

  • architecture modification

    • Module Level

      • [[Attention]]

        • Challenges

          • Computational complexity grows with the sequence length

          • Structural prior: no structural prior, so it overfits easily on small datasets

        • Sparse Attention

          • Compute attention only between tokens i and j that are related, and store it as a sparse matrix

          • How to define "related"

            • position-based

              • 计算指定位置之间的 attention

              • atomic sparse attention

                • Global

                • Band

                • Dilated

              • 组合 atomic attention 得到更加复杂的attention计算规则

            • content-based

              • Routing Transformer

                • Efficient Content-Based Sparse Attention with RoutingTransformers

                • Clustering

              • [[Reformer]]

                • Uses LSH; attention is computed only among tokens in the same bucket

        • [[Linearized Attention]]

          • Computing attention from Q, K, V costs $$O(T^2D)$$; introducing a kernel function reduces this to $$O(TD)$$ (see the sketch after this block)

          • key components

            • kernel feature map

              • Performer approximates the attention function with other feature maps

                • FAVOR+ Fast Attention Via positive Orthogonal Random features approach

            • aggregation rule
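
A minimal numpy sketch of linearized attention: with a feature map phi, softmax(QK^T)V is approximated by phi(Q)(phi(K)^T V). The elu(x)+1 feature map used here follows "Transformers are RNNs" (Katharopoulos et al.) and is just one illustrative choice.

```python
import numpy as np

# Minimal sketch of linearized (kernelized) attention; phi is an example feature map.
def phi(x):
    return np.where(x > 0, x + 1.0, np.exp(x))   # elu(x) + 1, keeps features positive

def linear_attention(Q, K, V):
    Qp, Kp = phi(Q), phi(K)                      # (T, d)
    KV = Kp.T @ V                                # (d, d_v)  -- computed once, O(T * d * d_v)
    Z = Qp @ Kp.sum(axis=0)                      # (T,) normalizer
    return (Qp @ KV) / Z[:, None]

rng = np.random.default_rng(0)
T, d = 128, 16
Q, K, V = (rng.normal(size=(T, d)) for _ in range(3))
print(linear_attention(Q, K, V).shape)  # (128, 16)
```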

        • Query Prototyping and Memory Compression

        • Low-rank Self-Attention

          • attention 矩阵线性相关的行数 A 远小于输入 T

          • Low-rank Parameterization

          • Low-rank Approximation

        • Attention with Prior

          • Attention is split into a generated part and a prior part; the methods below are different ways of producing the prior

          • Prior that Models locality

            • Data such as text is position-sensitive; the prior is computed from the positions of i and j combined with a [[Normal Distribution]]

          • Prior from Lower Modules

            • Reuse the attention distributions of earlier layers

            • Realformer adds a [[ResNet]]-style residual connection on the attention matrix

            • Lazyformer computes the attention matrix only once every two layers

          • Prior as Multi-task Adapters

            • 多任务适配器,看起来是在共享参数

          • Attention with Only Prior

            • 只使用先验
        • Improved Multi-Head Mechanism

          • Head Behavior Modeling

          • Multi-head with Restricted Spans

            • Observation: in the original model some heads attend locally and some globally

            • Restrict the attention span (implemented through distance)

            • The masked multi-head attention in the decoder follows the same idea

          • Multi-head with Refined Aggregation

            • 多头的结果如何合并

            • routing methods

          • Other Modifications

            • Shazeer multi-query attention 所有头之间共享 kv

            • Bhojanapalli 灵活设置 head size

      • OTHER MODULE-LEVEL MODIFICATIONS

        • Position Representations

          • The Transformer is permutation-invariant, so extra position information is needed

          • Absolute Position Representations

            • Sinusoidal encoding (see the sketch below)

            • Learned position embeddings
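
A minimal numpy sketch of the sinusoidal absolute position encoding from the original Transformer, assuming an even model dimension; T and d_model below are illustrative.

```python
import numpy as np

# Minimal sketch of sinusoidal position encoding: sin on even dims, cos on odd dims.
def sinusoidal_pe(T, d_model):
    pos = np.arange(T)[:, None]                       # (T, 1)
    i = np.arange(d_model // 2)[None, :]              # (1, d/2)
    angle = pos / np.power(10000.0, 2 * i / d_model)
    pe = np.zeros((T, d_model))
    pe[:, 0::2] = np.sin(angle)
    pe[:, 1::2] = np.cos(angle)
    return pe

print(sinusoidal_pe(T=50, d_model=8)[:2])
```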

          • Relative Position Representations.

            • token 之间的关系更加重要

            • 将 embedding 加到 key 的attention中

            • Transformer-XL

          • Other Representations

            • TUPE

              • 混合相对和绝对位置
            • Roformer

              • 旋转位置编码

              • 线性 attention 中实现相对位置编码

            • Position Representations without Explicit Encoding 不要编码

              • R-Transformer 先过 RNN 再将输出结果输入到多头

              • CPE 使用卷积

            • Position Representation on Transformer Decoders

              • 移除 pe
        • LayerNorm

          • Placement of Layer Normalization

            • post-LN

            • pre-LN keeps the skip-connection path free of other operations

          • Substitutes of Layer Normalization

            • 可学习参数效果不好,

            • AdaNorm

            • scaled l2 normalization

            • PowerNorm

          • Normalization-free Transformer

            • ReZero replaces LN with a learnable residual scaling
        • FFN

          • Activation Function in FFN

            • [[Swish]]

            • [[GPT]] [[GELU]]

          • Adapting FFN for Larger Capacity

            • product-key memory layers

            • MoE (a routing sketch follows below)

              • Route to the top-k experts

              • Route to the single highest-scoring expert

              • Group the experts and take the top-1 of each group
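
A minimal numpy sketch of top-k expert routing in an MoE FFN layer; the router, experts, and sizes are all illustrative stand-ins.

```python
import numpy as np

# Minimal sketch of top-k routing: gate over the k highest-scoring experts only.
def moe_forward(x, router_w, experts, k=2):
    logits = x @ router_w                              # (n_experts,)
    top = np.argsort(logits)[::-1][:k]                 # indices of the top-k experts
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                               # softmax over the selected experts
    return sum(g * experts[i](x) for g, i in zip(gates, top))

rng = np.random.default_rng(0)
d, n_experts = 16, 4
experts = [lambda x, W=rng.normal(size=(d, d)) * 0.1: np.tanh(x @ W) for _ in range(n_experts)]
router_w = rng.normal(size=(d, n_experts))
print(moe_forward(rng.normal(size=d), router_w, experts).shape)  # (16,)
```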

          • Dropping FFN Layers

            • Simplify the network

    • Arch. Level

      • Adapting Transformer to Be Lightweight

        • Lite Transformer

        • Funnel Transformer

          • hidden sequence pooling and up-sampling
        • DeLighT

          • DeLighT block
      • Strengthening Cross-Block Connectivity

        • 针对 decoder 解决问题

        • Transparent Attention

        • Feedback Transformer

          • 使用前一步所有层的信息
      • [[Adaptive Computation Time]]

        • 解决之前模型中层数固定

        • 三种方法

          • [[Universal Transformers]]dynamic halting

            • 达到停止条件的 token 不再改变
          • CCT

            • 跳层
      • Transformers with Divide-and-Conquer Strategies

        • Split the long text of an LM task into multiple segments

        • Recurrent Transformers: the output of the previous segment is fed into the next segment's input

          • Transformer-XL concatenates the previous segment's output with the next segment's input

        • Hierarchical Transformers aggregate multiple results

          • Hierarchical for long sequence inputs

            • sentence Transformer and document Transformer
          • Hierarchical for richer representations

            • Character-level and word-level representations
      • Exploring Alternative Architecture

        • NAS
    • PRE-TRAINED TRANSFORMERS

    • APPLICATIONS OF TRANSFORMER

      • CV

        • [[ViT]]
    • CONCLUSION AND FUTURE DIRECTIONS

      • Theoretical analysis

      • Better mechanisms for global interaction

      • A unified framework for handling multiple kinds of data

[[Layer Normalization]]


BERT

[[@BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding]]

代码:google-research/bert: TensorFlow code and pre-trained models for BERT

A large pre-trained model plus fine-tuning improves performance on small downstream tasks

Input layer

  • Token embedding, position embedding, and segment embedding

    • The pre-training tasks include judging the relation between segment A and segment B
  • Model structure: 12 layers, each with 12 attention heads

  • CLS at the start of the sentence; its final output embedding represents the whole sentence

    • A symbol with no semantics of its own blends the meanings of all tokens more evenly, so it represents the whole sentence better
  • SEP separates sentences

BERT

  • L=12 H=768 A=12, Total Parameters=110M

  • L=24 H=1024 A=16, Total Parameters=340M

Two kinds of NLP pre-training

    1. Produce an artifact, e.g., word2vec embeddings
    1. Serve as a backbone onto which new structures are attached

[[ELMo]]

  • Uses an LSTM to predict the next word

[[GPT]]

  • Transformer

  • Unidirectional


Contributions

  • The importance of bidirectional information

Model input:

    1. Token embedding
    1. Segment embedding (A/B), for QA or two-sentence tasks
    1. Position embedding

Training tasks

  • [[Masked-Language Modeling]] :-> mask some of the tokens: 80% become [MASK], 10% a wrong (random) token, 10% the original token (a sketch follows after this card)

    • Purpose :-> train the model to learn the relations among the words of a sentence.
      • Mitigates the mismatch between the pre-training and fine-tuning objectives
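
A minimal Python sketch of BERT's 80/10/10 masking rule; `mask_id`, `vocab_size`, and the toy token ids are illustrative assumptions.

```python
import random

# Minimal sketch of BERT's masking: 80% [MASK], 10% random token, 10% unchanged.
def mask_tokens(tokens, mask_id, vocab_size, mask_prob=0.15, seed=0):
    rng = random.Random(seed)
    inputs, labels = list(tokens), [-100] * len(tokens)   # -100 marks ignored positions
    for i, tok in enumerate(tokens):
        if rng.random() >= mask_prob:
            continue
        labels[i] = tok                      # the model must predict the original token
        r = rng.random()
        if r < 0.8:
            inputs[i] = mask_id              # 80%: replace with [MASK]
        elif r < 0.9:
            inputs[i] = rng.randrange(vocab_size)   # 10%: random (possibly wrong) token
        # remaining 10%: keep the original token unchanged
    return inputs, labels

print(mask_tokens([5, 17, 42, 8, 99], mask_id=103, vocab_size=30522))
```
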
  • [[Next Sentence Prediction]] :-> predict whether sentence B is the actual next sentence

    • Sentence A and sentence B are true neighbors with 50% probability

    • Which downstream problems does it serve :-> QA and natural language inference
      image.png
      [[激活函数]] [[GELU]]

  • Same as [[GPT]]; why?

Optimizer

  • A stripped-down version of Adam

  • Fine-tuning can be unstable with it; switch to the standard Adam

Fine-tuning

  • Adjust the input format and add a prediction head according to the task, then train on task data

  • Fine-tuning works better than plugging BERT into another model as fixed features

    1. Sentence-pair classification
    1. Single-sentence classification
    • A softmax on top of CLS
    1. Predict a start and an end embedding, compute a softmax against every token representation T, and take the highest-probability positions as the answer span
    1. Entity (token-level) tagging

Studies the effect of taking embeddings from different layers

Limitations

  • Not well suited to generation tasks (machine translation, text summarization)

[[Ref]]


DART

Main idea :-> each newly added tree fits the negative gradient of an ensemble formed from a random subset of the existing trees, not of all previous trees.

  • Addresses GBDT's over-specialization problem

    • Symptom :-> trees from early iterations contribute most of the prediction, while later trees focus on the residuals of a small subset of samples
    • Conventional remedy :-> Shrinkage
      Algorithm flowchart
      image.png
  • S1 :-> the training dataset

  • T1 :-> the decision tree trained on S1

  • For trees 2 to N #card #incremental (a sketch of one round follows after this card)

    • Randomly draw a subset of trees D from M; $$\hat{M}$$ is the set difference of M and D

    • Use $$\hat{M}$$ to compute the samples' negative gradients, giving dataset St

    • Train Tt on St

    • Rescale the weight of Tt

      • The negative gradients were computed from $$\hat{M}$$ alone, but this residual is actually fitted jointly by Tt and the trees in D, so T_t is shrunk by a factor of |D|+1
    • Rescale the weights of the other trees in D
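
A minimal sketch of one DART boosting round under squared loss, assuming `ensemble` is a list of (predict_fn, weight) pairs and `fit_tree(X, residual)` is a hypothetical helper that returns a fitted regression tree's predict function.

```python
import random

# Minimal sketch of one DART round: drop trees, fit the residual of the kept ones,
# then renormalize so the new tree and the dropped trees share the residual.
def dart_round(ensemble, X, y, fit_tree, drop_rate=0.1, rng=random.Random(0)):
    dropped = [i for i in range(len(ensemble)) if rng.random() < drop_rate]
    kept = [i for i in range(len(ensemble)) if i not in dropped]
    # Negative gradient (residual under squared loss) of the kept trees only.
    pred_kept = sum(ensemble[i][1] * ensemble[i][0](X) for i in kept)
    new_tree = fit_tree(X, y - pred_kept)
    k = len(dropped)
    # The new tree and the k dropped trees jointly fit this residual, so the new
    # tree is shrunk by 1/(k+1) and each dropped tree is rescaled by k/(k+1).
    ensemble.append((new_tree, 1.0 / (k + 1)))
    for i in dropped:
        fn, w = ensemble[i]
        ensemble[i] = (fn, w * k / (k + 1))
    return ensemble
```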

[[lightgbm 使用记录]] Early stopping is not available in dart mode


DCN

Built on Wide & Deep, DCN improves the Wide part. LR cannot cross features, FM is usually limited to second-order crosses for performance reasons, while the Cross network achieves high-order crosses. Compared with a plain DNN, DCN needs fewer parameters for the same quality.

The Cross network only handles fixed-length inputs, so it cannot be used in [[ETA]] v4…

Feature handling is the usual one: sparse features go through embeddings and are concatenated with the dense features. Every layer of the Cross network has the same width as the input vector, so leaving sparse features unembedded would blow up the input dimension and hence the parameter count.

The Cross and Deep outputs are concatenated and passed through an LR layer for the final output.

Cross Network

  • Each layer crosses x_0 with the previous layer's output x_l and learns the residual (see the sketch after this list).

  • The ResNet-style connection allows the network to go deeper.

  • Properties: bounded-degree high-order crosses, automatic crossing, shared parameters

  • $$\mathbf{x}_{l+1}=\mathbf{x}_{0} \mathbf{x}_{l}^{T} \mathbf{w}_{l}+\mathbf{b}_{l}+\mathbf{x}_{l}=f\left(\mathbf{x}_{l}, \mathbf{w}_{l}, \mathbf{b}_{l}\right)+\mathbf{x}_{l}$$

  • The figure shows that the number of parameters w grows only linearly with depth.
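
A minimal numpy sketch of the cross layer above; the dimensions and stacked depth are illustrative.

```python
import numpy as np

# Minimal sketch of one DCN cross layer: x_{l+1} = x0 * (x_l^T w_l) + b_l + x_l.
def cross_layer(x0, xl, w, b):
    return x0 * (xl @ w) + b + xl   # (x_l^T w) is a scalar, so the output stays d-dimensional

rng = np.random.default_rng(0)
d = 6
x0 = rng.normal(size=d)
x, params = x0, [(rng.normal(size=d) * 0.1, np.zeros(d)) for _ in range(3)]
for w, b in params:                 # stack three cross layers
    x = cross_layer(x0, x, w, b)
print(x.shape)  # (6,)
```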

Deep Network

  • The Cross network alone has too few parameters to learn high-dimensional non-linear features.

Analysis

  • The Cross design ends up containing every feature combination from first order up to high order. Unlike FM, the parameters of each combination are partially shared, which lowers the parameter count and gives better generalization and robustness than FM: FM can handle the case where x_i and x_j never co-occur, while Cross can even handle the case where x_i and x_j never appear at all…


MoCo

Summary

  • Closes the gap between supervised and unsupervised learning in computer vision

Abstract

  • dictionary look-up

    • Proposes queue-based momentum contrast for unsupervised representation learning.

Introduction

  • Key idea: represent the dictionary as a queue

    • What kind of dictionary suits contrastive learning?

      • (i) large

        • More samples from the continuous, high-dimensional space; the more keys, the richer the information represented

        • A small dictionary with few keys weakens generalization

      • (ii) consistent

        • The keys in the dictionary should be produced by the same or a similar encoder

        • If the keys come from different encoders, a query may retrieve keys produced by an encoder similar to its own rather than keys that are semantically similar

  • Why unsupervised learning had not succeeded in CV

    • The raw signal spaces are different

    • In NLP the raw signal is discrete (words, roots, affixes), so it is easy to build tokenized dictionaries for unsupervised learning

      • tokenized: mapping a word to a feature

      • Why do tokenized dictionaries help unsupervised learning?

      • Each dictionary key can be treated as a class, which provides label-like information to learn from

      • Unsupervised learning in NLP is easy to formulate, and the resulting models are easy to optimize

    • In CV the raw signal is continuous and high-dimensional; unlike words it carries no condensed, compact semantics, so it does not lend itself to building a dictionary

    • Without a dictionary, unsupervised learning is hard to formulate

  • The two main components of unsupervised learning:

    • pretext tasks

      • Learn better feature representations

      • Common pretext tasks

        • denoising auto-encoders: reconstruct the whole image

        • context auto-encoders: reconstruct a patch

        • cross-channel auto-encoders (colorization): colorization serves as the self-supervised signal

        • pseudo-labels: generate pseudo-labels for images

        • exemplar image: different augmentations of the same image all belong to the same class.

        • patch ordering: the 3x3-grid approach: shuffle the patches and predict their order, or pick a random patch and predict which of the eight positions it came from

        • tracking that exploits the temporal order of video frames

        • clustering features

    • loss functions

      • Measure the gap between the model's prediction and a fixed target

      • L1 or L2 Loss

        • Auto-encoder
      • Discriminative networks

        • eight positions

          • Split the image into nine patches and predict in which direction a chosen patch lies relative to the center patch.
      • Contrastive losses: the target is not fixed and keeps changing during training

      • Adversarial losses: measure the difference between two probability distributions

Method

  • Pretext task: instance discrimination

+ An image $$x_{i}$$ is transformed (flipping, cropping, etc.) into $$x_{i1}$$ and $$x_{i2}$$; these two form a positive pair, and all other images are negatives.

+ $$x_{i1}$$ is the anchor

+ $$x_{i2}$$ is the positive

+ The encoders E11 and E12 may or may not share weights.

+ How is the contrast done?

  + Pull f11 and f12 together and push them away from the other samples

    + matching key and dissimilar to others

    + Learning is formulated as minimizing a contrastive loss

  + Treat f11 as the query and look up the closest keys in the dictionary
  • How to build a large and consistent dictionary

    • A queue-based dictionary

      • Decouples the dictionary from the batch size

      • The dictionary size is bounded by the queue length

    • A momentum-updated encoder

      • $$\theta_{\mathrm{k}} \leftarrow m \theta_{\mathrm{k}}+(1-m) \theta_{\mathrm{q}}$$

      • The momentum encoder is initialized from the current query encoder

      • With a large momentum (m = 0.999) the momentum encoder updates slowly, keeping the keys in the queue produced by similar encoders as far as possible

  • Comparison with earlier approaches

    • end-to-end: sacrifices dictionary size

      • The number of negatives equals the batch size
    • memory bank: sacrifices consistency

      • Negatives are sampled from the bank
    • MoCo

      • The query encoder is updated by gradients

      • The momentum encoder is updated from the query encoder with momentum

  • Objective: [[InfoNCE]] (a minimal sketch follows below)

    • $$\mathcal{L}_{q}=-\log \frac{\exp \left(q \cdot k_{+} / \tau\right)}{\sum_{i=0}^{K} \exp \left(q \cdot k_{i} / \tau\right)}$$

    • The dot product between q and k measures how similar the two samples are

    • $$\tau$$ controls the shape of the distribution; the smaller it is, the more the loss concentrates on hard negatives
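
A minimal numpy sketch of the InfoNCE loss and the momentum update, assuming L2-normalized query, positive key, and a queue of negative keys; sizes are illustrative.

```python
import numpy as np

# Minimal sketch of InfoNCE (cross-entropy with the positive as class 0) and the
# momentum update that keeps the key encoder consistent.
def info_nce(q, k_pos, queue, tau=0.07):
    logits = np.concatenate([[q @ k_pos], queue @ q]) / tau   # positive first, then K negatives
    return -logits[0] + np.log(np.exp(logits).sum())

def momentum_update(theta_k, theta_q, m=0.999):
    return m * theta_k + (1 - m) * theta_q                    # slow update of the key encoder

rng = np.random.default_rng(0)
normalize = lambda v: v / np.linalg.norm(v, axis=-1, keepdims=True)
q, k_pos = normalize(rng.normal(size=128)), normalize(rng.normal(size=128))
queue = normalize(rng.normal(size=(4096, 128)))
print(info_nce(q, k_pos, queue))
```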

Experiments

  • 7 detection and segmentation tasks

  • linear protocol

    • Use the pre-trained model and retrain only the final fully connected layer.

    • The backbone acts as a feature extractor

    • Compares how good the pre-trained representations are

Conclusion

  • With 1000x more data, MoCo's gains are modest

  • Try other NLP pretext tasks such as masked auto-encoding

    • [[Masked-Language Modeling]]

    • See [[Masked Autoencoders Are Scalable Vision Learners]]

  • Summary

    • Build a large dictionary so that positives and negatives can be contrasted effectively, providing a stable self-supervised signal with which the model is finally trained

NFM

$$\hat{y}_{N F M}(\mathbf{x})$$ :-> $$w_{0}+\sum_{i=1}^{n} w_{i} x_{i}+f(\mathbf{x})$$

  • The first two terms are a linear regression

  • The third term introduces a neural network to learn :-> high-order interactions in the data

    • Network input :-> the second-order feature-interaction result of an FM model
    • Compared with using a high-order FM directly :-> it lowers training complexity and speeds up training.
      The neural-network part of NFM has four layers: Embedding Layer, Bi-Interaction Layer, Hidden Layers, and Prediction Score.


tags:: #[[Model Architecture]]

The Embedding Layer embeds the sparse input. The most common embedding operation is a lookup into a weight table; the authors stress that in this step they also multiply each value of the input feature vector with its embedding vector.

The Bi-Interaction Layer is the paper's novelty: it takes the element-wise product of every pair of embedded features and sums the results into a single k-dimensional (embedding-size) vector. This amounts to second-order feature crossing and, as in FM, the formula can be simplified (see the sketch after the formula):

  • $$f_{B I}\left(\mathcal{V}_{x}\right)=\sum_{i=1}^{n} \sum_{j=i+1}^{n} x_{i} \mathbf{v}_{i} \odot x_{j} \mathbf{v}_{j} =\frac{1}{2}\left[\left(\sum_{i=1}^{n} x_{i} \mathbf{v}_{i}\right)^{2}-\sum_{i=1}^{n}\left(x_{i} \mathbf{v}_{i}\right)^{2}\right]$$
    The Hidden Layers then learn high-order feature interactions with a regular DNN
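
A minimal numpy sketch of Bi-Interaction pooling, assuming an embedding matrix V of shape (n_features, k); unlike the FM scalar output, the result here is a k-dimensional vector fed to the hidden layers.

```python
import numpy as np

# Minimal sketch of the Bi-Interaction pooling layer.
def bi_interaction(x, V):
    xv = x[:, None] * V                                          # (n, k)
    return 0.5 * (xv.sum(axis=0) ** 2 - (xv ** 2).sum(axis=0))   # (k,) vector, not a scalar

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=10).astype(float)
V = rng.normal(scale=0.1, size=(10, 4))
print(bi_interaction(x, V))   # k-dimensional output for the hidden layers
```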

The Prediction Layer outputs the final result:

Experimental results: ![](https://media.xiang578.com/15643059963915.jpg) tags:: #HOFM


ResNet

Simply stacking more convolutional layers does not make the model perform better.

vanishing/exploding gradients

Layers close to the input suffer from vanishing gradients and are hard to train; shortcut connections link them to layers near the output.
Use Residual Blocks

Deep Residual Learning for Image Recognition

  • Learning the residual mapping is easier than learning the original, unreferenced mapping

  • Identity mappings give the model shortcuts; if a block's input and output sizes differ, a weight matrix W can project between them

  • The second non-linearity is applied after the addition

  • bottleneck architectures

  • Addresses the growth in parameter count as the network gets deeper: the bottleneck keeps roughly the same parameter count as the plain block on the left while the block becomes 3 layers deep

[[Identity Mappings in Deep Residual Networks]]

[[Residual Networks Behave Like Ensembles of Relatively Shallow Network]]

[[ResNet/Question]]

[[Ref]]


TCN

  • In a TCN the input and output may have different widths; figure (c) shows a 1x1 convolution used to adjust the input size

    • Alternatively, zero padding can be used to add channels

TCN = 1D FCN + causal convolutions

Properties

  • Causal convolutions: no future information leaks in.

    • The paper compares against RNN-style methods, so causality has to be respected.
  • Can take a sequence of arbitrary length and map it to an output sequence of the same length.

  • Combining [[ResNet]]-style residuals with dilated convolutions lets the network go deep and enlarges the receptive field.

Details

  • A TCN has no pooling layers

  • The normalization used is weight norm, which suits sequence problems better

Ways to enlarge the receptive field

  • A larger kernel_size (more parameters; large kernels perform worse, and an overly large kernel degenerates into a fully connected layer)

  • [[空洞卷积]]

Requirements of the sequence setting

    1. The input and output have the same length
    1. No information from future time steps may be used: causal convolutions

[[ETA 模型]] implementation (a causal-convolution sketch follows after the conv1d signature below)

  • tf.nn.conv1d(input, filters, stride, padding, data_format='NWC', dilations=None, name=None)
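
A minimal TensorFlow sketch of a causal dilated 1-D convolution built on the tf.nn.conv1d signature above; the shapes, kernel size, and dilation are illustrative.

```python
import tensorflow as tf

# Minimal sketch of a causal dilated conv (the TCN building block) for input of
# shape (batch, time, channels): pad only on the left so step t never sees t+1.
def causal_conv1d(x, filters, kernel_size=3, dilation=2):
    in_ch = x.shape[-1]
    w = tf.random.normal([kernel_size, in_ch, filters])
    pad = (kernel_size - 1) * dilation
    x = tf.pad(x, [[0, 0], [pad, 0], [0, 0]])
    return tf.nn.conv1d(x, w, stride=1, padding="VALID", dilations=dilation)

x = tf.random.normal([2, 16, 4])
print(causal_conv1d(x, filters=8).shape)  # (2, 16, 8): same time length as the input
```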

@A Consumer Compensation System in Ride-hailing Service

[[Attachments]]

A subsidy system for designated-driver and freight services

  • 价格弹性建模 a transfer learning enhanced uplift modeling is designed to measure the elasticity
    ls-type:: annotation
    hl-page:: 1
    hl-color:: yellow

  • 预算分配 a model predictive control based optimization is formulated to control the budget accurately
    ls-type:: annotation
    hl-page:: 1
    hl-color:: yellow

System goal: maximize platform revenue through subsidies, within a budget.

  • Given a total compensation budget or an average compensation rate, find an optimal policy to subsidize queries so that the overall revenue is maximized.
    ls-type:: annotation
    hl-page:: 2
    hl-color:: yellow

难点

  • 如何用历史数据建模用户弹性 Consumer elasticity
    ls-type:: annotation
    hl-page:: 2
    hl-color:: yellow

  • 个保法下公平原则(不同用户相同 odt 补贴相同) Consumer fairness
    ls-type:: annotation
    hl-page:: 2
    hl-color:: yellow

  • 如何建模线上随机的发单请求 Randomness in queries:
    ls-type:: annotation
    hl-page:: 2
    hl-color:: yellow

Transfer Learning Enhanced Uplift Modeling
ls-type:: annotation
hl-page:: 2
hl-color:: yellow

  • Training an uplift model the usual way needs a large amount of response data collected under random subsidies (expensive). The method here instead uses abundant online observational data (biased, since it is shaped by the serving policy) plus a small amount of randomized-subsidy data.

  • DNN + GBDT: handles the tabular input space and transfer learning
    ls-type:: annotation
    hl-page:: 2
    hl-color:: yellow

    • Over 90% of the features are dense numerical features, which call for GBDT modeling, but GBDT is hard to fine-tune on new data and handles sparse features poorly.

    • Train an S-learner model

      • [:span]
        ls-type:: annotation
        hl-page:: 3
        hl-color:: yellow

      • Two XGB models are trained on the observational data and the RCT data respectively, with a binary target (whether the user places an order).

      • Each sample is passed through both XGB models to get its leaf indices, which go through embedding layers; the two embeddings are concatenated and fed into shared inner layers (see the sketch after this list).

        • The whole network is first pre-trained with observational data: Massive observational data is first fed into both inputs to pre-train the model
          ls-type:: annotation
          hl-page:: 3
          hl-color:: red

        • The RCT data then fine-tunes the model through a separate output layer: RCT data is used to fine-tune using a different output layer
          ls-type:: annotation
          hl-page:: 3
          hl-color:: red

        • Early stopping is used during fine-tuning
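
A minimal sketch of the GBDT-distiller idea, using sklearn's GradientBoostingClassifier as a stand-in for one of the XGB models and a random embedding table; everything here (tree count, table layout, shapes) is an illustrative assumption, not the paper's exact setup.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

# Minimal sketch: map each sample to its leaf index in every tree, then look up
# per-leaf embeddings that feed the shared inner (DNN) layers.
rng = np.random.default_rng(0)
X, y = rng.normal(size=(500, 10)), rng.integers(0, 2, size=500)

gbdt = GradientBoostingClassifier(n_estimators=35, max_depth=3).fit(X, y)
leaves = gbdt.apply(X)[:, :, 0].astype(int)           # (n_samples, n_trees) leaf ids

emb_dim = 8
embedding = rng.normal(size=(leaves.max() + 1, emb_dim))   # one table shared over trees, for brevity
leaf_emb = embedding[leaves].reshape(len(X), -1)            # concatenated per-tree embeddings
print(leaf_emb.shape)                                       # input to the shared inner layers
```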

Optimization Formulation
ls-type:: annotation
hl-page:: 3
hl-color:: yellow

  • Orders are clustered into OD (origin-destination) cells

    • The mean historical per-query elasticity within a cell serves as the cell's elasticity: use the mean of the historical query-wise elasticity to forecast the class-wise elasticity
      ls-type:: annotation
      hl-page:: 3
      hl-color:: yellow
  • model predictive control (MPC) technology
    ls-type:: annotation
    hl-page:: 3
    hl-color:: yellow
    建模成最优化问题求解分配方案

线上系统:离线生成补贴词典供线上使用

  • [:span]
    ls-type:: annotation
    hl-page:: 3
    hl-color:: yellow

离线实验

  • Uplift 模型

    • 特征

      • The features include query information (e.g., the origin, destination, time, weekday, and distance), spatial features (e.g., point of interest information, and order statistics in the same cells), subsidy information, and trading features (e.g., historical order placement rate). I
        ls-type:: annotation
        hl-page:: 4
        hl-color:: yellow
    • 模型细节

      • ob data xgb,35 棵树,1120 个叶子节点

      • rct xgb,51 棵树,1314 叶子节点

      • embedding size 8

      • The size of the common inner layers and output layer is set to 128, 64, and 32
        ls-type:: annotation
        hl-page:: 4
        hl-color:: green

    • 结果分析

      • T-XGB+DNN AUUC 效果比 S-XGB+DNN 效果好,说明需要两棵树去提取特征?

        • S-XGB+DNN:a single GBDT distiller DNN

        • T-XGB+DNN:two-distiller GBDT distiller DNN

      • [:span]
        ls-type:: annotation
        hl-page:: 4
        hl-color:: yellow

  • 优化结果评估

    • 假设 uplift 模型结果是真值,评估不同分配策略的影响。

    • No Cluster Oracle
      ls-type:: annotation
      hl-page:: 4
      hl-color:: yellow
      不对订单聚类,考虑用户特征。

    • Open Loop 用前 14 天数据预测后 7 天

    • 新系统补贴率低但是更高利润 Compared with the baseline, our system obtains a lower subsidy rate and higher revenue, for its accurate compensation, to achieve a higher ROI.
      ls-type:: annotation
      hl-page:: 4
      hl-color:: yellow

image.png

Open questions

  • Why not build the uplift model the conventional way (treatment group + untreated control group)?

  • How exactly are T-XGB and S-XGB trained?

  • Why does the RCT model use more trees than the observational one? From a sample-size perspective the observational model has more data

  • No pure-XGB uplift baseline is reported


@DuETA: Traffic Congestion Propagation Pattern Modeling via Efficient Graph Learning for ETA Prediction at Baidu Maps

[[Attachments]]

Core contributions

  • Novelty: a route-aware graph transformer captures long-distance correlations on a congestion-sensitive graph and models congestion propagation patterns

    • The design of DuETA is driven by the novel ideas that directly capture the long-distance correlations through a congestion-sensitive graph, and that model traffic congestion propagation patterns via a route-aware graph transformer.
      ls-type:: annotation
      hl-page:: 2
      hl-color:: yellow

    • 捕捉任意两个( 距离很远,但是在路况状态上很相关的)segment 之间的交互

      • These designs enable DuETA to capture the interactions between any two road segment pairs that are spatially distant but highly correlated with traffic conditions.
        ls-type:: annotation
        hl-page:: 2
        hl-color:: yellow
  • Learning traffic-congestion propagation patterns effectively improves ETA prediction

    • traffic congestion propagation patterns
      ls-type:: annotation
      hl-page:: 2
      hl-color:: yellow

Core problem

  • Business motivation

    • 预测的未来路况状态和真实状态不一致会导致 ETA 误差传播 we observed that a propagation of ETA errors arises from the sharp inconsistency between the predicted traffic condition in the future and ground truth.
      ls-type:: annotation
      hl-page:: 2
      hl-color:: green

    • 建模 traffic congestion propagation patter

      • Traffic congestion propagation pattern modeling is challenging, and it requires accounting for impact regions over time and cumulative effect of delay variations over time caused by traffic events on the road network.
        ls-type:: annotation
        hl-page:: 1
        hl-color:: red

      • 当前交通拥堵路段会影响路网上相邻道路的通行能力 As illustrated in it, the impact regions and cumulative delays over time caused by traffic congestion (the road segments in red) would inevitably affect all the interdependent segments on the road network.
        ls-type:: annotation
        hl-page:: 2
        hl-color:: green

        • When the user requests the ETA only the 3-hop segment is congested, but because congestion propagates, by the time the user reaches the target the 2-hop segment is congested and part of the 1-hop segment is slowing down

        • 之前使用 [[STGNN]] 类方法建模直接相邻的路段 existing studies have applied spatial-temporal graph neural networks (STGNNs)[7 , 8 , 21 , 34, 35 , 38 ] to model traffic conditions
          ls-type:: annotation
          hl-page:: 2
          hl-color:: blue
          存在两个问题

          • 没有直接建模路网上不相邻 segment 的远距离相关性,网络传播过程中会有信息损失 The long-distance correlations of indirectly connected road segments are not explicitly modeled, which inevitably suffer from information loss during the multi-step message passing.
            ls-type:: annotation
            hl-page:: 2
            hl-color:: blue

          • Because of the computational complexity of STGNN methods, only a few message-passing steps are used most of the time, so the traffic-condition features of two distant segments are not transmitted well: Traffic conditions are not sufficiently transmitted between two road segments that are spatially distant, because they typically execute only a few steps of message passing (one step in most cases), due to the computational complexity of STGNNs.
            ls-type:: annotation
            hl-page:: 2
            hl-color:: blue

        • [:span]
          ls-type:: annotation
          hl-page:: 2
          hl-color:: green

  • Challenges

    • ETA 任务需要建模 contextual and predictive factors, such as spatial-temporal interaction, driving behavior, and traffic congestion propagation inference
      ls-type:: annotation
      hl-page:: 1
      hl-color:: green

    • 路网中新 segment 和 未知区域 we plan to investigate the transferability of our model to deal with unseen road segments or regions.
      ls-type:: annotation
      hl-page:: 9
      hl-color:: yellow

    • 路线旁边 poi 的影响 Second, given the observation that the travel times of some routes have a considerable correlation with the POIs distributed along the roads.
      ls-type:: annotation
      hl-page:: 9
      hl-color:: yellow

      • 特定地点特定时间

      • poi 密集区域对 eta 预测影响 To address this issue, we plan to utilize the POI retrieval system [5, 11, 13] as an auxiliary tool to forecast which POIs would be densely populated and how extensively they would affect the ETA prediction.
        ls-type:: annotation
        hl-page:: 9
        hl-color:: yellow

      • TODO 待找 poi 相关

        • MetaLearned Spatial-Temporal POI Auto-Completion for the Search Engine at Baidu Maps.

        • Personalized Prefix Embedding for POI Auto-Completion in the Search Engine of Baidu Maps

        • HGAMN: Heterogeneous Graph Attention Matching Network for Multilingual POI Retrieval at Baidu Maps

Related work

  • Approaches to the ETA task

    • segment-based methods

      • computationally efficient and scalable
        ls-type:: annotation
        hl-page:: 9
        hl-color:: blue

      • do not account for the information of the travel route
        ls-type:: annotation
        hl-page:: 9
        hl-color:: blue

    • end-to-end methods

      • 之间方法对 拥堵传播建模不够 most existing methods are inefficient for modeling the traffic congestion propagation patterns along the route.
        ls-type:: annotation
        hl-page:: 9
        hl-color:: blue
  • STGCNN [[Traffic Flow Forecasting]]

    • 提升 GNN 层数感受野增加太多 increasing the depth of a GNN often means exponential expansion of the neighbor scope
      ls-type:: annotation
      hl-page:: 9
      hl-color:: blue

    • 子图 properly extracted subgraph
      ls-type:: annotation
      hl-page:: 9
      hl-color:: blue

  • [[@ConSTGAT: Contextual Spatial-Temporal Graph Attention Network for Travel Time Estimation at Baidu Maps]] 建模时空关系

  • [[@SSML: Self-Supervised Meta-Learner for En Route Travel Time Estimation at Baidu Maps]] 建模驾驶员行为

Approach

  • Traffic conditions are dynamic features

    • Traffic features from the past hour, bucketed into 12 five-minute slots: the traffic conditions of the past one hour are collected as features, which are divided into 12 time slots (5 minutes per time slot)
      ls-type:: annotation
      hl-page:: 3
      hl-color:: yellow

    • median speed, max speed, min speed, mean speed, and record counts as features
      ls-type:: annotation
      hl-page:: 3
      hl-color:: yellow

  • Congestion-sensitive Graph $$\mathcal{G}^{C S}=\left(\mathcal{L},\left.\left\{\mathcal{E}_r^f\right\}\right|_{r=1} ^5, \mathcal{E}^h\right)$$

    • we construct a congestion-sensitive graph based on the correlations of traffic patterns.
      ls-type:: annotation
      hl-page:: 2
      hl-color:: yellow

    • 对于某一个 link 找一阶相邻 link 以及高阶相邻link(可能和当前 link 的路况状态有关系)

      • we take advantage of the first-order neighbor links, as well as the high-order neighbor links whose traffic patterns are highly correlated to that of link 𝑙
        ls-type:: annotation
        hl-page:: 3
        hl-color:: yellow
    • First-order Neighbors
      ls-type:: annotation
      hl-page:: 3
      hl-color:: yellow

      • [[ConSTGAT]] 不同相邻 link 对于当前 link 的影响

        • 当前 link 的路况状态可能受下游影响大于上游 the traffic congestion is more likely to propagate from downstream links to upstream links.
          ls-type:: annotation
          hl-page:: 3
          hl-color:: blue
      • 具体过程

        • 定义多种 link 之间关系,并在建图中考虑这些关系 define multiple types of link relations and incorporate these relations into the construction of the congestion-sensitive graph
          ls-type:: annotation
          hl-page:: 3
          hl-color:: yellow

        • 用 attention 分别处理各种关系捕捉影响 use attention mechanism separately for each relation to capture the impact of neighbor links,
          ls-type:: annotation
          hl-page:: 3
          hl-color:: yellow

        • An edge describes the relation between two links; there are 5 types in total

          • [:span]
            ls-type:: annotation
            hl-page:: 3
            hl-color:: yellow

          • An edge describes the relation between two links
            ls-type:: annotation
            hl-page:: 3
            hl-color:: yellow

          • Type 2: upstream links

          • Type 3: downstream links

          • The remaining three types are links not on the route, but their traffic state can still affect the target link (e.g., vehicles blocking an intersection)

    • High-order Neighbors
      ls-type:: annotation
      hl-page:: 3
      hl-color:: yellow

      • 间接连接 link 也很重要 the long-distance associations between indirectly connected links are also crucial for ETA prediction
        ls-type:: annotation
        hl-page:: 3
        hl-color:: yellow

      • How are high-order neighbors sampled?

        • For each link, take its 2-hop to 5-hop neighbor links from historical routes

        • Compute the Pearson correlation between each link and its neighbor links
          ls-type:: annotation
          hl-page:: 4
          hl-color:: yellow
          $$c^r_{i,j} = \frac{\operatorname{cov}\left(T_1, T_2\right)}{\rho_{T_1} \rho_{T_2}}$$

          • Take each link's average traversal-time series over the past 2 hours in 5-minute buckets: $$T_1=\left[t_1^0, t_1^1, \cdots, t_1^{23}\right]$$ and $$T_2=\left[t_2^0, t_2^1, \cdots, t_2^{23}\right]$$
        • Sum the correlation coefficients of the same link pair over different routes to get $$c^{final}_{i,j}$$

        • For each link, keep the top-5 neighbor links by correlation coefficient

      • 连接 link 和 high-order neighbor links high-order edge is defined as an edge that connects a link and one of its high-order neighbor links.
        ls-type:: annotation
        hl-page:: 4
        hl-color:: yellow

      • [:span]
        ls-type:: annotation
        hl-page:: 3
        hl-color:: yellow

  • [[Graph Transformer]] Masked Label Prediction: Unified Message Passing Model for Semi-Supervised Classification

    • Multi-head attention learns the edge weights: it adopts the multi-head attention mechanism [23] to learn edge weights.
      ls-type:: annotation
      hl-page:: 4
      hl-color:: yellow

      • For each edge, compute an attention score

        • $$\begin{gathered}\mathbf{q}_{c, i}=\mathbf{W}_c^Q \mathbf{x}_i+\mathbf{b}_c^Q, \\ \mathbf{k}_{c, j}=\mathbf{W}_c^K \mathbf{x}_j+\mathbf{b}_c^K, \\ \mathbf{v}_{c, j}=\mathbf{W}_c^V \mathbf{x}_j+\mathbf{b}_c^V, \\ \alpha_{c, i, j}=\frac{\left\langle\mathbf{q}_{c, i}, \mathbf{k}_{c, j}\right\rangle}{\sum_{k \in \mathcal{N}(i)}\left\langle\mathbf{q}_{c, i}, \mathbf{k}_{c, k}\right\rangle}\end{gathered}$$
      • Compute the representation of link i

        • $$\mathbf{h}_i=\mathbf{x}_i+\frac{1}{C} \sum_{c=1}^C \sum_{j \in \mathcal{N}(i)} \alpha_{c, i, j} \mathbf{v}_{c, j}$$
    • Residual connections address the oversmoothing problem of vanilla [[GNN]]s: it addresses the oversmoothing problem in vanilla GNNs by residual connections.
      ls-type:: annotation
      hl-page:: 4
      hl-color:: yellow

  • route-aware graph transformer

    • [:span]
      ls-type:: annotation
      hl-page:: 4
      hl-color:: yellow

tags:: #[[Model Architecture]] #[[Graph Transformer]]

+ The rebuilt graph $\mathcal{G}^{C S}=\left(\mathcal{L},\left.\left\{\mathcal{E}_r\right\}\right|_{r=1} ^6\right)$ has six edge types; it is split into six subgraphs, and each subgraph is handled by its own transformer

  + $\mathbf{h}_i=\mathbf{x}_i+\frac{1}{6 C} \sum_{r=1}^6 \sum_{c=1}^C \sum_{j \in \mathcal{N}_r(i)} \alpha_{c, i, j}^{(r)} \mathbf{v}_{c, j}^{(r)}$

+ The graph transformer above cannot tell whether a link lies on the given route, so it cannot produce route-specific representations

  + the graph transformer is unable to identify whether a link belongs to a given route or not

ls-type:: annotation
hl-page:: 5
hl-color:: yellow

  + 5a 中路线不同,但是 a 通过 transformer 学习到表示可能相同

  + [:span]

ls-type:: annotation
hl-page:: 5
hl-color:: yellow

+ route-aware structural encoding

  + position encoding

    + 与当前 link 的最近距离

    + encode the order information of a link.

ls-type:: annotation
hl-page:: 5
hl-color:: yellow

    + Acts as a gate that controls how much the prediction relies on this link's traffic condition when the user requests the ETA: be regarded as a gate to control the degree of dependency of the traffic condition when a user requests the ETA.

ls-type:: annotation
hl-page:: 5
hl-color:: yellow

  + route identifier

    + 表示当前 link 是否在路线上

+ Integration

ls-type:: annotation
hl-page:: 5
hl-color:: yellow

  + a 1-D convolution layer (Conv1D) with a window size of 3

ls-type:: annotation
hl-page:: 5
hl-color:: yellow

  + An MLP with ReLU predicts each link's travel time; summing over links gives the ETA of the whole trip (a sketch of the losses follows after this block)

    + $\left[\hat{y}_1, \hat{y}_2, \cdots, \hat{y}_m\right]=\operatorname{MLP}\left(\operatorname{Conv1D}\left(\left[\mathbf{h}_1, \mathbf{h}_2, \cdots, \mathbf{h}_m\right]\right)\right)$

  + [[Multi-Task Learning]] optimizes both the link level and the route level: Multi-task learning is adopted to optimize the model parameters from both the link-level and the route-level.

ls-type:: annotation
hl-page:: 5
hl-color:: yellow

    + link-level loss function [[Huber Loss]]

      + $L_{l i n k}\left(\hat{y}_i, y_i\right)= \begin{cases}\frac{1}{2}\left(\hat{y}_i-y_i\right)^2, & \left|\hat{y}_i-y_i\right|<\delta, \\ \delta\left(\left|\hat{y}_i-y_i\right|-\frac{1}{\delta}\right), & \text { otherwise }\end{cases}$

    + route-level loss function

      + $L_{\text {route }}(\hat{y}, y)=\frac{|\hat{y}-y|}{y}$

    + 最终 loss

      + $L=\frac{1}{n} \sum_i^n\left(\frac{1}{m^{(i)}} \sum_{j=1}^{m^{(i)}} L_{\text {link }}\left(\hat{y}_j^{(i)}, y_j^{(i)}\right)+L_{\text {route }}\left(\hat{y}^{(i)}, y^{(i)}\right)\right)$
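
A minimal numpy sketch of the two loss terms defined above (the Huber-style link loss and the relative route loss); the per-link numbers are illustrative toy values.

```python
import numpy as np

# Minimal sketch of DuETA's training losses for one route.
def link_loss(y_hat, y, delta=1.0):
    err = np.abs(y_hat - y)
    return np.where(err < delta, 0.5 * err ** 2, delta * (err - 1.0 / delta))

def route_loss(y_hat_total, y_total):
    return np.abs(y_hat_total - y_total) / y_total      # relative (MAPE-style) error

y_hat = np.array([30.0, 42.0, 15.0])   # predicted seconds per link (toy numbers)
y = np.array([28.0, 50.0, 14.0])
loss = link_loss(y_hat, y).mean() + route_loss(y_hat.sum(), y.sum())
print(loss)
```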

Experimental findings

  • Data

    • The number of links per trip is similar to DiDi's

    • 2021.10.10-2021.11.20: 5 weeks for training, 1 week for testing

  • Metrics

    • MAE, RMSE, MAPE
  • Baseline

    • AVG 请求时刻路况速度

    • STANN

      • STGNN、attention+LSTM

      • 只考虑相邻 link

    • [[DCRNN]]

      • GCN 处理 spatial info

      • LSTM 处理 temporal info

    • DeepTravel

      • bidirectional LSTM
    • [[ConSTGAT]]

    • DuETA

      • the embedding size and the hidden size of DuETA to be 32
        ls-type:: annotation
        hl-page:: 6
        hl-color:: green

      • attention heads 𝐶 is set to be 8
        ls-type:: annotation
        hl-page:: 6
        hl-color:: green

      • Adam

      • 3e-5

  • 结果

    • DeepTravel 和 ConSTGAT

      • End-to-end methods are more effective than the segment-based methods
        ls-type:: annotation
        hl-page:: 6
        hl-color:: yellow

      • the correlations of spatial and temporal information are jointly modeled.
        ls-type:: annotation
        hl-page:: 6
        hl-color:: yellow

    • DuETA

      • 对远距离拥堵更加敏感 sensitive to long-distance traffic congestion,
        ls-type:: annotation
        hl-page:: 6
        hl-color:: yellow

      • 处理交通事件带来的影响 On the other hand, the cumulative effect of delay variations over time caused by traffic events on the road network can be alleviated by the high efficiency of traffic congestion pattern modeling.
        ls-type:: annotation
        hl-page:: 6
        hl-color:: yellow

    • [:span]
      ls-type:: annotation
      hl-page:: 6
      hl-color:: yellow

  • Ablation Studies
    ls-type:: annotation
    hl-page:: 7
    hl-color:: yellow

    • 主要组件对比

      • removing both components hurts performance significantly in all three cities.
        ls-type:: annotation
        hl-page:: 7
        hl-color:: yellow

      • [:span]
        ls-type:: annotation
        hl-page:: 6
        hl-color:: yellow

  • route identifier 效果 To obtain an understanding of the effect of the route identifier, we visualize the distributions of the attention weights in Figure 7.
    ls-type:: annotation
    hl-page:: 7
    hl-color:: yellow

    • w/o 组 Off route 的 attention weight 比 Complete 组大,加上这个模块模型能更关注路线上的 link

      • enables DuETA to pay more attention to the links that are in the travel routes.
        ls-type:: annotation
        hl-page:: 7
        hl-color:: yellow
    • [:span]
      ls-type:: annotation
      hl-page:: 7
      hl-color:: yellow

  • congestion-sensitive graph
    ls-type:: annotation
    hl-page:: 7
    hl-color:: yellow

    • 存在部分远处 link 相关性比附近 link 强 he average Pearson correlation coefficients of our selected high-order neighbors is much higher than those of the second-order and third-order neighbors
      ls-type:: annotation
      hl-page:: 7
      hl-color:: yellow

    • [:span]
      ls-type:: annotation
      hl-page:: 7
      hl-color:: yellow

    • 远处拥堵 case 提升效果大 examine the relative improvements of high-order neighbors in cases of traffic congestion2 and normal traffic.
      ls-type:: annotation
      hl-page:: 7
      hl-color:: yellow

      • [:span]
        ls-type:: annotation
        hl-page:: 8
        hl-color:: yellow

  • Practical Applicability
    ls-type:: annotation
    hl-page:: 8
    hl-color:: yellow

    • P 发生拥堵,dueta 能预测未来路线上会堵

    • [:span]
      ls-type:: annotation
      hl-page:: 8
      hl-color:: yellow

  • Online Evaluation
    ls-type:: annotation
    hl-page:: 8
    hl-color:: yellow

    • 2022.4.12-2022.4.18

    • 全部、长短单、平峰和高峰

    • 在线 RMSE 高 averaged RMSE scores in the online evaluation of DuETA are higher than those in the offline evaluation
      ls-type:: annotation
      hl-page:: 8
      hl-color:: green

      • 数据噪音大
    • [:span]
      ls-type:: annotation
      hl-page:: 9
      hl-color:: yellow

My takeaways

  • It only considers downstream relations; is taking the upstream into account (as TCN-style models do) really useful?

@DeeprETA: An ETA Post-processing System at Scale

[[Abstract]]

  • Estimated Time of Arrival (ETA) plays an important role in delivery and ride-hailing platforms. For example, Uber uses ETAs to calculate fares, estimate pickup times, match riders to drivers, plan deliveries, and more.

  • Commonly used route planning algorithms predict an ETA conditioned on the best available route, but such ETA estimates can be unreliable when the actual route taken is not known in advance.

    • A route-conditioned ETA cannot handle deviations from the planned route
  • In this paper, we describe an ETA post-processing system in which a deep residual ETA network (DeeprETA) refines naive ETAs produced by a route planning algorithm.

  • Offline experiments and online tests demonstrate that post-processing by DeeprETA significantly improves upon the accuracy of naive ETAs as measured by mean and median absolute error. We further show that post-processing by DeeprETA attains lower error than competitive baseline regression models.

[[Attachments]]


@Pyraformer: Low-Complexity Pyramidal Attention for Long-Range Time Series Modeling and Forecasting

[[Abstract]]

  • Accurate prediction of the future given the past based on time series data is of paramount importance, since it opens the door for decision making and risk management ahead of time. In practice, the challenge is to build a flexible but parsimonious model that can capture a wide range of temporal dependencies.

  • In this paper, we propose Pyraformer by exploring the [[multiresolution representation]] of the time series.

    • Specifically, we introduce the [[pyramidal attention module]] (PAM) in which

      • the inter-scale tree structure summarizes features at different resolutions

      • and the intra-scale neighboring connections model the temporal dependencies of different ranges.

    • Under mild conditions, the maximum length of the signal traversing path in Pyraformer is a constant (i.e., $$\mathcal{O}(1)$$) with regard to the sequence length $$L$$, while its time and space complexity scale linearly with $$L$$.

  • Extensive numerical results show that Pyraformer typically achieves the highest prediction accuracy in both single-step and long-range forecasting tasks with the least amount of time and memory consumption, especially when the sequence is long.

[[Summary]]


@Transformers in Time Series: A Survey

[[Abstract]]

  • Transformers have achieved superior performances in many tasks in natural language processing and computer vision, which also triggered great interests in the time series community.

  • Among multiple advantages of transformers, the ability to capture long-range dependencies and interactions is especially attractive for time series modeling, leading to exciting progress in various time series applications.

    • In this paper, we systematically review transformer schemes for time series modeling by highlighting their strengths as well as limitations. In particular, we examine the development of time series transformers in two perspectives.

      • From the perspective of network structure, we summarize the adaptations and modification that have been made to transformer in order to accommodate the challenges in time series analysis.

      • From the perspective of applications, we categorize time series transformers based on common tasks including forecasting, anomaly detection, and classification.

    • Empirically, we perform robust analysis, model size analysis, and seasonal-trend decomposition analysis to study how transformers perform in time series.

    • Finally, we discuss and suggest future directions to provide useful research guidance.

  • A corresponding resource list which will be continuously updated can be found in the GitHub repository1.

[[Attachments]]

Input Encoding and Positional Encoding

  • Absolute Positional Encoding

  • Relative Positional Encoding

  • Hybrid positional encodings

Network Modifications for Time Series

  + [[LogTrans]] [Li et al., 2019] and [[Pyraformer]] explicitly introducing a sparsity bias

  + Remove part of the self-attention matrix: [[@Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting]] [[FEDformer]]
  • Architecture Level

    • renovate transformer

    • hierarchical architecture 分层结构

Applications of Time Series Transformers

  • Forecasting

    • Time Series Forecasting

      • [[LogTrans]]

        • proposed convolutional self-attention by employing causal convolutions to generate queries and keys in the self-attention layer 因果卷积引入子注意力计算

        • a Logsparse mask

      • [[@Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting]]

      • AST [[Adversarial sparse transformer for time series forecasting]]

        • 使用生成对抗编码器-解码器架训练用于时间序列预测的稀疏 Transformer 模型

        • 对抗训练可以直接塑造网络的输出来改善预测效果,避免逐步预测带来的累积误差

          • directly shaping the output distribution of network to avoid the error accumulation through one-step ahead inference
      • [[Autoformer]]

        • simple seasonal-trend decomposition architecture

        • an auto-correlation mechanism working as an attention module, with $$O(L\log L)$$ complexity

          • measures the time-delay similarity between inputs signal and aggregate the top-k similar sub-series to produce the output
      • [[FEDformer]]

        • 利用 [[Fourier transform]] 和 [[Wavelet transform]] 处理 frequency domain 频域中的注意力操作

          • linear complexity
      • [[@Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting]]

        • multi-horizon forecasting model with static covariate encoders, gating feature selection and temporal self-attention decoder
      • [[SSDNet]] [[ProTran]]

        • combine Transformer with state space models to provide probabilistic forecasts 提供概率预测
      • [[Pyraformer]]

        • hierarchical pyramidal attention module with a binary-tree following path

      • [[Aliformer]]

        • Knowledge-guided attention
    • Spatio-Temporal Forecasting [[Traffic Flow Forecasting]]

      • Traffic transformer: Capturing the continuity and periodicity of time series for traffic forecasting

        • self attention module to capture temporal-temporal dependencies 时序特征

        • Graph neural network module to capture spatial dependencies 空间特征

      • Spatialtemporal Transformer

        • 空间 transformer 辅助图卷积网络来捕获空间依赖关系
      • Spatio-temporal graph Transformer

        • 基于注意力的图卷积机制
    • Event Forecasting

      • temporal point processes (TPP)
  • Anomaly Detection

  • Classification

    • [[GTN]]

Experimental Evaluation and Discussion

Robustness, model size, and the ability to capture seasonality and trend in time series

  • robustness analysis, model size analysis, and seasonal-trend decomposition analysis

  • Seasonal-trend decomposition is a key ingredient when Transformers tackle time-series forecasting

  • After adding the proposed moving-average trend decomposition architecture, every model improves over its original version

Future Research Opportunities

  • [[inductive bias]] for Time Series Transformers

    • 避免过拟合,训练 transformer 需要大量数据。

    • 时序数据具有 seasonal/periodic and trend patterns

    • 将对于时序数据模型的理解和特定任务的特征做为归纳偏置引入 transformer

  • [[GNN]]

    • Strengthen the ability to model spatial dependencies and relations across dimensions
  • [[预训练]]

    • Pre-trained Transformers for time series so far focus on classification tasks
  • [[Neural architecture search]]

    • How to build efficient Transformer architectures

Ref


@Autoformer: Decomposition Transformers with Auto-Correlation for Long-Term Series Forecasting

[[Attachments]]

Key points

Core contributions

  • Autoformer as a novel decomposition architecture with an Auto-Correlation mechanism
    ls-type:: annotation
    hl-page:: 1
    hl-color:: yellow
    [[Auto-Correlation Mechanism]]: replaces point-wise attention with series-level connections at lower complexity

    • 序列级别依赖发现以及表示聚合 conducts the dependencies discovery and representation aggregation at the sub-series level.
      ls-type:: annotation
      hl-page:: 1
      hl-color:: yellow
  • Decomposition Architecture: a deep decomposition architecture that extracts more predictable components from intricate temporal patterns

    • Reasoning about intricate temporal patterns
      ls-type:: annotation
      hl-page:: 2
      hl-color:: yellow

      • process the complex time series and extract more predictable components.
        ls-type:: annotation
        hl-page:: 2
        hl-color:: yellow

      • 常规对前向数据进行分解,忽视未来可能发生的分解组件之间的潜在交互作用

        • This common usage limits the capabilities of decomposition and overlooks the potential future interactions among decomposed components.
          ls-type:: annotation
          hl-page:: 2
          hl-color:: blue
      • 分解可以解开纠缠的时间模式并突出时间序列的固有属性 can ravel out the entangled temporal patterns and highlight the inherent properties of time series
        ls-type:: annotation
        hl-page:: 2
        hl-color:: blue

      • 对子序列进行分解,基于时间序列周期性导出的过程相似性构建一种系列级连接 sub-series at the same phase position among periods often present similar temporal processes
        ls-type:: annotation
        hl-page:: 2
        hl-color:: yellow

      • 逐步分解整个预测过程中的隐藏序列,包括过去的序列和预测的中间结果

        • decompose the hidden series throughout the whole forecasting process, including both the past series and the predicted intermediate results.
          ls-type:: annotation
          hl-page:: 3
          hl-color:: yellow

Core problem

  • The trends in the raw series are entangled, so temporal dependencies cannot be discovered directly over long horizons

    • unreliable to discover the temporal dependencies directly from the long-term time series because the dependencies can be obscured by entangled temporal patterns.
      ls-type:: annotation
      hl-page:: 1
      hl-color:: green

    • 需要处理复杂的时间模式,打破计算效率和信息利用的瓶颈 handling intricate temporal patterns and breaking the bottleneck of computation efficiency and information utilization.
      ls-type:: annotation
      hl-page:: 3
      hl-color:: blue

    • The forecast horizon is much longer than the input length

  • The Transformer's quadratic complexity

    • 之前方法尝试使用稀疏 self-attention mproving self-attention to a sparse version
      ls-type:: annotation
      hl-page:: 1
      hl-color:: green

      • 稀疏注意力机制将造成信息的丢失,成为长时间序列预测的瓶颈
    • these models still utilize the point-wise representation aggregation
      ls-type:: annotation
      hl-page:: 1
      hl-color:: blue

  • The Transformer aggregates information point-wise

    • self-attention captures dependencies between individual time steps

      • Hard to discover reliable temporal dependencies directly

Related work

  • Earlier methods center on recurrent connections, temporal attention, or causal convolution.
    ls-type:: annotation
    hl-page:: 2
    hl-color:: blue

    • [[DeepAR]] 自回归 + RNN 建模未来序列的概率分布

      • combines autoregressive methods and RNNs to model the probabilistic distribution of future series.
        ls-type:: annotation
        hl-page:: 2
        hl-color:: blue
    • [[LSTNet]] CNNs + ResNet 捕捉 short-term 和 long-term temporal patterns

    • [[TCN]]

  • Transformer 类方法

    • [[Reformer]] [[local-sensitive hashing attention]] #mark/paper

    • [[Informer]] KL + [[ProbSparse Attention]]

  • Decomposition of Time Series

    • 将原始时间序列分解成多个序列,新序列更容易预测

      • each representing one of the underlying categories of patterns that are more predictable.
        ls-type:: annotation
        hl-page:: 3
        hl-color:: blue
    • [[Prophet]] with trend-seasonality decomposition

    • [[N-BEATS]] with basis expansion

    • [[DeepGLO]] with matrix decomposition

    • 缺点

      • 简单分解限制

        • limited by the plain decomposition effect of historical series
          ls-type:: annotation
          hl-page:: 3
          hl-color:: yellow
      • 忽视层次交互

        • overlooks the hierarchical interaction between the underlying patterns of series in the long-term future.
          ls-type:: annotation
          hl-page:: 3
          hl-color:: yellow

        • 预测问题未来的不可知性,通常方法先对过去序列进行分解,再分别预测,这会造成预测结果受限于分解效果,并且忽视了未来各个组分之间的相互作用。

Approach

  • Decomposition Architecture
    ls-type:: annotation
    hl-page:: 3
    hl-color:: yellow

    • [:span]
      ls-type:: annotation
      hl-page:: 4
      hl-color:: yellow

tags:: #[[Model Architecture]] [[Encoder-Decoder]]

+ series decomposition block

ls-type:: annotation
hl-page:: 3
hl-color:: yellow
保留周期部分

  + 序列分解成趋势项和周期项部分 separate the series into trend-cyclical and seasonal parts.

ls-type:: annotation
hl-page:: 3
hl-color:: yellow

  + 在预测过程中,模型交替进行预测结果优化和序列分解,从隐藏变量中逐步分离趋势项与周期项

  + 逐步从预测的中间隐藏变量中提取长期稳定的趋势 xtract the long-term stationary trend from predicted intermediate hidden variables progressively. 

ls-type:: annotation
hl-page:: 3
hl-color:: yellow

  + Uses a [[Moving Average]] to smooth out periodic fluctuations and highlight the long-term trend (a sketch follows after this block): smooth out periodic fluctuations and highlight the long-term trends

ls-type:: annotation
hl-page:: 3
hl-color:: yellow

    + $\mathcal{X}_{\mathrm{s}}, \mathcal{X}_{\mathrm{t}}=\operatorname{SeriesDecomp}(\mathcal{X})$

      + $\begin{aligned} & \mathcal{X}_{\mathrm{t}}=\operatorname{Avg} \operatorname{Pool}(\operatorname{Padding}(\mathcal{X})) \\ & \mathcal{X}_{\mathrm{s}}=\mathcal{X}-\mathcal{X}_{\mathrm{t}}\end{aligned}$

      + $\mathcal{X}_{\mathrm{s}}$: the seasonal part

      + $\mathcal{X}_{\mathrm{t}}$: the trend-cyclical part
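
A minimal numpy sketch of the series-decomposition block: a padded moving average gives the trend-cyclical part and the remainder is the seasonal part; the kernel size and toy series are illustrative.

```python
import numpy as np

# Minimal sketch of SeriesDecomp: moving-average trend + seasonal residual.
def series_decomp(x, kernel_size=25):
    pad = kernel_size // 2
    xp = np.concatenate([np.repeat(x[:1], pad), x, np.repeat(x[-1:], pad)])  # edge padding
    trend = np.convolve(xp, np.ones(kernel_size) / kernel_size, mode="valid")
    seasonal = x - trend
    return seasonal, trend

t = np.arange(200)
x = 0.05 * t + np.sin(2 * np.pi * t / 24)   # toy series: trend + seasonality
seasonal, trend = series_decomp(x)
print(seasonal.shape, trend.shape)           # both (200,)
```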

+ [[Encoder]]

  + Encoder 输入过去 I 步 $\mathcal{X}_{\mathrm{en}} \in \mathbb{R}^{I \times d}$

  + 建模周期性部分,逐步消除趋势项(在 decoder 中通过累积得到) focuses on the seasonal part modeling

ls-type:: annotation
hl-page:: 4
hl-color:: yellow

    + 当成 decoder 的交叉信息 be used as the cross information to help the decoder refine prediction results

ls-type:: annotation
hl-page:: 4
hl-color:: yellow

  + 流程

    + _ the eliminated trend part

ls-type:: annotation
hl-page:: 4
hl-color:: red
消除趋势项

    + $\begin{aligned} & \mathcal{S}_{\text {en }}^{l, 1},_{-}=\operatorname{SeriesDecomp}\left(\text { Auto-Correlation }\left(\mathcal{X}_{\text {en }}^{l-1}\right)+\mathcal{X}_{\text {en }}^{l-1}\right) \\ & \mathcal{S}_{\text {en }}^{l, 2},_{-}=\operatorname{SeriesDecomp}\left(\text { FeedForward }\left(\mathcal{S}_{\text {en }}^{l, 1}\right)+\mathcal{S}_{\text {en }}^{l, 1}\right)\end{aligned}$

+ [[Decoder]]: models the trend and seasonal parts separately through decomposition

  + Half of the past window plus placeholder padding

    + $\begin{aligned} \mathcal{X}_{\text {ens }}, \mathcal{X}_{\text {ent }} & =\operatorname{SeriesDecomp}\left(\mathcal{X}_{\text {en } \frac{I}{2}: I}\right) \\ \mathcal{X}_{\text {des }} & =\operatorname{Concat}\left(\mathcal{X}_{\text {ens }}, \mathcal{X}_0\right) \\ \mathcal{X}_{\text {det }} & =\operatorname{Concat}\left(\mathcal{X}_{\text {ent }}, \mathcal{X}_{\text {Mean }}\right),\end{aligned}$

    + seasonal part $\mathcal{X}_{\mathrm{des}} \in \mathbb{R}^{\left(\frac{1}{2}+O\right) \times d}$

    + trend-cyclical part $\mathcal{X}_{\mathrm{det}} \in \mathbb{R}^{\left(\frac{1}{2}+O\right) \times d}$

  + 趋势-周期累积结构 the accumulation structure for trend-cyclical components

ls-type:: annotation
hl-page:: 4
hl-color:: yellow

    + 从中间隐变量提取潜在趋势,使得模型可以逐步改进趋势预测并且消除干扰信息,以便于在自相关性中发现基于周期的依赖关系。

    + 其中,对于周期项,自相关机制利用序列的周期性质,聚合不同周期中具有相似过程的子序列;

    + Note that the model extracts the potential trend from the intermediate hidden variables during the decoder, allowing Autoformer to progressively refine the trend prediction and eliminate interference information for period-based dependencies discovery in Auto-Correlation. 

ls-type:: annotation
hl-page:: 4
hl-color:: red

  + the stacked Auto-Correlation mechanism for seasonal component

ls-type:: annotation
hl-page:: 4
hl-color:: yellow

  + 流程

    + $\begin{aligned} \mathcal{S}_{\mathrm{de}}^{l, 1}, \mathcal{T}_{\mathrm{de}}^{l, 1} & =\operatorname{SeriesDecomp}\left(\text { Auto-Correlation }\left(\mathcal{X}_{\mathrm{de}}^{l-1}\right)+\mathcal{X}_{\mathrm{de}}^{l-1}\right) \\ \mathcal{S}_{\mathrm{de}}^{l, 2}, \mathcal{T}_{\mathrm{de}}^{l, 2} & =\operatorname{SeriesDecomp}\left(\text { Auto-Correlation }\left(\mathcal{S}_{\mathrm{de}}^{l, 1}, \mathcal{X}_{\mathrm{en}}^N\right)+\mathcal{S}_{\mathrm{de}}^{l, 1}\right) \\ \mathcal{S}_{\mathrm{de}}^{l, 3}, \mathcal{T}_{\mathrm{de}}^{l, 3} & =\operatorname{SeriesDecomp}\left(\text { FeedForward }\left(\mathcal{S}_{\mathrm{de}}^{l, 2}\right)+\mathcal{S}_{\mathrm{de}}^{l, 2}\right) \end{aligned}$

    + 趋势项,通过累积的方式逐步从预测的隐变量中提取出趋势信息

      + ${\mathcal{T}_{\mathrm{de}}^l =\mathcal{T}_{\mathrm{de}}^{l-1}+\mathcal{W}_{l, 1} * \mathcal{T}_{\mathrm{de}}^{l, 1}+\mathcal{W}_{l, 2} * \mathcal{T}_{\mathrm{de}}^{l, 2}+\mathcal{W}_{l, 3} * \mathcal{T}_{\mathrm{de}}^{l, 3}}$
  • [[Auto-Correlation Mechanism]]

    • [:span]
      ls-type:: annotation
      hl-page:: 5
      hl-color:: yellow

    • Efficient series-level connections that expand how much information is used

    • Period-based dependency discovery

      • Sub-series at the same phase position of different periods usually show similar processes: same phase position among periods naturally provides similar sub-processes.
        ls-type:: annotation
        hl-page:: 5
        hl-color:: yellow

      • The [[autocorrelation]] of a discrete-time process in [[Stochastic process theory]]

        • $$\mathcal{R}_{\mathcal{X} \mathcal{X}}(\tau)=\lim _{L \rightarrow \infty} \frac{1}{L} \sum_{t=1}^L \mathcal{X}_t \mathcal{X}_{t-\tau}$$

        • $$\mathcal{R}_{\mathcal{X} \mathcal{X}}(\tau)$$ measures the similarity between the series $$\{ \mathcal{X}_t \}$$ and its $$\tau$$-lagged version $$\{ \mathcal{X}_{t - \tau} \}$$

        • Treat this lagged similarity as the unnormalized confidence of a period estimate, i.e., the confidence that the period length is $$\tau$$ is $$\mathcal{R}(\tau)$$

          • If the period is $$\tau$$, then $$\mathcal{X}_{\tau: L-1}$$ and $$\mathcal{X}_{0: L-\tau-1}$$ will be highly similar
      • 取最相关 k 个长度 choose the most possible k period lengths
        ls-type:: annotation
        hl-page:: 5
        hl-color:: yellow

    • Time delay aggregation

      • This part aggregates the sub-series: roll the series based on selected time delay
        ls-type:: annotation
        hl-page:: 5
        hl-color:: yellow

      • Aggregate the information of similar sub-series

      • Procedure

        • Compute the top $$k = c \log L$$ lags

          • $$\tau_1, \cdots, \tau_k=\underset{\tau \in\{1, \cdots, L\}}{\arg \operatorname{Topk}}\left(\mathcal{R}_{\mathcal{Q}, \mathcal{K}}(\tau)\right)$$
        • For the selected lags, compute the correlations and normalize them with a softmax

          • $$\widehat{\mathcal{R}}_{\mathcal{Q}, \mathcal{K}}\left(\tau_1\right), \cdots, \widehat{\mathcal{R}}_{\mathcal{Q}, \mathcal{K}}\left(\tau_k\right)=\operatorname{SoftMax}\left(\mathcal{R}_{\mathcal{Q}, \mathcal{K}}\left(\tau_1\right), \cdots, \mathcal{R}_{\mathcal{Q}, \mathcal{K}}\left(\tau_k\right)\right)$$
        • Roll aligns the information: $$\mathcal{X}_{0: L-\tau-1}$$ is shifted to the front of the series, so that $$\mathcal{X}_{0: L-\tau-1}$$ and $$\mathcal{X}_{\tau: L-1}$$ carry similar patterns

          • during which elements that are shifted beyond the first position are re-introduced at the last position
            ls-type:: annotation
            hl-page:: 5
            hl-color:: red

          • $$\text{Auto-Correlation}(\mathcal{Q}, \mathcal{K}, \mathcal{V})=\sum_{i=1}^k \operatorname{Roll}\left(\mathcal{V}, \tau_i\right) \widehat{\mathcal{R}}_{\mathcal{Q}, \mathcal{K}}\left(\tau_i\right)$$

      • Multi-head

        • $$\text{MultiHead}(\mathcal{Q}, \mathcal{K}, \mathcal{V}) =\mathcal{W}_{\text {output }} * \text{Concat}\left(\operatorname{head}_1, \cdots, \operatorname{head}_h\right), \quad \operatorname{head}_i =\text{Auto-Correlation}\left(\mathcal{Q}_i, \mathcal{K}_i, \mathcal{V}_i\right)$$
      • Complexity $$\mathcal{O}(L \log L)$$

        • Correlations have to be computed for every $$\tau \in [1, L)$$

        • By the Wiener-Khinchin theorem, the autocorrelation can be computed with Fast Fourier Transforms
          ls-type:: annotation
          hl-page:: 6
          hl-color:: yellow
          ([[快速傅里叶变换]]; a sketch follows below)

    • 与其他方法对比

      • [:span]
        ls-type:: annotation
        hl-page:: 6
        hl-color:: yellow
    • 序列级高效连接

    • self-attention family only calculates the relation between scattered points
      ls-type:: annotation
      hl-page:: 6
      hl-color:: blue

    • 我们采用时间延迟块来聚合底层周期中相似的子序列。 we adopt the time delay block to aggregate the similar sub-series from underlying periods.
      ls-type:: annotation
      hl-page:: 6
      hl-color:: yellow
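
    • 代码示意:下面是 Auto-Correlation 的一个极简 PyTorch 草图(单头,假设 Q/K/V 形状均为 [B, L, d]),仅用于说明「FFT 求各时延相关性 + Top-k 时延 Roll 聚合」的流程,并非论文官方实现,函数与变量命名均为示意。

```python
import torch

def auto_correlation(q, k, v, top_k=3):
    """Auto-Correlation 简化示意:q/k/v 形状 [B, L, d]。"""
    B, L, d = q.shape
    # Wiener-Khinchin:频域相乘再逆变换,一次得到所有时延 tau 的相关性 R(tau)
    q_fft = torch.fft.rfft(q, dim=1)
    k_fft = torch.fft.rfft(k, dim=1)
    corr = torch.fft.irfft(q_fft * torch.conj(k_fft), n=L, dim=1)  # [B, L, d]
    corr = corr.mean(dim=-1)                                       # [B, L],对通道取均值
    # 选出最可能的 top_k 个周期长度,并对相关性做 softmax 归一化
    weights, delays = torch.topk(corr, top_k, dim=1)               # [B, top_k]
    weights = torch.softmax(weights, dim=1)
    # 按时延 Roll 对齐相似子序列后加权聚合
    out = torch.zeros_like(v)
    for i in range(top_k):
        for b in range(B):
            rolled = torch.roll(v[b], shifts=-int(delays[b, i]), dims=0)
            out[b] = out[b] + weights[b, i] * rolled
    return out
```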

实验结论

  • 参数设置

    • ADAM + early stopped

    • Autoformer contains 2 encoder layers and 1 decoder layer.
      ls-type:: annotation
      hl-page:: 7
      hl-color:: yellow

  • 对比

    • Informer [ 48 ], Reformer [23 ], LogTrans [26 ], two RNN-based models: LSTNet [ 25], LSTM [ 17] and CNN-based TCN [ 4] as baselines.
      ls-type:: annotation
      hl-page:: 7
      hl-color:: yellow

    • N-BEATS [29], DeepAR [34], Prophet [39] and ARIMA
      ls-type:: annotation
      hl-page:: 7
      hl-color:: yellow

  • 实验结果

    • 预测方式:固定输入长度 96,评估预测长度 96/192/336/720

      • we fix the input length and evaluate models with a wide range of prediction lengths: 96, 192, 336, 720.
        ls-type:: annotation
        hl-page:: 8
        hl-color:: yellow
    • [[multivariate]]

      • 预测长度变长时,Autoformer 的表现变化也较平稳 we can also find that the performance of Autoformer changes quite steadily as the prediction length O increases
        ls-type:: annotation
        hl-page:: 8
        hl-color:: yellow

      • [:span]
        ls-type:: annotation
        hl-page:: 7
        hl-color:: yellow

    • Univariate results
      ls-type:: annotation
      hl-page:: 8
      hl-color:: yellow
      单变量

      • . This situation of ARIMA can be benefited from its inherent capacity for non-stationary economic data but is limited by the intricate temporal patterns of real-world series.
        ls-type:: annotation
        hl-page:: 8
        hl-color:: yellow

      • [:span]
        ls-type:: annotation
        hl-page:: 8
        hl-color:: yellow

  • [[Ablation Study]]

    • Decomposition architecture
      ls-type:: annotation
      hl-page:: 8
      hl-color:: yellow

      • 具有较好的通用性,其他模型加上分解结构后效果有提升;随着预测时长的延长,效果提升更明显

        • 减少复杂模式引起的干扰 our method can generalize to other models and release the capacity of other dependencies learning mechanisms, alleviate the distraction caused by intricate patterns
          ls-type:: annotation
          hl-page:: 9
          hl-color:: yellow
      • 对比深度分解架构和先分解再使用两个模型预测的方式,后者参数多,但是表现不好。

      • [:span]
        ls-type:: annotation
        hl-page:: 8
        hl-color:: yellow

    • Auto-Correlation vs. self-attention family
      ls-type:: annotation
      hl-page:: 9
      hl-color:: yellow

      • 效果超过 full attention,序列级别建模带来的收益

      • 可以预测更长序列

      • [:span]
        ls-type:: annotation
        hl-page:: 9
        hl-color:: yellow

  • Model Analysis
    ls-type:: annotation
    hl-page:: 9
    hl-color:: yellow

    • time series decomposition

      • 随着序列分解单元的数量增加,模型学到的趋势项会越来越接近数据的真实结果,周期项可以更好地捕捉序列变化情况。

      • [:span]
        ls-type:: annotation
        hl-page:: 9
        hl-color:: yellow

    • Dependencies learning

      • 找到的注意力更合理 Autoformer can discover the relevant information more sufficiently and precisely.
        ls-type:: annotation
        hl-page:: 9
        hl-color:: yellow

      • 模型自相关机制可以正确发掘出每个周期的下降过程,没有误识别和漏识别,注意力机制存在错误和漏缺

      • [:span]
        ls-type:: annotation
        hl-page:: 10
        hl-color:: yellow

    • Complex seasonality modeling
      ls-type:: annotation
      hl-page:: 9
      hl-color:: yellow

      • 学习到的长度有意义 Autoformer can capture the complex seasonalities of real-world series from deep representations and further provide a human-interpretable prediction.
        ls-type:: annotation
        hl-page:: 10
        hl-color:: yellow

      • 高的部分说明有对应的周期性

      • [:span]
        ls-type:: annotation
        hl-page:: 10
        hl-color:: yellow

    • Efficiency analysis

读后总结

[[Autoformer Code]]


@ETA Prediction with Graph Neural Networks in Google Maps

[[Abstract]]

  • Travel-time prediction constitutes a task of high importance in transportation networks, with web mapping services like Google Maps regularly serving vast quantities of travel time queries from users and enterprises alike. Further, such a task requires accounting for complex spatiotemporal interactions (modelling both the topological properties of the road network and anticipating events—such as rush hours—that may occur in the future). Hence, it is an ideal target for graph representation learning at scale. Here we present a graph neural network estimator for estimated time of arrival (ETA) which we have deployed in production at Google Maps. While our main architecture consists of standard GNN building blocks, we further detail the usage of training schedule methods such as MetaGradients in order to make our model robust and production-ready. We also provide prescriptive studies: ablating on various architectural decisions and training regimes, and qualitative analyses on real-world situations where our model provides a competitive edge. Our GNN proved powerful when deployed, significantly reducing negative ETA outcomes in several regions compared to the previous production baseline (40+% in cities like Sydney).

[[Attachments]]


@Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting

[[Abstract]]

  • Many real-world applications require the prediction of long sequence time-series, such as electricity consumption planning. Long sequence time-series forecasting (LSTF) demands a high prediction capacity of the model, which is the ability to capture precise long-range dependency coupling between output and input efficiently. Recent studies have shown the potential of Transformer to increase the prediction capacity. However, there are several severe issues with Transformer that prevent it from being directly applicable to LSTF, including quadratic time complexity, high memory usage, and inherent limitation of the encoder-decoder architecture. To address these issues, we design an efficient transformer-based model for LSTF, named Informer, with three distinctive characteristics: (i) a ProbSparse self-attention mechanism, which achieves O(L log L) in time complexity and memory usage, and has comparable performance on sequences’ dependency alignment. (ii) the self-attention distilling highlights dominating attention by halving cascading layer input, and efficiently handles extreme long input sequences. (iii) the generative style decoder, while conceptually simple, predicts the long time-series sequences at one forward operation rather than a step-by-step way, which drastically improves the inference speed of long-sequence predictions. Extensive experiments on four large-scale datasets demonstrate that Informer significantly outperforms existing methods and provides a new solution to the LSTF problem.

[[Attachments]]

Long sequence [[Time Series Forecasting]] (LSTF)

  • 长序列时间序列预测需要模型能够有效捕捉输出和输入之间精确的 long-range dependency coupling

LSTF 中使用 [[Transformer]] 需要解决的问题 #card #incremental

  • self-attention 计算复杂度 $O(L^2)$

  • 多层 encoder/decoder 结构内存增长

  • dynamic decoding 方式预测耗时长

网络结构

ProbSparse self-attention

  • 替换 inner product self-attention

    • [[Sparse Transformer]] 结合行输入和列输出

    • [[LogSparse Transformer]] cyclical pattern

    • [[Reformer]] locally-sensitive hashing(LSH) self-attention

    • [[Linformer]]

    • [[Transformer-XL]] [[Compressive Transformer]] use auxiliary hidden states to capture long-range dependency

    • [[Longformer]]

  • 其他优化 self-attention 工作存在的问题

    • 缺少理论分析

    • 对于 multi-head self-attention 每个 head 都采用相同的优化策略

  • self-attention 点积结果服从 long tail distribution

    • 较少点积对贡献绝大部分的注意力得分

    • 现实含义:序列中某个元素一般只会和少数几个元素具有较高的相似性/关联性

  • 第 i 个 query 用 $q_i$ 表示

    • $\mathcal{A}\left(q_i, K, V\right)=\sum_j \frac{k\left(q_i, k_j\right)}{\sum_l k\left(q_i, k_l\right)} v_j=\mathbb{E}_{p\left(k_j \mid q_i\right)}\left[v_j\right]$

    • $p\left(k_j \mid q_i\right)=\frac{k\left(q_i, k_j\right)}{\sum_l k\left(q_i, k_l\right)}$

    • $k\left(q_i, k_j\right)=\exp \left(\frac{q_i k_j^T}{\sqrt{d}}\right)$

  • query 稀疏性判断方法

    • $p(k_j \mid q_i)$ 和 [[均匀分布]] q 的 [[KL Divergence]]

      • q 是均匀分布,相当于每个 key 的概率都是 $\frac{1}{L}$

      • 如果 query 得到的分布类似于均匀分布,每个概率值都趋近于 $\frac{1}{L}$,值很小,这样的 query 不会提供什么价值。

      • p 和 q 分布差异越大的结果越是我们需要的 query

      • p 和 q 的顺序和论文中的不同 $D(p \| q)=\sum_{x} p(x) \log \frac{p(x)}{q(x)}=E_{p(x)}\left(\log \frac{p(x)}{q(x)}\right)$

    • $KL(q \| p)=\ln \sum_{l=1}^{L_k} e^{q_i k_l^T / \sqrt{d}}-\frac{1}{L_k} \sum_{j=1}^{L_k} q_i k_j^T / \sqrt{d}-\ln L_k$

      • 把公式代入,然后化简

+ $M\left(q_i, K\right)=\ln \sum_{l=1}^{L_k} e^{q_i k_l^T / \sqrt{d}}-\frac{1}{L_k} \sum_{j=1}^{L_k} q_i k_j^T / \sqrt{d}$

  + 第一项是经典难题 log-sum-exp(LSE) 问题

  + 稀疏性度量 $M\left(q_i, K\right)$

    + $\ln L \leq M\left(q_i, K\right) \leq \max _j\left\{\frac{q_i k_j^T}{\sqrt{d}}\right\}-\frac{1}{L} \sum_{j=1}^L\left\{\frac{q_i k_j^T}{\sqrt{d}}\right\}+\ln L$

    + LSE 项用最大值来替代,即用和当前 qi 最近的 kj,所以才有下面取 top N 操作

      + $\bar{M}\left(\mathbf{q}_i, \mathbf{K}\right)=\max _j\left\{\frac{\mathbf{q}_i \mathbf{k}_j^{\top}}{\sqrt{d}}\right\}-\frac{1}{L_K} \sum_{j=1}^{L_K} \frac{\mathbf{q}_i \mathbf{k}_j^{\top}}{\sqrt{d}}$

+ $\mathcal{A}(\mathbf{Q}, \mathbf{K}, \mathbf{V})=\operatorname{Softmax}\left(\frac{\overline{\mathbf{Q}} \mathbf{K}^{\top}}{\sqrt{d}}\right) \mathbf{V}$

  + $\bar{Q}$ 是稀疏矩阵,前 u 个有值

+ 具体流程

  + 为每个 query 都随机采样 N 个 key,默认值是 5lnL

    + 利用点积结果服从长尾分布的假设,计算每个  query 稀疏性得分时,只需要和采样出的部分 key 计算

  + 计算每个 query 的稀疏性得分

  + 选择稀疏性分数最高的 N 个 query,N 默认值是 5lnL

  + 只计算 N 个 query 和所有 key 的点积结果,进而得到 attention 结果

  + 其余 L-N 个 query 不计算,直接将 self-attention 层输入取均值(mean(V))作为输出

    + 除了选中的 N 个query index 对应位置上的输出不同,其他 L-N 个 embedding 都是相同的。所以新的结果存在一部分冗余信息,也是下一步可以使用 maxpooling 的原因

    + 保证每个 ProbSparse self-attention 层的输入和输出序列长度都是 L
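
+ 代码示意:按上面的流程,用 PyTorch 写一个 ProbSparse 稀疏性度量与 Top-u query 选择的极简草图(单头,非官方实现,`factor`、`n_sample` 等命名均为示意)。

```python
import math
import torch

def probsparse_topu_queries(Q, K, factor=5):
    """ProbSparse 稀疏性度量简化示意:Q: [L_Q, d], K: [L_K, d]。"""
    L_Q, d = Q.shape
    L_K = K.shape[0]
    u = min(L_Q, int(math.ceil(factor * math.log(L_Q))))          # 被保留的 query 数
    n_sample = min(L_K, int(math.ceil(factor * math.log(L_K))))   # 每个 query 采样的 key 数
    idx = torch.randint(0, L_K, (n_sample,))                      # 随机采样 key(此处各 query 共享,作简化)
    scores = Q @ K[idx].T / math.sqrt(d)                          # [L_Q, n_sample]
    # M_bar(q_i, K) = max_j(q_i k_j^T / sqrt(d)) - mean_j(q_i k_j^T / sqrt(d)),LSE 项用最大值近似
    M = scores.max(dim=-1).values - scores.mean(dim=-1)
    return M.topk(u).indices                                      # 稀疏性最高的 u 个 query 下标
```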

+ 将时间和空间复杂度降为 $O(L_K \log L_Q)$

+ 如何解决「对于 multi-head self-attention 每个 head 都采用相同的优化策略」这一现象?

  + 每个 query 随机采样 key 这一步每个 head 的采样结果是相同的

  + 每一层 self-attention 都会先对 QKV 做线性转换,序列中同一个位置不同 head 对应的 query、key 向量不同

  + 最终每个 head 中得到的 N 个稀疏性最高的 query 也是不同的,相当于每个 head 都采取不同的优化策略

Self-attention distilling

  • 突出 dominating score,缩短每一层输入的长度,降低空间复杂度到 $\mathcal{O}((2-\epsilon) L \log L)$

  • encoder 层数加深,序列中每个位置的输出已经包含序列中其他元素的信息,所以可以缩短输入序列的长度

    • 过 attention 层后,大部分位置值相同
  • 激活函数 [[ELU]]

  • 通过 Conv1d + max-pooling layer 缩短序列长度

```python
import torch.nn as nn


class ConvLayer(nn.Module):
    """Informer 的 distilling 层:Conv1d + BatchNorm + ELU + MaxPool,将序列长度大致减半。"""

    def __init__(self, c_in):
        super(ConvLayer, self).__init__()
        self.downConv = nn.Conv1d(in_channels=c_in,
                                  out_channels=c_in,
                                  kernel_size=3,
                                  padding=2,
                                  padding_mode='circular')
        self.norm = nn.BatchNorm1d(c_in)
        self.activation = nn.ELU()
        self.maxPool = nn.MaxPool1d(kernel_size=3, stride=2, padding=1)

    def forward(self, x):
        # x: [B, L, c_in] -> Conv1d 需要 [B, c_in, L]
        x = self.downConv(x.permute(0, 2, 1))
        x = self.norm(x)
        x = self.activation(x)
        x = self.maxPool(x)       # 序列长度约减半
        x = x.transpose(1, 2)     # 还原为 [B, L', c_in]
        return x
```

Generative style decoder

  • 预测阶段通过一次前向得到全部预测结果,避免 dynamic decoding

  • 不论训练还是预测,Decoder 的输入序列分成两部分 $X_{\text{feed decoder}} = \operatorname{concat}(X_{\text{token}}, X_{\text{placeholder}})$(见本小节末尾的代码示意)

    • 预测时间点前一段已知序列作为 start token

    • 待预测序列的 placeholder 序列

  • 经过 decoder 后,每个 placeholder 位置都有一个向量,然后输入到一个全连接层得到预测结果

  • 为什么用 generative style decoder #card

    • 解码器能捕捉任意位置输出和长序列依赖关系

    • 避免累积误差
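
  • 代码示意:decoder 输入构造的极简草图(假设 `x_enc` 形状 [B, L, D],`label_len`、`pred_len` 为示意参数),体现「已知序列末尾一段做 start token + 预测部分用 0 占位,一次前向输出全部预测」的思路,非官方实现。

```python
import torch

def build_decoder_input(x_enc, label_len=48, pred_len=24):
    """Informer 式 decoder 输入构造简化示意。"""
    x_token = x_enc[:, -label_len:, :]                                        # start token:已知序列的最后一段
    x_placeholder = torch.zeros(x_enc.size(0), pred_len, x_enc.size(-1))      # 待预测位置的占位符
    return torch.cat([x_token, x_placeholder], dim=1)                         # [B, label_len + pred_len, D]
```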

Experiment

  • Baseline

  • 实验设计

    • Univariate Time-series Forecasting

    • Multivariate Time-series Forecasting

      • LSTnet 是基线模型
    • Ablation Study

Input representation

  • 提供时序信息

  • 不是天级别更新的模型需要 global time stamp

    • week,month,holiday embedding
  • 额外实验

    • 利用 t0-t1 的特征预测 t2-t3 结果还不错

    • 可能是 local time stamp 和 global time stamp 让 informer 不依赖自回归结果还能有不错的预测结果

See Also

Ref


@Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting

核心贡献

  • Temporal Fusion Transformer 框架 #card

    • recurrent layers for local processing
      ls-type:: annotation
      hl-page:: 1
      hl-color:: yellow

    • interpretable self-attention layers for long-term dependencie
      ls-type:: annotation
      hl-page:: 1
      hl-color:: yellow

    • specialized components to select relevant features
      ls-type:: annotation
      hl-page:: 1
      hl-color:: yellow

    • a series of gating layers to suppress unnecessary components
      ls-type:: annotation
      hl-page:: 1
      hl-color:: yellow

  • 模型可解释性 interpretable insights into temporal dynamics
    ls-type:: annotation
    hl-page:: 1
    hl-color:: yellow
    #card

    • 区分全局重要特征 globally-important variables for the prediction problem
      ls-type:: annotation
      hl-page:: 3
      hl-color:: yellow

    • 持久的时间模式 persistent temporal patterns
      ls-type:: annotation
      hl-page:: 3
      hl-color:: yellow

    • 显著事件 significant events
      ls-type:: annotation
      hl-page:: 3
      hl-color:: yellow

核心问题

  • #card [[Multi-horizon Forecasting]] 包含复杂的输入特征组合 contains a complex mix of inputs
    ls-type:: annotation
    hl-page:: 1
    hl-color:: green

    • 静态变量

      • 与时间无关的静态变量 including static (i.e. time-invariant) covariates
        ls-type:: annotation
        hl-page:: 1
        hl-color:: green
    • 时变变量 Time-dependent Inputs

      • 已知未来输入 known future inputs,
        ls-type:: annotation
        hl-page:: 1
        hl-color:: green

        • 未来节假日信息
      • 外生时间序列 exogenous time series that are only observed in the past – without any prior information on how they interact with the target.
        ls-type:: annotation
        hl-page:: 1
        hl-color:: green

        • 历史顾客流量 historical customer foot traffic
          ls-type:: annotation
          hl-page:: 2
          hl-color:: green
    • 相关示意图

      • [:span]
        ls-type:: annotation
        hl-page:: 2
        hl-color:: yellow
  • 使用 attention 机制增强 :-> 选择过去相关特征 used attention-based methods to enhance the selection of relevant time steps in the past
    ls-type:: annotation
    hl-page:: 2
    hl-color:: yellow

  • 之前基于 DNN 方法的缺陷 #card

    • 没有考虑不同类型输入特征 fail to consider the different types of inputs
      ls-type:: annotation
      hl-page:: 2
      hl-color:: blue

      • 万物皆时序 构建模型时,将所有的特征按 time step 直接 concat 在一起,所有变量全部扩展到所有的时间步,无论是静态、动态的变量都合并在一起送入模型。
    • 假定所有外生输入都已知与未来 assume that all exogenous inputs are known into the future
      ls-type:: annotation
      hl-page:: 2
      hl-color:: blue

    • 忽略重要的静态协变量 neglect important static covariates
      ls-type:: annotation
      hl-page:: 2
      hl-color:: blue

      • 通常处理方法是预测时和其他时间相关特征连接
  • 已有深度学习方法是黑箱,如何解释模型的预测结果?#card

    • do not shed light on how they use the full range of inputs present in practical scenarios
      ls-type:: annotation
      hl-page:: 1
      hl-color:: blue

相关工作

  • [[@A Multi-Horizon Quantile Recurrent Forecaster]] Multi-horizon Quantile Recurrent Forecaster MQRNN 结构,同时预测未来多个时间步的值

  • deep state space 状态空间模型,统计学,hybrid network,类似工作 [[ESRNN]] [[N-BEATS]]

  • [[Explainable AI]]

    • post-hoc methods 事后方法(因果方法),不考虑输入特征的时间顺序 do not consider the time ordering of input features
      ls-type:: annotation
      hl-page:: 3
      hl-color:: blue

    • 基于 attention 的架构对语言或语音序列有很好的解释,但是很难适用于多维度预测 attention-based architectures are proposed with inherent interpretability for sequential data
      ls-type:: annotation
      hl-page:: 3
      hl-color:: blue

解决方法

  • [[Multi-horizon Forecasting]]

    • prediction intervals [[区间预测]] #card

      • [[DeepAR]] 直接修改模型的输出,模型不拟合原始标签,而是拟合人工指定的分布,通过蒙特卡洛采样取平均得到最终的点预测。
    • 分位数回归 [[Quantile Regression]],每一个 time step 输出 $10^{th}$、$50^{th}$、$90^{th}$ 分位数 #card

      • 不同分位数下的预测能够产生预测区间,通过区间大小反映预测结果的不确定性。某个点在不同分位数回归下的预测结果很接近,则预测确定性高。

      • Quantile Outputs

      • $\hat{y}_i(q, t, \tau)=f_q\left(\tau, y_{i, t-k: t}, \boldsymbol{z}_{i, t-k: t}, \boldsymbol{x}_{i, t-k: t+\tau}, \boldsymbol{s}_i\right)$

      • 设计 [[quantile loss]]

        • $\mathcal{L}(\Omega, \boldsymbol{W})=\sum_{y_t \in \Omega} \sum_{q \in \mathcal{Q}} \sum_{\tau=1}^{\tau_{\max }} \frac{Q L\left(y_t, \hat{y}(q, t-\tau, \tau), q\right)}{M \tau_{\max }}$(quantile loss 的代码示意见本小节末尾)

          • $Q L(y, \hat{y}, q)=q(y-\hat{y})_{+}+(1-q)(\hat{y}-y)_{+}$ #card
            • q 代表分位数

            • $(*)_{+}=\max (0, *)$

            • 假设拟合分位数 0.9

          + $Q L(y, \hat{y}, q=0.9)=\max (0.9 *(y-\hat{y}), 0.1 *(\hat{y}-y))$

          + $y-\hat{y} \gt 0$ 模型预测偏小,Loss 增加更多

          + loss 中权重 9:1,模型倾向预测出大的数字,Loss 下降快

        + 假设拟合分位数 0.5,退化成 MAE

          + $Q L(y, \hat{y}, q=0.5)=\max (0.5 *(y-\hat{y}), 0.5 *(\hat{y}-y)) = 0.5*|y-\hat{y}|$

+ q-Risk 避免不同预测点下的预测量纲不一致问题,对结果做正则化处理。目前只关注 P50 和 P90 两个分位数 #card
  + $q$-Risk $=\frac{2 \sum_{y_t \in \tilde{\Omega}} \sum_{\tau=1}^{\tau_{\max }} Q L\left(y_t, \hat{y}(q, t-\tau, \tau), q\right)}{\sum_{y_t \in \tilde{\Omega}} \sum_{\tau=1}^{\tau_{\max }}\left|y_t\right|}$
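
+ 代码示意:quantile loss 的极简 PyTorch 草图(假设 `y_hat` 的最后一维为各分位数各自的输出头),非 TFT 官方实现,仅用于说明上面的公式。

```python
import torch

def quantile_loss(y, y_hat, quantiles=(0.1, 0.5, 0.9)):
    """y: [B, T];y_hat: [B, T, n_quantiles],每个分位数一个输出。"""
    losses = []
    for i, q in enumerate(quantiles):
        diff = y - y_hat[..., i]
        # q*(y - y_hat)_+ + (1-q)*(y_hat - y)_+ 等价于 max(q*diff, (q-1)*diff)
        losses.append(torch.max(q * diff, (q - 1) * diff))
    return torch.mean(torch.stack(losses))
```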
  • 模型结构 #card

    • [:span]
      ls-type:: annotation
      hl-page:: 6
      hl-color:: yellow
  • 输入部分

    • [[Static Covariate Encoders]] 通过 GRN 将静态特征编码变成 4 个不同向量

    • 动态特征 #card

      • post inputs

      • known future inputs

    • [[Variable Selection Networks]] 通过选择重要的特征,减少不必要的噪音输入,以提高模型性能。 #card

      • [[GLU]] 灵感来自 LSTM 的门控机制,sigmoid 取值范围 0-1
    • 对不同类型的输入变量应该区别对待 #card

      • 静态变量通过特殊的 [[Static Covariate Encoders]],后续做为 encoder 和 decoder 的输入

      • 过去的动态时变变量+动态时不变变量进入 encoder 结构中(蓝色 variable seletcion)

      • 未来的动态时不变变量进入 decoder 结构中

    • seq2seq with teacher forcing 架构 #card

      • encoder 部分动态特征 embedding 和静态特征 embedding concat 在一起做为输入

        • 静态变量 + 动态时变变量
      • decoder

        • 静态变量 + 动态时不变变量
  • 模型组成

    • [[Gated Residual Network]] 模型能够灵活地仅在需要时应用非线性处理 #card

      • 外生输入和目标之间的确切关系通常是事先未知的,因此很难预见那些变量是相关的。

      • 很难确定非线性处理的程度该多大,并且可能存在更简单的模型就能满足需求。

    • [[Interpretable Multi-Head Attention]]

    • [[Temporal Fusion Decoder]] 学习数据集中的时间关系

    • 通过 dense 层得到多个 [[Quantile Outputs]] #card

      • $\hat{y}(q, t, \tau)=\boldsymbol{W}_q \tilde{\boldsymbol{\psi}}(t, \tau)+b_q$

[[TFT Interpretability Use Cases]] #card

  • 输入特征重要性 examining the importance of each input variable in prediction
    ls-type:: annotation
    hl-page:: 17
    hl-color:: yellow

  • 可视化当前时间模式 visualizing persistent temporal patterns
    ls-type:: annotation
    hl-page:: 17
    hl-color:: yellow

  • 识别导致任何导致时间动态显著变化的时间 identifying any regimes or events that lead to significant changes in temporal dynamics
    ls-type:: annotation
    hl-page:: 17
    hl-color:: yellow

[[Ref]]


@Order Fulfillment Cycle Time Estimation for On-Demand Food Delivery

[[Abstract]]

  • 送达时间 Order Fulfillment Cycle Time (OFCT)

  • a novel post-processing layer

    • accounting for the distributional mismatch between the true OFCT values and those predicted by the model at initialization
  • By providing customers with conveniences such as easy access to an extensive variety of restaurants, effortless food ordering and fast delivery, on-demand food delivery (OFD) platforms have achieved explosive growth in recent years. A crucial machine learning task performed at OFD platforms is prediction of the Order Fulfillment Cycle Time (OFCT), which refers to the amount of time elapsed between a customer places an order and he/she receives the meal. The accuracy of predicted OFCT is important for customer satisfaction, as it needs to be communicated to a customer before he/she places the order, and is considered as a service promise that should be fulfilled as well as possible. As a result, the estimated OFCT also heavily influences planning decisions such as dispatching and routing.

[[Attachments]]

相关工作

  • ETA 分类

    • Route-based

    • OD-based

  • OFCT 比其他 ETA 独特的挑战

    • Significantly more influencing factors

      • location and characteristics of the restaurant

      • the food preparation time

      • the orders assigned to the same courier (which will alter his/her delivery itinerary) 骑手配送行为

    • Unavailability of critical information 预估时缺少关键信息

      • 分配到订单的骑手

      • 最终的路线

  • Time Window Assignment Problem

Background

  • 城市划分成区域,区域内配送站

    • 订单终点对应一个指定的 poi 类型(学校、写字楼等)
  • 订单生命周期划分 PT(下单到取到餐),DT(取餐到送达),OFCT(下单到送达),CT(下单到做好)

  • 流程

    • 用户下单时,预估 OFCT

      • dispatching 分单

      • routing 路径规划

        • VRPTW
    • 骑手到达餐厅取餐

    • 骑手配送

Feature

  • 订单特征

    • Spartial feature 空间特征

      • 起点终点 id 和坐标,city/grid id
    • Temporal feature 时间特征

      • hour of the day

      • workday

    • Order size features 订单大小影响商家出餐速度

      • sku_num

      • price

        • sku_num 低估套餐的大小
  • Aggregation Features 聚合特征

    • 通过手机传感器收集信息(gps,wifi,bluetooth),得到 dt/pt/ofct

    • 对上面的信息进行聚合:20 分位数、平均、标准差、80 分位数

    • 通过一定规则选取订单集合,在集合中再分组计算统计信息。

      • 相同 city/grid/restaurant/hour of day 的订单

      • 发单前不同时长完成订单

  • Dish Features 餐品特征

    • sku 分配 uid,订单可以用包含的 uid 表示

    • 训练分类模型( 116 类),使用预测的分类结果

      • sku 标题做为输入,人工标签

      • fine-tuning 中文 Bert

  • Cooking Time Features

    • 出餐时间很难获取真值,从历史数据中挖掘。

    • 三种场景

      • 骑手到达太早,需要等到出餐完成

      • 骑手需要在同一个餐厅取多个订单,只有在餐厅完成最后一个订单才能离开

      • 骑手达到时,如果餐厅完成出餐可以立即进行配送统计

    • 聚合 departure time from the restaurant 得到 ct

  • Supply-and-Demand Features 供需特征

    • 供需比 demand-to-supply ratio (DSR) $\mathrm{DSR}_{o}=|C|^{-1}\left|O_{o}^{\text{uncompleted}}\right|$

      • 订餐需求在时间和空间上都会有突变,此时通过合并订单配送策略减轻影响。合单策略提升整体配送效率,但是单个订单配送时间变长(骑手需要绕路去取或送)。通过 DSR 来描述这种现象

      • $O_{o}^{\text{uncompleted}}$ 是当前未完成订单集合,DSR 表示当前未完成订单数和活跃配送员数的比例

    • 配送比 dropoff_ratio $o_{o}=\left|O_{o}^{\text{uncompleted}}\right|^{-1}\left|O_{o}^{\text{dropoff}}\right|$

      • 未完成订单中,已经取餐在配送中的订单占比
    • 餐厅出餐量 unfetched_order_count

      • 骑手未取订单数,值越大需要等待的时间越大
    • 单量预测相关,尝试后发现没用

      • 单量预测模型中使用的特征 OFCT 预估模型也使用,没有带来新的信息增益。
    • 图例

      • 图 (e):DSR 越大,Bundle 越大;图 (g):对应的 OFCT 变大

      • DSR 一定时,dropoff ratio 越大,代表快要释放的运力更多,于是 bundle 变小,OFCT 变小。

  • Courier Features 骑手特征

    • 预测 OFCT 时,送单的骑手还没有确定,通过预测模型对附近的骑手进行排序,取 top-ranked 骑手做为派单对象。

    • 相关特征

      • 取单距离

      • work load 负载,骑手当前未完成订单数

      • urgency 紧急程度,骑手需要配送订单平均或最小剩余时间

      • Mutual Distances 订单终点、餐厅、骑手当前订单的终点的最短距离。距离越大,骑手越有可能绕路。

  • ETA-based Drop-off Time Feature 多维度相似订单的配送 ETA

    • 利用回归模型来学习配送段 ETA 无法很好处理长尾不规律 case,比如高峰期等待电梯

      • 利用历史订单作为配送段时间预估的语料

      • 历史订单用多维特征来表示

      • 新订单通过 k 近邻搜索出相似的历史订单

      • 对相似的历史订单真实配送段时间加权平均,作为新订单的预估配送段时间

    • 通过 OD-based ETA 方法(TEMP+R) 预估餐厅到终点的配送时间特征 DTo

      • $\mathbf{x}_{\widetilde{o}}^{\text{eta}}=\left[\operatorname{lon}\left(r_{\widetilde{o}}\right), \operatorname{lat}\left(r_{\widetilde{o}}\right), \operatorname{lon}\left(p_{\widetilde{o}}\right), \operatorname{lat}\left(p_{\widetilde{o}}\right), \operatorname{hr}\left(a_{\widetilde{o}}^{\text{creation}}\right)\right]^{T} \odot w$

        • 用经纬度+时间组成的向量表示订单,w 表示向量中每一项的权重
      • $\operatorname{dissim}(\widetilde{o}, \widehat{o})=\left\|\mathbf{x}_{\widetilde{o}}^{\text{eta}}-\mathbf{x}_{\widehat{o}}^{\text{eta}}\right\|_{2}$

        • 用 Euclidean distance 表示两个不同订单之间的距离
      • $\left(\sum_{i=1}^{k} w_{i}\right)^{-1} \sum_{i=1}^{k}\left(w_{i} \cdot \mathrm{DT}_{o_{i}}\right)$

        • 在历史订单中取 k 个最相似订单,构建该特征

        • $w_{i}=\exp \left(-\frac{\operatorname{dissim}\left(o_{i}, o\right)}{\sigma^{2}}\right)$ 代表权重
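
    • 代码示意:按上面三个公式构造 ETA-based drop-off time 特征的极简草图(`x_new`、`X_hist`、`dt_hist`、`w`、`sigma` 等名称均为示意用假设,非论文官方实现)。

```python
import numpy as np

def eta_dropoff_feature(x_new, X_hist, dt_hist, w, k=10, sigma=1.0):
    """x_new: 新订单向量 [lon_r, lat_r, lon_p, lat_p, hour];
    X_hist: 历史订单矩阵 [N, 5];dt_hist: 历史订单真实配送段时间 [N];w: 各维权重。"""
    x_new = x_new * w
    X_hist = X_hist * w
    dissim = np.linalg.norm(X_hist - x_new, axis=1)           # 欧氏距离
    idx = np.argsort(dissim)[:k]                              # k 个最相似历史订单
    weights = np.exp(-dissim[idx] / sigma ** 2)               # 高斯核权重
    return np.sum(weights * dt_hist[idx]) / np.sum(weights)   # 加权平均作为特征
```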

  • Meteorological Features 气象

    • 数值特征

      • 气温

      • 空气质量

      • 风速

    • 类别特征

      • 天气

Model

  • 网络结构

  • [[Feature Engineering]]

    • 数值特征 归一化+ ELU 升维

      • 维度太低限制神经网络表达
    • 类别特征 embedding + dropout 避免过拟合

      • 一个类别多个id 对应 embedding 进行 average pooling
    • 骑手特征

      • 召回 pickup 距离在 X 米范围内的骑手组成骑手集合

      • re-ranking module Query Invariant Listwise Context Modeling (QILCM)

        • 对于每个骑手 c 生成表示向量 hc 以及排序分 sc,利用加权得分表示骑手向量

          • $\sum_{\forall c \in C^{\text{candidate}}} s_{c} \tilde{\mathbf{h}}_{c}$
  • Postprocessing 后处理

    • 模型收敛速度慢,预测效果差

      • 模型拟合的数据分布与真实履约时间的分布发生偏移,真实的履约时间实际上是一个右偏长尾的分布,相当于有一小部分订单真实的配送时间偏长而模型没有学习到。
    • $\mathrm{OFCT}_o^{\text{predict}}=\mu^{\text{real}}+\sigma^{\text{real}} \frac{y_o^{\text{out}}-\mu^{\text{ini}}}{\sigma^{\text{ini}}}$(代码示意见本小节末尾)

      • 实现拟合和缩放

      • y_out 是模型输出

      • real 是根据真实数据统计的均值和方差

      • ini 是整个训练集的均值和方差

    • 自适应 [[Box-Cox Transformation]] 逆变换

      • 更加能够拟合长尾数据分布
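
    • 代码示意:上述后处理层的极简草图,按我的理解把模型输出按「初始化时输出分布的统计量」标准化,再映射到真实 OFCT 的均值/方差上;参数命名为示意,非论文官方实现。

```python
def rescale_ofct(y_out, mu_real, sigma_real, mu_ini, sigma_ini):
    """y_out: 模型原始输出;mu_ini/sigma_ini: 训练集上初始输出的均值/标准差;
    mu_real/sigma_real: 真实 OFCT 的均值/标准差。"""
    return mu_real + sigma_real * (y_out - mu_ini) / sigma_ini
```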
  • Objective Function

    • $\ell_{\text{predict}}+\lambda \ell_{\text{rank}}$

      • ranking loss + predictive loss
    • $\ell_{\text{predict}}=\left|\mathrm{OFCT}_o^{\text{predict}}-\mathrm{OFCT}_o\right|$

    • $\ell_{\text{rank}}=-\sum_{\forall c \in C^{\text{candidate}}}\left(\mathbb{I}\left(c=c_o\right) \log \left(s_c\right)\right)$

      • cross entropy loss

      • $\mathbb{I}(\cdot)$ 是指示函数

Experiments

  • Ablation Analysis of Feature Groups 消融实验

    • 订单特征、供需特征

  • 不同模型对比

    • GBM

    • [[DeepFM]] [[@xDeepFM: Combining Explicit and Implicit Feature Interactions for Recommender Systems]]

      • 简单 concatenation&MLP 策略,不适合 OFCT 任务
    • [[TEMP+R]] [[MURAT]] OD-based ETA methods

Conclusion

  • 可以改进的点

    • cooking time features

    • weather prediction

Ref


@TabNet: Attentive Interpretable Tabular Learning

[[Abstract]]

  • We propose a novel high-performance and interpretable canonical deep tabular data learning architecture, TabNet. TabNet uses sequential attention to choose which features to reason from at each decision step, enabling interpretability and more efficient learning as the learning capacity is used for the most salient features. We demonstrate that TabNet outperforms other neural network and decision tree variants on a wide range of non-performance-saturated tabular datasets and yields interpretable feature attributions plus insights into the global model behavior. Finally, for the first time to our knowledge, we demonstrate self-supervised learning for tabular data, significantly improving performance with unsupervised representation learning when unlabeled data is abundant.

[[Attachments]]


@FiBiNET: Combining Feature Importance and Bilinear feature Interaction for Click-Through Rate Prediction

[[Abstract]]

  • Advertising and feed ranking are essential to many Internet companies such as Facebook and Sina Weibo. Among many real-world advertising and feed ranking systems, click through rate (CTR) prediction plays a central role. There are many proposed models in this field such as logistic regression, tree based models, factorization machine based models and deep learning based CTR models. However, many current works calculate the feature interactions in a simple way such as Hadamard product and inner product and they care less about the importance of features. In this paper, a new model named FiBiNET as an abbreviation for Feature Importance and Bilinear feature Interaction NETwork is proposed to dynamically learn the feature importance and fine-grained feature interactions. On the one hand, the FiBiNET can dynamically learn the importance of features via the Squeeze-Excitation network (SENET) mechanism; on the other hand, it is able to effectively learn the feature interactions via bilinear function. We conduct extensive experiments on two realworld datasets and show that our shallow model outperforms other shallow models such as factorization machine(FM) and field-aware factorization machine(FFM). In order to improve performance further, we combine a classical deep neural network(DNN) component with the shallow model to be a deep model. The deep FiBiNET consistently outperforms the other state-of-the-art deep models such as DeepFM and extreme deep factorization machine(XdeepFM).

[[Attachments]]

创新点 #card

  • 使用 SENET 层对 embedding 进行加权

  • 使用 Bilinear-Interaction Layer 进行特征交叉

背景

  • importance of features and feature interactions
    网络结构 #card
    image.png

[[SENET]] Squeeze-and-Excitation network 特征加权方法

  • Squeeze f*k 压缩成 f*1
  • Excitation 通过两层 DNN 得到权重 f*1
  • Re-Weight 将结果和原始 f*k 相乘。
  • 原理是想通过控制scale的大小,把重要的特征增强,不重要的特征减弱,从而让提取的特征指向性更强。
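
  • 代码示意:SENET 特征加权的极简 PyTorch 草图(输入 [B, f, k] 的 field embedding,Squeeze → 两层 DNN Excitation → Re-Weight),非论文官方实现,`reduction` 等参数为示意。

```python
import torch
import torch.nn as nn

class SENETLayer(nn.Module):
    def __init__(self, num_fields, reduction=3):
        super().__init__()
        self.excitation = nn.Sequential(
            nn.Linear(num_fields, max(1, num_fields // reduction)),
            nn.ReLU(),
            nn.Linear(max(1, num_fields // reduction), num_fields),
            nn.ReLU(),
        )

    def forward(self, emb):                 # emb: [B, f, k]
        z = emb.mean(dim=-1)                # Squeeze: [B, f]
        a = self.excitation(z)              # Excitation: 每个 field 一个权重 [B, f]
        return emb * a.unsqueeze(-1)        # Re-Weight: [B, f, k]
```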
    [[Bilinear-Interaction Layer]] :-> 结合Inner Product和Hadamard Product方式,并引入额外参数矩阵W,学习特征交叉。
  • 不同特征之间交叉 v_i * W * v_j 时,权重矩阵来源
    • Field-All Type :-> 全体共享 W
      • 参数量 :-> feature * embedding + emb*emb
    • Field-Each Type :-> 每个 field 一组 W
      • 参数量 :-> feature * embedding + feature*emb*emb
    • Field-Interaction Type :-> 不同特征对之间各有一组 W
      • 参数量 :-> feature * embedding + feature*feature*emb*emb
  • Bilinear-Interaction Layer 对比 [[FFM]] 有效减少参数量
    • FM 参数量 :-> feature * embedding
    • FFM 参数量 :-> feature * field * embedding
  • 不同特征交叉方式

image.png
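代码示意:Bilinear-Interaction 的 Field-All Type 极简草图(所有特征对共享一个 W,$p_{ij} = (v_i \cdot W) \odot v_j$,输入 [B, f, k]),非论文官方实现。

```python
import torch
import torch.nn as nn

class BilinearInteraction(nn.Module):
    def __init__(self, emb_dim):
        super().__init__()
        self.W = nn.Parameter(torch.randn(emb_dim, emb_dim) * 0.01)   # 全体特征对共享的 W

    def forward(self, emb):                                    # emb: [B, f, k]
        f = emb.size(1)
        outs = []
        for i in range(f):
            for j in range(i + 1, f):
                outs.append((emb[:, i] @ self.W) * emb[:, j])  # 先乘 W,再与 v_j 做 Hadamard 积
        return torch.stack(outs, dim=1)                        # [B, f*(f-1)/2, k]
```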
[[ETA]] fm 交叉部分可以尝试引入 bi layer,使用 link 状态组合 W。

  • 但是路况状态可能会改变

Ref


@Applying Deep Learning To Airbnb Search

[[Abstract]]

  • The application to search ranking is one of the biggest machine learning success stories at Airbnb. Much of the initial gains were driven by a gradient boosted decision tree model. The gains, however, plateaued over time. This paper discusses the work done in applying neural networks in an attempt to break out of that plateau. We present our perspective not with the intention of pushing the frontier of new modeling techniques. Instead, ours is a story of the elements we found useful in applying neural networks to a real life product. Deep learning was steep learning for us. To other teams embarking on similar journeys, we hope an account of our struggles and triumphs will provide some useful pointers. Bon voyage!

[[Attachments]]

记录 Airbnb 深度模型探索历程。

业务:顾客查询后返回一个有序的列表(Listing,对应房间)。

深度模型之前使用 GBDT 对房子进行打分。

Model Evolution

  • 评价指标 [[NDCG]]

  • Simple NN

    • 单隐层 32 个 ReLU 节点的 NN,使用与 GBDT 相同的特征和优化目标,效果与 GBDT 相当

    • 打通深度模型训练和线上预测的 pipeline。

  • LambdaRank NN

    • [[LambdaRank]] 直接优化 NDCG

    • 采用 pairwise 的训练方式,构造 <被预定的房间,未被预定的房间> 的训练样本

    • pairwise loss 乘上 item 对调顺序带来的指标变化 NDCG,关注列表中靠前位置的样本

      • 比如 booked listing 从 2 到 1 的价值比 books listing 从 10 到 9 的意义大。
  • Decision Tree/Factorization Machine NN

+ GBDT 叶子节点位置(类别特征) + FM 预测结果放到 NN 中。
  • Deep NN

    • 10 倍训练数据,195 个特征(离散特征 embedding 化),两层神经网络:127 fc + relu 以及 83 fc + relu。

    • 部分 dnn 的特征是来自其他模型,之间在 [[@Real-time Personalization using Embeddings for Search Ranking at Airbnb]] 里面提到的特征也有使用。

    • 在训练样本达到 17 亿后,ndcg 在测试集和训练集的表现一致,这样一来可以使用线下的效果来评估线上的效果?

    • 深度模型在图像分类任务上已经超过人类的表现,但是很难判断是否在搜索任务上也超过人类。一个关键是很难定义搜索任务中的人类能力。

Failed Models

  • Listing ID

    • listing id 进行 embedding,但是出现过拟合。

      • embedding 需要每个物品拥有大量数据进行训练,来挖掘他们的价值

      • 部分 Listing 有一些独特的属性,需要一定量的数据进行训练。

  • Multi-task learning

    • booking 比 view 更加稀疏,long view 和 booking 是有相关的。

    • 两个任务 Booking Logit 和 Long View Logit,共享网络结构。两个指标在数量级上有差异,为了更加关注 booking 的指标,long view label 乘上 log(view_duration)。

      • 线上实验中,页面浏览的时间提高,但是 booking 指标基本不变。

      • 作者分析长时间浏览一个页面可能是高档房源或页面描述比较长。

  • [[Feature Engineering]]

    • GBDT 常用的特征工程方法:计算比值,滑动窗口平均以及特征组合。

    • NN 可以通过隐层自动进行特征交叉,对特征进行一定程度上的处理可以让 NN 更加有效。

  • Feature [[Normalization]]

    • NN 对数值特征敏感,如果输入的特征过大,反向传播时梯度会很大。

    • 正态分布

    • $(feature\_val - \mu)/\sigma$

    • power law distribution

    • $\log \frac{1+feature\_val}{1+median}$
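
    • 代码示意:上面两种特征变换的极简草图(正态分布特征做标准化,幂律分布特征做 log 变换),仅为示意。

```python
import numpy as np

def normalize_feature(x, dist="normal"):
    """dist='normal' 时做 (x - mu) / sigma;否则做 log((1 + x) / (1 + median))。"""
    x = np.asarray(x, dtype=float)
    if dist == "normal":
        return (x - x.mean()) / x.std()
    return np.log((1.0 + x) / (1.0 + np.median(x)))
```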

  • Feature distribution

    • 特征分布平滑

    • 是否存在异常值?

    • 更容易泛化,保证高层输出有良好的分布

  • [[Hyperparameters]]

    • [[Dropout]] 看成是一种数据增强的方法,模拟了数据中会出现随机缺失值的情况。drop 之后可能会导致样本不再成立,分散模型注意力的无效场景。

      • 替代方案,根据特征分布人工产生噪音加入训练样本,线下有效果,线上没有效果。
    • [[神经网络参数全部初始化为0]] 没有效果,[[Xavier Initialization]] 初始化参数,Embedding 使用 -1 到 1 的随机分布。

    • Learning rate 默认参数 [[Adam]] 效果不太好,使用 [[LazyAdam]] 在较大 embedding 场景下训练速度更快

  • Feature Importance [[可解释性]]

    • Score Decomposition 将 NN 的分数分解到特征上。[[GBDT]] 可以这样做。

    • Ablation Test 每次训练一个模型删除一个特征。问题是模型可以从剩余的特征中弥补出缺失的特征。

    • Permutation Test 选定一个特征,随机生成值。[[Random Forests]] 中常用的方法。新生成的样本可能和现实世界中的分布不同。一个特征可能和其他特征共同作用产生效果。

    • TopBot Analysis 分析排序结果 top 和 bot 的单独特征分布

      • 左边代表房子价格的分布,top 和 bot 的分布存在明显不同,代表模型对价格敏感。

      • 右边代表页面浏览量的分布,top 和 bot 的分布接近,说明模型没有很好利用这个特征。

奇怪的东西

  • 论文中一直在引用 [[Andrej Karpathy]] 的建议:don’t be a hero

Transformer

[[Abstract]]

  • The dominant sequence transduction models are based on complex recurrent or convolutional neural networks that include an encoder and a decoder. The best performing models also connect the encoder and decoder through an attention mechanism. We propose a new simple network architecture, the Transformer, based solely on attention mechanisms, dispensing with recurrence and convolutions entirely. Experiments on two machine translation tasks show these models to be superior in quality while being more parallelizable and requiring significantly less time to train. Our model achieves 28.4 BLEU on the WMT 2014 Englishto-German translation task, improving over the existing best results, including ensembles, by over 2 BLEU. On the WMT 2014 English-to-French translation task, our model establishes a new single-model state-of-the-art BLEU score of 41.8 after training for 3.5 days on eight GPUs, a small fraction of the training costs of the best models from the literature. We show that the Transformer generalizes well to other tasks by applying it successfully to English constituency parsing both with large and limited training data.

[[Attachments]]

模型结构

Encoder and Decoder Stacks

  • [[Transformer/Encoder]]

    • 每层包含两个子层,分别是 multi-head self-attention mechanism 以及 positionwise fully connected feed-forward network(其实就是一维卷积)

    • 每一层间通过 residual connection 以及 [[Layer Normalization]]。表示成 LayerNorm(x+Sublayer(x))。 [[Post Norm]]

    • 所以内部连接层的大小都是 $d_{model}=512$

  • [[Transformer/Decoder]]

    • 第一个子层是 multi-head self-attention,与 encoder 不同的地方在与为了避免数据泄漏,计算位置 i 的 self attention 时,需要 mask 位置 i 之后的序列。

      • q,k,v 是上一层的输出
    • 第二个子层是 multi-head attention,用来捕获 encoder 的 output 和 decoder 的 output 之间的注意力关系。

      • k 和 v 是来自 encoder,q 来自上一层 decoder
    • 第三个子层和 Encoder 一样,是 positionwise fully connected feed-forward network

    • 训练和预测

      • 使用 teacher forcing 机制训练的 seq2seq 结构

Input Embedding

  • 为什么获取输入词向量后需要对矩阵乘以 embedding size 的平方根($\sqrt{d_{model}}$)?意义是什么?

    • embedding matrix 的初始化方法是 [[Xavier Initialization]],方差是 1/embedding size,通过 scale 更有利于matrix收敛
  • 为什么不直接使用标准正态分布?

    • tied-embedding 输入词向量层和输出词向量层(softmax 前的线性层)共享参数

    • 线性层需要用 [[Xavier Initialization]] 初始化

[[Multi-Head Attention]]

  • Attention 是 :-> [[scaled dot-product attention]]
    [[Position-wise Feed-Forward Networks]]
    Embeddings and Softmax

  • weight typing

    • seq2seq 语言模型中绑定 embedding 权重和 decoder softmax 之前的线性层权重矩阵。

    • 可以减少参数量,压缩模型大小,embedding 训练更加充分,缓解每次更新少数词 embedding 的情况。

  • 权重共享

    • decoder 和 encoder embedding 层
      • 原始任务是 English-German 翻译,词表使用 bpe 处理,最小单元是 subword。两种语言有相同地 subword
      • 不同语言之间可以共享对数词或符号的建模
    • decoder 中 embedding 层 和 fc 层权重
      • embedding 层通过词的 one-hot 去取对应的向量

      • fc 层通过向量计算每个词的 softmax 概率,与该词 embedding 相同的那一行对应的点积和 softmax 概率会是最大的。

        • 两数和相同的情况下,两数相等对应的积最大。
  • 每个权重放大 $\sqrt{d_{model}}$ 倍

    • [[Xavier Initialization]] Embedding 的方差是 1/d_model,当 d_model 值非常大时,矩阵中的每一个值都会减小。

      • [[Positional Encoding]] 是通过三角函数算出来的,值域为 -1 到 1。需要放大 embedding 数值,否则规模不一致相加后会丢失信息。 [[BERT]] 中不需要调整

使用 self-attention 相当于处理[[词袋模型]],忽视单词之间位置关系,通过引入 [[Position Representation]] 弥补模型缺陷。

  • [[Positional Encoding/Sinusoidal]]

  • Positional Embedding

    • 将位置转化成 embedding,然后当成参数学习

    • 论文中的实验表明两种方式的效果相差不大

      • 参数数量增加,模型学习难度增大,需要使用更多数据训练

      • 无法解决预测文本长度超过训练文本长度的情况, [[BERT]] 强制文本长度 512

  • 单词 embedding 和对应位置表示直接相加
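
  • 代码示意:Sinusoidal 位置编码的极简草图,取值范围 [-1, 1];按上文,词向量乘以 $\sqrt{d_{model}}$ 后再与位置编码相加(假设 d_model 为偶数,非官方实现)。

```python
import math
import torch

def sinusoidal_positional_encoding(max_len, d_model):
    pos = torch.arange(max_len).unsqueeze(1).float()
    div = torch.exp(torch.arange(0, d_model, 2).float() * (-math.log(10000.0) / d_model))
    pe = torch.zeros(max_len, d_model)
    pe[:, 0::2] = torch.sin(pos * div)   # 偶数维用 sin
    pe[:, 1::2] = torch.cos(pos * div)   # 奇数维用 cos
    return pe

# 使用方式:x = embedding(tokens) * math.sqrt(d_model) + pe[:seq_len]
```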

[[Transformer/why self-attention]]

[[Transformer/train]]


@Learning Piece-wise Linear Models from Large Scale Data for Ad Click Prediction

[[Abstract]]

  • CTR prediction in real-world business is a difficult machine learning problem with large scale nonlinear sparse data. In this paper, we introduce an industrial strength solution with model named Large Scale Piece-wise Linear Model (LS-PLM). We formulate the learning problem with L1 and L2,1 regularizers, leading to a non-convex and non-smooth optimization problem. Then, we propose a novel algorithm to solve it efficiently, based on directional derivatives and quasi-Newton method. In addition, we design a distributed system which can run on hundreds of machines parallel and provides us with the industrial scalability. LS-PLM model can capture nonlinear patterns from massive sparse data, saving us from heavy feature engineering jobs. Since 2012, LS-PLM has become the main CTR prediction model in Alibaba’s online display advertising system, serving hundreds of millions users every day.

[[Attachments]]

分片线性方式对数据进行拟合,将空间分成多个区域,每个区域使用线性的方式进行拟合,最后的输出变为多个子区域预测值的加权平均。

相当于对多个区域做一个 [[Attention]]

结构与三层神经网络类似

Model

处理大规模稀疏非线性特征

LS-PLM 模型学习数据的非线性特征。

question 为什么 LR 模型不能区分下面的数据,如何区分数据?[[SVM]][[FM]]

$p(y=1 \mid x)=g\left(\sum_{j=1}^{m} \sigma\left(u_{j}^{T} x\right) \eta\left(w_{j}^{T} x\right)\right)$

u 和 w 都是 d 维向量

m 为划分 region 数量

一般化使用:

$p(y=1 \mid x)=\sum_{i=1}^{m} \frac{\exp \left(u_{i}^{T} x\right)}{\sum_{j=1}^{m} \exp \left(u_{j}^{T} x\right)} \cdot \frac{1}{1+\exp \left(-w_{i}^{T} x\right)}$

可以把上面的模型看成是三层神经网络
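
代码示意:按上面的一般化公式,LS-PLM 预测的极简草图(softmax 门控给出 m 个 region 的权重,再对各 region 的 LR 输出加权),仅为示意,非论文官方实现。

```python
import numpy as np

def ls_plm_predict(x, U, W):
    """x: [d];U/W: [m, d],分别是门控参数和各 region 的 LR 参数。"""
    gate = np.exp(U @ x)
    gate = gate / gate.sum()                  # softmax 分片权重
    lr = 1.0 / (1.0 + np.exp(-(W @ x)))       # 每个 region 的 sigmoid 输出
    return float(gate @ lr)                   # 加权平均得到 p(y=1|x)
```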

Regularization

  • $\arg \min _{\Theta} f(\Theta)=\operatorname{loss}(\Theta)+\lambda\|\Theta\|_{2,1}+\beta\|\Theta\|_{1}$

  • L1 和常规一样,保持参数的稀疏性。

  • L2 如下面的公式,对每一个 feature 的参数进行二阶正则,然后累加。最优化的过程中,L2 项越来越小,相当于做特征选择。每一个特征不止一个参数,只有某一个特征的全部参数都为 0 ,代表这个特征是没有用的。

  • $\|\Theta\|_{2,1}=\sum_{i=1}^{d} \sqrt{\sum_{j=1}^{2 m} \theta_{i j}^{2}}$

  • 正则后的效果:


@wait 后面如何求解这损失函数以及工程实现待看。

[[Ref]]


@Item2Vec: Neural Item Embedding for Collaborative Filtering

[[Abstract]]

  • Many Collaborative Filtering (CF) algorithms are item-based in the sense that they analyze item-item relations in order to produce item similarities.

  • Recently, several works in the field of Natural Language Processing (NLP) suggested to learn a latent representation of words using neural embedding algorithms. Among them, the Skip-gram with Negative Sampling (SGNS), also known as [[word2vec]], was shown to provide state-of-the-art results on various linguistics tasks.

  • In this paper, we show that item-based CF can be cast in the same framework of neural word embedding.

    • Inspired by SGNS, we describe a method we name item2vec for item-based CF that produces embedding for items in a latent space.

    • The method is capable of inferring item-item relations even when user information is not available.

  • We present experimental results that demonstrate the effectiveness of the item2vec method and show it is competitive with SVD.

[[Attachments]]


@Real-time Personalization using Embeddings for Search Ranking at Airbnb

[[Abstract]]

  • Search Ranking and Recommendations are fundamental problems of crucial interest to major Internet companies, including web search engines, content publishing websites and marketplaces. However, despite sharing some common characteristics a one-size-fitsall solution does not exist in this space. Given a large difference in content that needs to be ranked, personalized and recommended, each marketplace has a somewhat unique challenge. Correspondingly, at Airbnb, a short-term rental marketplace, search and recommendation problems are quite unique, being a two-sided marketplace in which one needs to optimize for host and guest preferences, in a world where a user rarely consumes the same item twice and one listing can accept only one guest for a certain set of dates. In this paper we describe Listing and User Embedding techniques we developed and deployed for purposes of Real-time Personalization in Search Ranking and Similar Listing Recommendations, two channels that drive 99% of conversions. The embedding models were specifically tailored for Airbnb marketplace, and are able to capture guest’s short-term and long-term interests, delivering effective home listing recommendations. We conducted rigorous offline testing of the embedding models, followed by successful online tests before fully deploying them into production.

[[Attachments]]

论文解决问题:

  • 数据稀疏情况下如何进行训练?

    • 稀疏 id 聚类处理
  • 将 user id 和 listing id 聚合成 user type、listing type

  • word2vec 中加入 booked listing 作为 global context

目标函数

image.png

$\underset{\theta}{\operatorname{argmax}} \sum_{(l, c) \in \mathcal{D}_{p}} \log \frac{1}{1+e^{-\mathbf{v}_{c}^{\prime} \mathbf{v}_{l}}}+\sum_{(l, c) \in \mathcal{D}_{n}} \log \frac{1}{1+e^{\mathbf{v}_{c}^{\prime} \mathbf{v}_{l}}}+\log \frac{1}{1+e^{-\mathbf{v}_{l_b}^{\prime} \mathbf{v}_{l}}}$

booked listing 是正样本,所以有一个负号。由于只有一项,所以没有 sigma 符号。

原始负采样是在全体样本中随机选择,业务特殊性,在正样本同一市场的样本中负采样。更好能发现同一市场中内部 listing 的差异。

new listing 通过附近 3 个同类型、相似价格的 listing embedding 进行平均。

word2vec 方法本来是无监督的,通过引入 booked listing 以及 reject 信息传递部分监督信息

利用 booked 数据训练时,文章中提到 user type 和 listing type 要在同一个空间,不知道为什么下面要有两个公式,以及 c 的含义是什么?

$\underset{\theta}{\operatorname{argmax}} \sum_{\left(u_{t}, c\right) \in \mathcal{D}_{book}} \log \frac{1}{1+e^{-v_{c}^{\prime} v_{u_{t}}}}+\sum_{\left(u_{t}, c\right) \in \mathcal{D}_{neg}} \log \frac{1}{1+e^{v_{c}^{\prime} v_{u_{t}}}}$

$\underset{\theta}{\operatorname{argmax}} \sum_{\left(l_{t}, c\right) \in \mathcal{D}_{book}} \log \frac{1}{1+e^{-\mathrm{v}_{c}^{\prime} \mathrm{v}_{l_{t}}}}+\sum_{\left(l_{t}, c\right) \in \mathcal{D}_{neg}} \log \frac{1}{1+e^{\mathrm{v}_{c}^{\prime} \mathrm{v}_{l_{t}}}}$

embedding 之间直接对比需要在相同的向量空间。根据上面的计算可以生成一个特征 UserTypeListingTypeSim。

如何在实时模型中引入 embedding 特征:根据一些规则收集一些 listing,计算这些 listing embedding 的平均值,再和当前排序的 listing 计算一个相似度,做为一个特征放到模型中。
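
代码示意:上述实时个性化特征的极简草图,对用户近期交互(如点击)的 listing embedding 取平均,再与候选 listing embedding 计算余弦相似度作为排序特征;仅为示意,非论文官方实现。

```python
import numpy as np

def emb_sim_feature(candidate_emb, clicked_embs):
    """candidate_emb: [d];clicked_embs: [n, d],近期交互 listing 的 embedding。"""
    user_vec = np.mean(clicked_embs, axis=0)                  # 用户侧向量:近期 listing embedding 的平均
    return float(np.dot(user_vec, candidate_emb) /
                 (np.linalg.norm(user_vec) * np.linalg.norm(candidate_emb) + 1e-8))
```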

如何评估按上面方法生成的 emb sim 特征重要性?利用 GBDT?

[[冷启动]] listing embeddings

  • 找方圆 10 英里之内的 3 个最相似的 listing,取 listing embedding 的平均

使用 Embedding 方法要考虑的问题:

    1. 希望 Embedding 表达什么,即选择哪一种方式构建语料
    2. 如何让 Embedding 向量学到东西
    3. 如何评估向量的效果
    4. 线上如何使用