@Transformers in Time Series: A Survey

[[Abstract]]

  • Transformers have achieved superior performances in many tasks in natural language processing and computer vision, which also triggered great interest in the time series community.

  • Among multiple advantages of transformers, the ability to capture long-range dependencies and interactions is especially attractive for time series modeling, leading to exciting progress in various time series applications.

    • In this paper, we systematically review transformer schemes for time series modeling by highlighting their strengths as well as limitations. In particular, we examine the development of time series transformers in two perspectives.

      • From the perspective of network structure, we summarize the adaptations and modifications that have been made to Transformers in order to accommodate the challenges in time series analysis.

      • From the perspective of applications, we categorize time series transformers based on common tasks including forecasting, anomaly detection, and classification.

    • Empirically, we perform robust analysis, model size analysis, and seasonal-trend decomposition analysis to study how transformers perform in time series.

    • Finally, we discuss and suggest future directions to provide useful research guidance.

  • A corresponding resource list, which will be continuously updated, can be found in the GitHub repository.

[[Attachments]]

Input Encoding and Positional Encoding

  • Absolute Positional Encoding

  • Relative Positional Encoding

  • Hybrid positional encodings
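For reference, a minimal sketch of the absolute (sinusoidal) positional encoding of the vanilla Transformer, which is commonly added to the value embedding of the input series; `d_model` is assumed to be even.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """PE[t, 2i] = sin(t / 10000^(2i/d_model)), PE[t, 2i+1] = cos(t / 10000^(2i/d_model))."""
    positions = np.arange(seq_len)[:, None]                        # (seq_len, 1)
    div = np.exp(np.arange(0, d_model, 2) * -(np.log(10000.0) / d_model))
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(positions * div)
    pe[:, 1::2] = np.cos(positions * div)
    return pe

# Typically added to the (embedded) input series before the first attention layer:
# x = value_embedding(series) + sinusoidal_positional_encoding(L, d_model)
```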

Network Modifications for Time Series

  • [[LogTrans]] [Li et al., 2019] and [\[\[Pyraformer\]\]](/post/logseq/%40Pyraformer%3A%20Low-Complexity%20Pyramidal%20Attention%20for%20Long-Range%20Time%20Series%20Modeling%20and%20Forecasting.html) explicitly introduce a sparsity bias

  • Remove part of the values in the self-attention matrix: [\[\[@Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting\]\]](/post/logseq/%40Informer%3A%20Beyond%20Efficient%20Transformer%20for%20Long%20Sequence%20Time-Series%20Forecasting.html) [[FEDformer]]
  • Architecture Level

    • renovate the Transformer architecture

    • hierarchical architecture

Applications of Time Series Transformers

  • Forecasting

    • Time Series Forecasting

      • [[LogTrans]]

        • proposed convolutional self-attention, employing causal convolutions to generate queries and keys in the self-attention layer (sketched below)

        • a LogSparse mask
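A hedged PyTorch sketch of convolutional self-attention as described for [[LogTrans]]: queries and keys come from causal 1-D convolutions (local context) instead of point-wise projections, while values stay point-wise. The kernel size and the plain dense softmax are illustrative; the LogSparse mask itself is only indicated by a comment.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvSelfAttention(nn.Module):
    """Queries/keys from causal convolutions (kernel_size > 1 adds local context);
    values remain point-wise projections."""
    def __init__(self, d_model: int, kernel_size: int = 3):
        super().__init__()
        self.pad = kernel_size - 1                      # left padding => causal
        self.q_conv = nn.Conv1d(d_model, d_model, kernel_size)
        self.k_conv = nn.Conv1d(d_model, d_model, kernel_size)
        self.v_proj = nn.Linear(d_model, d_model)

    def forward(self, x):                               # x: (batch, length, d_model)
        h = x.transpose(1, 2)                           # (batch, d_model, length)
        q = self.q_conv(F.pad(h, (self.pad, 0))).transpose(1, 2)
        k = self.k_conv(F.pad(h, (self.pad, 0))).transpose(1, 2)
        v = self.v_proj(x)
        scores = q @ k.transpose(1, 2) / (x.size(-1) ** 0.5)
        # A LogSparse mask would keep only O(log L) positions per row here.
        return torch.softmax(scores, dim=-1) @ v
```

With `kernel_size = 1` this reduces to canonical point-wise self-attention.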

      • [[@Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting]]

      • AST [[Adversarial sparse transformer for time series forecasting]]

        • trains a sparse Transformer for time series forecasting with a generative adversarial encoder-decoder framework

        • adversarial training directly shapes the output distribution of the network, improving forecasts and avoiding the error accumulation of one-step-ahead inference
      • [[Autoformer]]

        • a simple seasonal-trend decomposition architecture

        • an auto-correlation mechanism working as an attention module, with $O(L \log L)$ complexity (see the sketch below)

          • measures the time-delay similarity between input series and aggregates the top-k similar sub-series to produce the output
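A rough single-head numpy sketch of the mechanism described above: per-delay similarity is estimated with FFTs in O(L log L), and the top-k time-delayed (rolled) versions of the values are aggregated with softmax weights. The shapes, channel averaging, and rolling direction are simplifying assumptions rather than the exact Autoformer implementation.

```python
import numpy as np

def auto_correlation(q, k, v, top_k=3):
    """q, k, v: (length, d_model) arrays; returns an array shaped like v."""
    length = q.shape[0]
    # Autocorrelation per time delay via the Wiener-Khinchin theorem, O(L log L).
    q_fft = np.fft.rfft(q, axis=0)
    k_fft = np.fft.rfft(k, axis=0)
    corr = np.fft.irfft(q_fft * np.conj(k_fft), n=length, axis=0)
    scores = corr.mean(axis=1)                    # one similarity score per delay
    delays = np.argsort(scores)[-top_k:]          # top-k most similar delays
    weights = np.exp(scores[delays] - scores[delays].max())
    weights /= weights.sum()                      # softmax over the k delays
    # Aggregate the time-delayed (rolled) value series.
    return sum(w * np.roll(v, -int(d), axis=0) for w, d in zip(weights, delays))
```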
      • [[FEDformer]]

        • applies the attention operation in the frequency domain using the [[Fourier transform]] and the [[Wavelet transform]]

          • linear complexity
      • [[@Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting]]

        • a multi-horizon forecasting model with static covariate encoders, a gating-based feature selection module, and a temporal self-attention decoder
      • [[SSDNet]] [[ProTran]]

        • combine Transformers with state space models to provide probabilistic forecasts
      • [[Pyraformer]]

        • a hierarchical pyramidal attention module with a binary-tree-structured path

      • [[Aliformer]]

        • Knowledge-guided attention
    • Spatio-Temporal Forecasting [[Traffic Flow Forecasting]]

      • Traffic transformer: Capturing the continuity and periodicity of time series for traffic forecasting

        • a self-attention module to capture temporal dependencies

        • a graph neural network module to capture spatial dependencies

      • Spatial-temporal Transformer

        • a spatial Transformer assists the graph convolution network in capturing spatial dependencies
      • Spatio-temporal graph Transformer

        • an attention-based graph convolution mechanism
    • Event Forecasting

      • temporal point processes (TPP)
  • Anomaly Detection

  • Classification

    • [[GTN]]

Experimental Evaluation and Discussion

Studies model robustness, model size, and the ability to capture seasonality and trend in time series.

  • robustness analysis, model size analysis, and seasonal-trend decomposition analysis

  • Seasonal-trend decomposition is an essential component of Transformers for time series forecasting

  • After adding the proposed moving-average trend decomposition architecture, every model improves over its original version (see the sketch below)
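A minimal sketch of such a moving-average decomposition block (endpoint padding and kernel size are assumptions): the trend is a moving average of the series and the seasonal part is the residual.

```python
import numpy as np

def series_decomposition(x: np.ndarray, kernel_size: int = 25):
    """x: (length,) series. Returns (seasonal, trend), both with the same length."""
    left = (kernel_size - 1) // 2
    right = kernel_size - 1 - left
    # Pad by repeating the endpoints so the moving average keeps the length.
    padded = np.concatenate([np.full(left, x[0]), x, np.full(right, x[-1])])
    trend = np.convolve(padded, np.ones(kernel_size) / kernel_size, mode="valid")
    seasonal = x - trend
    return seasonal, trend
```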

Future Research Opportunities

  • [[inductive bias]] for Time Series Transformers

    • Training Transformers requires a large amount of data to avoid overfitting.

    • Time series data often exhibit seasonal/periodic and trend patterns.

    • Introduce an understanding of time series data and task-specific characteristics into Transformers as inductive biases.

  • [[GNN]]

    • Enhance the ability to model spatial dependencies and relations across multiple dimensions.
  • [[预训练]] (pre-training)

    • Existing pre-trained Transformers for time series focus mainly on time series classification tasks.
  • [[Neural architecture search]]

    • How to construct efficient Transformer architectures.

Ref


@FiBiNET: Combining Feature Importance and Bilinear feature Interaction for Click-Through Rate Prediction

[[Abstract]]

  • Advertising and feed ranking are essential to many Internet companies such as Facebook and Sina Weibo. Among many real-world advertising and feed ranking systems, click through rate (CTR) prediction plays a central role. There are many proposed models in this field such as logistic regression, tree based models, factorization machine based models and deep learning based CTR models. However, many current works calculate the feature interactions in a simple way such as Hadamard product and inner product and they care less about the importance of features. In this paper, a new model named FiBiNET as an abbreviation for Feature Importance and Bilinear feature Interaction NETwork is proposed to dynamically learn the feature importance and fine-grained feature interactions. On the one hand, the FiBiNET can dynamically learn the importance of features via the Squeeze-Excitation network (SENET) mechanism; on the other hand, it is able to effectively learn the feature interactions via bilinear function. We conduct extensive experiments on two real-world datasets and show that our shallow model outperforms other shallow models such as factorization machine(FM) and field-aware factorization machine(FFM). In order to improve performance further, we combine a classical deep neural network(DNN) component with the shallow model to be a deep model. The deep FiBiNET consistently outperforms the other state-of-the-art deep models such as DeepFM and extreme deep factorization machine(XdeepFM).

[[Attachments]]

Key contributions #card

  • Use a SENET layer to re-weight the feature embeddings

  • Use a Bilinear-Interaction Layer for feature crossing

Background

  • importance of features and feature interactions
    Network architecture #card
    image.png

[[SENET]] Squeeze-and-Excitation Network: a feature re-weighting method (see the sketch after the figure below)

  • Squeeze: pool the f×k field embedding matrix down to f×1 (one statistic per field)
  • Excitation: pass it through a two-layer DNN to obtain f×1 field weights
  • Re-Weight: multiply the weights back into the original f×k embeddings.
  • The idea is to control the scale so that important features are amplified and unimportant ones are suppressed, making the extracted features more discriminative.
    [[Bilinear-Interaction Layer]] :-> combines the inner product and the Hadamard product and introduces an extra parameter matrix W to learn feature interactions (sketched after the figure below).
  • When crossing two features as v_i * W * v_j, where the weight matrix W comes from:
    • Field-All Type :-> a single W shared by all fields
      • Parameter count :-> feature * embedding + emb*emb
    • Field-Each Type :-> one W per field
      • Parameter count :-> feature * embedding + feature*emb*emb
    • Field-Interaction Type :-> one W per field pair
      • Parameter count :-> feature * embedding + feature*feature*emb*emb
  • Compared with [[FFM]], the Bilinear-Interaction Layer effectively reduces the number of parameters
    • FM parameter count :-> feature * embedding
    • FFM parameter count :-> feature * field * embedding
  • Different feature interaction types:

image.png
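A toy numpy sketch of the two components above, under assumed shapes (f fields, embedding size k, reduction size r) and with mean pooling plus ReLU in the excitation step; it illustrates the mechanism, not the paper's exact implementation.

```python
import numpy as np

def senet_reweight(E, W1, W2):
    """E: (f, k) field embeddings.
    Squeeze: mean-pool each field to a scalar -> (f,).
    Excitation: two small dense layers -> per-field weights (f,).
    Re-weight: scale each field's embedding by its weight."""
    z = E.mean(axis=1)                     # squeeze
    a = np.maximum(z @ W1, 0.0)            # excitation, layer 1 + ReLU
    s = np.maximum(a @ W2, 0.0)            # excitation, layer 2 -> (f,) weights
    return E * s[:, None]                  # re-weight

def bilinear_interaction_field_all(E, W):
    """Field-All bilinear interaction: p_ij = (v_i . W) * v_j (Hadamard product),
    with a single (k, k) matrix W shared by every field pair."""
    f, _ = E.shape
    return np.stack([(E[i] @ W) * E[j]     # one k-dim interaction vector per pair
                     for i in range(f) for j in range(i + 1, f)])

f, k, r = 10, 8, 4                          # hypothetical sizes
rng = np.random.default_rng(0)
E = rng.normal(size=(f, k))
E_senet = senet_reweight(E, rng.normal(size=(f, r)), rng.normal(size=(r, f)))
pairs = bilinear_interaction_field_all(E_senet, rng.normal(size=(k, k)))  # (f*(f-1)/2, k)
```

In FiBiNET both the original embeddings and the SENET-reweighted embeddings go through the bilinear interaction before the DNN part; only one path is shown here.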
For the [[ETA]] model, the FM crossing part could try introducing a bilinear layer, combining W by link state.

  • However, the link (road) state may change over time.

Ref


@Applying Deep Learning To Airbnb Search

[[Abstract]]

  • The application to search ranking is one of the biggest machine learning success stories at Airbnb. Much of the initial gains were driven by a gradient boosted decision tree model. The gains, however, plateaued over time. This paper discusses the work done in applying neural networks in an attempt to break out of that plateau. We present our perspective not with the intention of pushing the frontier of new modeling techniques. Instead, ours is a story of the elements we found useful in applying neural networks to a real life product. Deep learning was steep learning for us. To other teams embarking on similar journeys, we hope an account of our struggles and triumphs will provide some useful pointers. Bon voyage!

[[Attachments]]

Notes on Airbnb's journey applying deep models to search ranking.

Business setting: a guest query returns an ordered list of listings (rooms).

Before deep models, a GBDT model scored the listings.

Model Evolution

  • Evaluation metric: [[NDCG]]

  • Simple NN

    • A simple NN with a single 32-unit hidden layer + ReLU, trained with the same features and objective as the GBDT; results were comparable.

    • Its main value was establishing the end-to-end pipeline from deep model training to online prediction.

  • LambdaRank NN

    • [[LambdaRank]] directly optimizes NDCG

    • Pairwise training: construct training pairs of <booked listing, not-booked listing>

    • Weight the pairwise loss by the NDCG change from swapping the pair's positions, focusing on listings near the top of the list (see the sketch below)

      • For example, moving a booked listing from rank 2 to rank 1 matters more than moving one from rank 10 to rank 9.
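A rough sketch of the weighting described above: a pairwise logistic loss scaled by the |ΔNDCG| of swapping the booked and not-booked listing. The gain and discount choices (relevance 1 only for the booked listing, 1/log2(rank+1) discount) are illustrative assumptions.

```python
import numpy as np

def delta_ndcg(rank_booked: int, rank_not_booked: int, ideal_dcg: float = 1.0) -> float:
    """|NDCG change| from swapping the two positions (ranks are 1-indexed).
    Only the booked listing has relevance 1, so only its discount matters."""
    def discount(rank: int) -> float:
        return 1.0 / np.log2(rank + 1)
    return abs(discount(rank_booked) - discount(rank_not_booked)) / ideal_dcg

def lambdarank_pair_loss(score_booked, score_not_booked, rank_booked, rank_not_booked):
    """Pairwise logistic loss, scaled by the NDCG swap delta of the pair."""
    logistic = np.log1p(np.exp(-(score_booked - score_not_booked)))
    return delta_ndcg(rank_booked, rank_not_booked) * logistic

# Swapping ranks 2 and 1 is weighted far more heavily than swapping 10 and 9:
print(round(delta_ndcg(2, 1), 3))   # ~0.369
print(round(delta_ndcg(10, 9), 3))  # ~0.012
```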
  • Decision Tree/Factorization Machine NN

    • Feed the GBDT leaf node indices (as categorical features) and the FM prediction into the NN (see the sketch below).
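A hedged sketch of that idea, assuming scikit-learn's gradient boosting for the leaf indices; the FM score is a placeholder scalar and the data is random.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
X = rng.random((1000, 20))
y = (rng.random(1000) > 0.5).astype(int)

gbdt = GradientBoostingClassifier(n_estimators=50, max_depth=3).fit(X, y)
leaf_ids = gbdt.apply(X)[:, :, 0]      # (n_samples, n_trees): leaf index per tree
fm_score = rng.random((1000, 1))       # placeholder for the FM model's prediction

# Each tree's leaf index acts as a categorical feature (to be one-hot encoded or
# embedded), concatenated with the FM score and the raw features as NN input.
nn_input = np.hstack([X, leaf_ids, fm_score])
```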
  • Deep NN

    • 10x training data, 195 features (categorical features embedded), and two hidden layers: 127 FC + ReLU followed by 83 FC + ReLU.

    • Some DNN features come from other models; the features described in [[@Real-time Personalization using Embeddings for Search Ranking at Airbnb]] are also used.

    • With 1.7 billion training samples, NDCG on the test set matched the training set, suggesting offline results can be used to estimate online performance.

    • Deep models have surpassed human performance on image classification, but it is hard to tell whether the same holds for search; a key difficulty is defining human-level performance for search tasks.

Failed Models

  • Listing ID

    • Embedding the listing ID led to overfitting.

      • Embeddings need a large amount of data per item to learn anything useful.

      • Some listings have unique properties and would require a certain amount of data to train well.

  • Multi-task learning

    • Bookings are much sparser than views; long views are correlated with bookings.

    • Two tasks, Booking Logit and Long View Logit, share the network. Because the two labels differ in magnitude, the long-view label is weighted by log(view_duration) so the model still focuses more on bookings.

      • In the online experiment, page view duration increased but the booking metric was essentially unchanged.

      • The authors attribute long views to high-end listings or long page descriptions.

  • [[Feature Engineering]]

    • Common feature engineering for GBDT: ratios, moving-window averages, and feature crosses.

    • An NN can cross features automatically in its hidden layers, but some preprocessing of the features makes the NN more effective.

  • Feature [[Normalization]]

    • NNs are sensitive to the scale of numeric features: overly large inputs lead to large gradients during backpropagation (both transforms below are sketched after this list).

    • For roughly normally distributed features:

    • $(feature\_val - \mu) / \sigma$

    • For power-law distributed features:

    • $\log \frac{1 + feature\_val}{1 + median}$
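A small sketch of the two transforms above; the DataFrame columns ("price" as a roughly normal feature, "num_views" as a power-law feature) are hypothetical examples, not features from the paper.

```python
import numpy as np
import pandas as pd

def normalize_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Roughly normal feature: (feature_val - mean) / std.
    out["price"] = (df["price"] - df["price"].mean()) / df["price"].std()
    # Power-law feature: log((1 + feature_val) / (1 + median)).
    out["num_views"] = np.log((1.0 + df["num_views"]) / (1.0 + df["num_views"].median()))
    return out

df = pd.DataFrame({"price": [80.0, 120.0, 450.0], "num_views": [3, 40, 1200]})
print(normalize_features(df))
```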

  • Feature distribution

    • Check whether the feature distribution is smooth.

    • Are there outliers?

    • Smooth distributions generalize more easily and help keep the outputs of upper layers well distributed.

  • [[Hyperparameters]]

    • [[Dropout]] can be viewed as a form of data augmentation that simulates random missing values. However, dropping features may produce samples that no longer make sense, distracting the model with invalid scenarios.

      • An alternative: inject artificial noise into training samples according to the feature distribution; this helped offline but not online.
    • [[神经网络参数全部初始化为0]] (initializing all parameters to zero) did not work; parameters are initialized with [[Xavier Initialization]] and embeddings with a random uniform distribution over [-1, 1].

    • The default [[Adam]] learning rate did not work well; [[LazyAdam]] trains faster in scenarios with large embeddings.

  • Feature Importance [[可解释性]]

    • Score Decomposition: decompose the NN's score onto individual features. [[GBDT]] supports this, but an NN's score is not easily decomposable.

    • Ablation Test: retrain the model with one feature removed at a time. The problem is that the model can compensate for the missing feature from the remaining ones.

    • Permutation Test: pick a feature and replace its values with random ones, a method commonly used with [[Random Forests]]. However, the permuted samples may not follow the real-world distribution, and a feature may only matter in combination with other features (see the sketch after this list).

    • TopBot Analysis: compare the distribution of individual features between the top and bottom of the ranked results.

      • The left plot shows the distribution of listing prices: the top and bottom distributions clearly differ, so the model is sensitive to price.

      • The right plot shows the distribution of page views: the top and bottom distributions are similar, so the model is not making good use of this feature.
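A minimal sketch of the permutation test described above, using a toy regression model and MSE as the metric (the model, data, and metric here are stand-ins, not the Airbnb setup).

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

def permutation_importance(model, X, y, metric, n_repeats=5, seed=0):
    """Increase in the error metric when one feature column is shuffled;
    a larger increase means the feature matters more."""
    rng = np.random.default_rng(seed)
    base = metric(y, model.predict(X))
    importances = []
    for j in range(X.shape[1]):
        increases = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            X_perm[:, j] = rng.permutation(X_perm[:, j])  # break the feature's link to y
            increases.append(metric(y, model.predict(X_perm)) - base)
        importances.append(np.mean(increases))
    return np.array(importances)

# Toy data where only feature 0 matters.
rng = np.random.default_rng(0)
X = rng.random((500, 5))
y = 3 * X[:, 0] + rng.normal(scale=0.1, size=500)
model = RandomForestRegressor(n_estimators=50, random_state=0).fit(X, y)
print(permutation_importance(model, X, y, mean_squared_error))
```

As the note points out, the shuffled rows may be unrealistic and feature interactions are ignored, so the scores are only indicative.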

Odd notes

  • The paper repeatedly quotes [[Andrej Karpathy]]'s advice: don't be a hero

@Learning Piece-wise Linear Models from Large Scale Data for Ad Click Prediction

[[Abstract]]

  • CTR prediction in real-world business is a difficult machine learning problem with large scale nonlinear sparse data. In this paper, we introduce an industrial strength solution with a model named Large Scale Piece-wise Linear Model (LS-PLM). We formulate the learning problem with L1 and L2,1 regularizers, leading to a non-convex and non-smooth optimization problem. Then, we propose a novel algorithm to solve it efficiently, based on directional derivatives and quasi-Newton method. In addition, we design a distributed system which can run on hundreds of machines in parallel and provides us with the industrial scalability. LS-PLM model can capture nonlinear patterns from massive sparse data, saving us from heavy feature engineering jobs. Since 2012, LS-PLM has become the main CTR prediction model in Alibaba’s online display advertising system, serving hundreds of millions of users every day.

[[Attachments]]

Piece-wise linear fit of the data: partition the feature space into multiple regions, fit each region with a linear model, and output a weighted average of the per-region predictions.

Equivalent to applying [[Attention]] over the regions.

The structure resembles a three-layer neural network.

Model

Handles large-scale sparse, nonlinear data.

LS-PLM learns the nonlinear patterns in the data.

question: Why can't an LR model separate the data below, and how could it be separated? [[SVM]] [[FM]]

$$p(y=1 \mid x)=g\left(\sum_{j=1}^{m} \sigma\left(u_{j}^{T} x\right) \eta\left(w_{j}^{T} x\right)\right)$$

u and w are both d-dimensional vectors.

m is the number of regions.

In practice, it is commonly instantiated as:

$$p(y=1 \mid x)=\sum_{i=1}^{m} \frac{\exp\left(u_{i}^{T} x\right)}{\sum_{j=1}^{m} \exp\left(u_{j}^{T} x\right)} \cdot \frac{1}{1+\exp\left(-w_{i}^{T} x\right)}$$

The model above can be viewed as a three-layer neural network (a toy sketch follows).
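The mixture formula above written out as a toy numpy function (region count and dimensions are arbitrary): a softmax over the gating parameters U weights per-region logistic predictions with parameters W.

```python
import numpy as np

def ls_plm_predict(x, U, W):
    """x: (d,) features; U, W: (m, d) gating / prediction parameters per region.
    p(y=1|x) = sum_i softmax(U x)_i * sigmoid(w_i^T x)."""
    gate_logits = U @ x                               # u_i^T x for each region
    gates = np.exp(gate_logits - gate_logits.max())
    gates /= gates.sum()                              # softmax over the m regions
    region_probs = 1.0 / (1.0 + np.exp(-(W @ x)))     # sigmoid(w_i^T x) per region
    return float(gates @ region_probs)

m, d = 4, 10                                          # arbitrary sizes
rng = np.random.default_rng(0)
print(ls_plm_predict(rng.normal(size=d), rng.normal(size=(m, d)), rng.normal(size=(m, d))))
```

Reading the gating softmax as the hidden layer and the weighted sum as the output layer is what makes the three-layer neural network analogy work.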

Regularization

  • $$\arg \min_{\Theta} f(\Theta)=\operatorname{loss}(\Theta)+\lambda\|\Theta\|_{2,1}+\beta\|\Theta\|_{1}$$

  • The L1 term works as usual, keeping the parameters sparse.

  • The L2,1 term, given below, applies an L2 norm over each feature's parameters and sums over features. As the optimization drives this term down, it effectively performs feature selection: each feature has more than one parameter, and a feature is useless only when all of its parameters are zero (a small sketch follows below).

  • $$\|\Theta\|_{2,1}=\sum_{i=1}^{d} \sqrt{\sum_{j=1}^{2m} \theta_{i j}^{2}}$$

  • Effect of the regularization: both the parameters and the set of used features become sparse.

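A small sketch of the L2,1 term above, with Θ arranged as a (d, 2m) matrix, one row of 2m parameters per feature (each feature's u and w entries stacked); the arrangement is an assumption for illustration.

```python
import numpy as np

def l21_norm(theta: np.ndarray) -> float:
    """theta: (d, 2m). ||Theta||_{2,1} = sum_i sqrt(sum_j theta_ij^2).
    A feature is removed only when its entire row is driven to zero."""
    return float(np.sqrt((theta ** 2).sum(axis=1)).sum())
```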

@wait How the loss function is optimized and the engineering implementation remain to be read.

[[Ref]]