A Survey of Transformers

[[PTM]] pre-trained model

In 2017, Google published the paper "Attention Is All You Need", which introduced the Transformer architecture; a large body of follow-up work has since built on and applied it. Variants that improve on the vanilla Transformer are collectively called "X-formers".

X-formers improve on the original Transformer along three directions:

  • Model Efficiency

    • the computation and parameter/memory cost introduced by self-attention

      • sparse attention: lightweight attention schemes

      • divide-and-conquer methods

  • Model Generalization

    • the architecture is flexible and imposes few structural biases on the data

    • training therefore requires large amounts of data

    • structural bias or regularization, pre-training on large-scale unlabeled data

  • Model Adaptation

    • adapting the Transformer to specific downstream tasks.

Background

Model usage patterns

  • Encoder-Decoder

  • Encoder only

    • classification or sequence labeling
  • Decoder only

    • sequence generation

      • language modeling

Taxonomy by how the vanilla Transformer is improved: architecture modification, pre-training, and applications

  • architecture modification

    • Module Level

      • [[Attention]]

        • Challenges

          • computational complexity grows quadratically with the sequence length

          • no structural prior, so it overfits easily on small datasets

        • Sparse Attention

          • compute attention only for token pairs (i, j) that are related, and store the result as a sparse matrix

          • how the relation is defined

            • position-based

              • compute attention only between pre-specified positions

              • atomic sparse attention

                • Global

                • Band

                • Dilated

              • compound patterns combine atomic patterns into more complex attention rules (see the mask sketch below)
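
A minimal NumPy sketch of the atomic patterns above (function names are mine): each pattern is just a boolean mask over the T×T score matrix, and a compound pattern is obtained by OR-ing atomic masks before the softmax.

```python
import numpy as np

def band_mask(T, w):
    """Band (sliding-window) pattern: token i attends to j when |i - j| <= w."""
    idx = np.arange(T)
    return np.abs(idx[:, None] - idx[None, :]) <= w

def dilated_mask(T, w, gap):
    """Dilated pattern: a band with stride `gap`, enlarging the receptive field."""
    idx = np.arange(T)
    dist = np.abs(idx[:, None] - idx[None, :])
    return (dist <= w * gap) & (dist % gap == 0)

def global_mask(T, global_tokens):
    """Global pattern: a few tokens attend to, and are attended by, every position."""
    m = np.zeros((T, T), dtype=bool)
    m[global_tokens, :] = True
    m[:, global_tokens] = True
    return m

# Compound pattern: OR the atomic masks, then mask out disallowed scores.
T = 16
mask = band_mask(T, 2) | dilated_mask(T, 2, gap=4) | global_mask(T, [0])
scores = np.random.randn(T, T)
scores = np.where(mask, scores, -np.inf)   # -inf becomes 0 after softmax
```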

            • content-based

              • Routing Transformer

                • Efficient Content-Based Sparse Attention with Routing Transformers

                • clustering: attention is computed within learned clusters of queries/keys

              • [[Reformer]]

                • uses LSH; only tokens hashed into the same bucket attend to each other

        • [[Linearized Attention]]

          • standard QKV attention costs $$O(T^2D)$$; introducing a kernel feature map reduces this to $$O(TD)$$ (see the sketch below)

          • key components

            • kernel feature map

              • Performer approximates the softmax attention function with other (random-feature) functions

                • FAVOR+ (Fast Attention Via positive Orthogonal Random features)

            • aggregation rule
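
A minimal sketch of the kernel trick behind linearized attention. The feature map below is the elu(x)+1 map from Katharopoulos et al. (Performer's FAVOR+ would substitute random orthogonal features); the essential point is the order of multiplication, which avoids materializing the T×T attention matrix and keeps the cost linear in T.

```python
import numpy as np

def feature_map(x):
    # One possible kernel feature map phi(.): elu(x) + 1.
    return np.where(x > 0, x + 1.0, np.exp(x))

def linearized_attention(Q, K, V):
    """Compute phi(Q) @ (phi(K)^T V) with row-wise normalization:
    roughly O(T * D * D_v) instead of O(T^2 * D)."""
    Qf, Kf = feature_map(Q), feature_map(K)           # (T, D)
    KV = Kf.T @ V                                     # (D, D_v), independent of T^2
    Z = Qf @ Kf.sum(axis=0, keepdims=True).T          # (T, 1) normalizer
    return (Qf @ KV) / Z

T, D = 128, 16
Q, K, V = (np.random.randn(T, D) for _ in range(3))
out = linearized_attention(Q, K, V)   # shape (T, D)
```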

        • Query Prototyping and Memory Compression

        • Low-rank Self-Attention

          • the number of linearly independent rows of the attention matrix A is much smaller than the sequence length T, i.e. A is effectively low-rank (projection sketch below)

          • Low-rank Parameterization

          • Low-rank Approximation
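
A hedged sketch of the low-rank approximation idea, instantiated Linformer-style (my choice): keys and values are projected along the sequence dimension down to a rank k much smaller than T, so attention is computed against k pseudo-tokens.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def low_rank_attention(Q, K, V, k=32):
    """Project K and V from length T down to k along the sequence axis,
    so the attention matrix is T x k instead of T x T."""
    T, D = K.shape
    E = np.random.randn(k, T) / np.sqrt(T)    # learned in practice; random here
    K_proj, V_proj = E @ K, E @ V             # (k, D)
    scores = Q @ K_proj.T / np.sqrt(D)        # (T, k)
    return softmax(scores) @ V_proj           # (T, D)

T, D = 512, 64
Q, K, V = (np.random.randn(T, D) for _ in range(3))
out = low_rank_attention(Q, K, V, k=32)   # attention matrix is 512 x 32, not 512 x 512
```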

        • Attention with Prior

          • attention is split into a generated (content-based) part and a prior part; the methods below are different ways of constructing the prior

          • Prior that Models locality

            • data such as text is position-sensitive; a prior is computed from the positions i and j using a [[Normal Distribution]] over their distance (see the sketch after this list)

          • Prior from Lower Modules

            • reuses the attention distributions computed in earlier layers

            • Realformer adds a [[ResNet]]-style residual connection between successive attention matrices

            • Lazyformer computes the attention matrix only once every two layers

          • Prior as Multi-task Adapters

            • multi-task adapters, which essentially share parameters across tasks

          • Attention with Only Prior

            • uses only the prior, dropping the generated attention entirely
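
A minimal sketch of the locality prior mentioned above, assuming a Gaussian over the distance |i − j| added to the content-based scores before the softmax; the parameterization is illustrative only.

```python
import numpy as np

def locality_prior(T, sigma=2.0):
    """Gaussian prior over positions: larger when |i - j| is small."""
    idx = np.arange(T)
    dist2 = (idx[:, None] - idx[None, :]) ** 2
    return -dist2 / (2.0 * sigma ** 2)        # added in score (log) space

def attention_with_prior(Q, K, V, sigma=2.0):
    D = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(D)             # generated (content-based) attention
    scores = scores + locality_prior(len(Q), sigma)  # + prior attention
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V

T, D = 64, 16
Q, K, V = (np.random.randn(T, D) for _ in range(3))
out = attention_with_prior(Q, K, V, sigma=2.0)
```
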
        • Improved Multi-Head Mechanism

          • Head Behavior Modeling

          • Multi-head with Restricted Spans

            • in the vanilla model, some heads are observed to attend locally and others globally

            • restrict the attention span (implemented via a distance threshold)

            • the masked multi-head attention in the decoder follows the same idea

          • Multi-head with Refined Aggregation

            • how the outputs of the individual heads are aggregated

            • routing methods

          • Other Modifications

            • Shazeer's multi-query attention shares the keys and values across all heads (sketch below)

            • Bhojanapalli et al. allow the head size to be set flexibly
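
A small, unbatched sketch of multi-query attention: each head keeps its own query projection while a single key/value projection is shared by all heads, which is what shrinks the KV cache during decoding.

```python
import numpy as np

def multi_query_attention(X, Wq_heads, Wk, Wv):
    """Per-head query projections, one shared K and one shared V projection."""
    K, V = X @ Wk, X @ Wv                      # shared across heads: (T, d_head)
    outs = []
    for Wq in Wq_heads:                        # one query projection per head
        Q = X @ Wq                             # (T, d_head)
        s = Q @ K.T / np.sqrt(Q.shape[-1])
        s = np.exp(s - s.max(axis=-1, keepdims=True))
        A = s / s.sum(axis=-1, keepdims=True)
        outs.append(A @ V)
    return np.concatenate(outs, axis=-1)       # (T, n_heads * d_head)

T, d_model, d_head, n_heads = 10, 32, 8, 4
X = np.random.randn(T, d_model)
Wq_heads = [np.random.randn(d_model, d_head) for _ in range(n_heads)]
Wk = np.random.randn(d_model, d_head)
Wv = np.random.randn(d_model, d_head)
out = multi_query_attention(X, Wq_heads, Wk, Wv)   # (10, 32)
```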

      • OTHER MODULE-LEVEL MODIFICATIONS

        • Position Representations

          • the Transformer is permutation-invariant, so extra position information is required

          • Absolute Position Representations

            • sinusoidal encoding (sketch below)

            • learned position embeddings

          • Relative Position Representations.

            • the relative relation between tokens matters more than absolute positions

            • relative position embeddings are added to the keys inside the attention computation

            • Transformer-XL

          • Other Representations

            • TUPE

              • mixes relative and absolute positions
            • Roformer

              • rotary position embedding

              • enables relative position encoding within linear attention

            • Position Representations without Explicit Encoding

              • R-Transformer first passes the input through an RNN and then feeds the output into multi-head attention

              • CPE uses convolutions

            • Position Representation on Transformer Decoders

              • removes the positional encoding entirely
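
For reference, a minimal NumPy sketch of the sinusoidal absolute encoding mentioned above (assuming an even d_model):

```python
import numpy as np

def sinusoidal_positional_encoding(T, d_model):
    """PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
       PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))"""
    pos = np.arange(T)[:, None]
    i = np.arange(0, d_model, 2)[None, :]
    angles = pos / np.power(10000.0, i / d_model)
    pe = np.zeros((T, d_model))
    pe[:, 0::2] = np.sin(angles)
    pe[:, 1::2] = np.cos(angles)
    return pe

pe = sinusoidal_positional_encoding(128, 64)   # added to the token embeddings
```
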
        • LayerNorm

          • Placement of Layer Normalization

            • post-LN

            • pre-LN keeps the skip-connection path free of any other operations (see the sketch below)

          • Substitutes of Layer Normalization

            • the learnable bias and gain in LN are observed to bring little benefit, motivating the substitutes below

            • AdaNorm

            • scaled l2 normalization

            • PowerNorm

          • Normalization-free Transformer

            • ReZero replaces LN with a learnable residual scalar (initialized to zero)
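
A sketch contrasting the two LN placements plus the ReZero variant; `sublayer` stands for either self-attention or the FFN, and LN's learnable gain/bias are omitted for brevity.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def post_ln_block(x, sublayer):
    # Original Transformer: LN applied after the residual addition.
    return layer_norm(x + sublayer(x))

def pre_ln_block(x, sublayer):
    # Pre-LN: LN inside the residual branch, so the skip connection
    # carries x through unchanged (more stable without warm-up).
    return x + sublayer(layer_norm(x))

def rezero_block(x, sublayer, alpha=0.0):
    # ReZero: no LN; a learnable scalar alpha (init 0) gates the branch.
    return x + alpha * sublayer(x)
```
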
        • FFN

          • Activation Function in FFN

            • [[Swish]]

            • [[GPT]] [[GELU]]

          • Adapting FFN for Larger Capacity

            • product-key memory layers

            • MoE

              • route each token to the top-k experts

              • route to the single highest-scoring expert (top-1)

              • group the experts and take the top-1 within each group (routing sketch below)
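
A toy sketch of top-k expert routing in an MoE FFN (experts are plain functions here; a real implementation batches tokens per expert and adds a load-balancing loss):

```python
import numpy as np

def moe_ffn(x, experts, router_W, k=2):
    """Router scores the experts per token, only the top-k are evaluated,
    and their outputs are mixed by re-normalized router weights (k=1 is
    Switch-style routing)."""
    outputs = np.zeros_like(x)
    logits = x @ router_W                        # (T, n_experts)
    for t, token in enumerate(x):
        topk = np.argsort(logits[t])[-k:]        # indices of the top-k experts
        gate = np.exp(logits[t, topk])
        gate /= gate.sum()
        for g, e in zip(gate, topk):
            outputs[t] += g * experts[e](token)
    return outputs

# Toy usage: 4 "experts", each a random linear map.
d, n_experts, T = 8, 4, 5
Ws = [np.random.randn(d, d) for _ in range(n_experts)]
experts = [(lambda W: (lambda v: v @ W))(W) for W in Ws]
x = np.random.randn(T, d)
y = moe_ffn(x, experts, router_W=np.random.randn(d, n_experts), k=2)
```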

          • Dropping FFN Layers

            • simplifies the network

    • Arch. Level

      • Adapting Transformer to Be Lightweight

        • Lite Transformer

        • Funnel Transformer

          • hidden sequence pooling and up-sampling
        • DeLighT

          • DeLighT block
      • Strengthening Cross-Block Connectivity

        • targets problems on the decoder side

        • Transparent Attention

        • Feedback Transformer

          • uses information from all layers of the previous time step
      • [[Adaptive Computation Time]]

        • removes the fixed number of layers used in earlier models

        • three approaches

          • [[Universal Transformers]]: dynamic halting

            • tokens that reach the halting condition are no longer updated
          • CCT

            • layer skipping
      • Transformers with Divide-and-Conquer Strategies

        • splits the long text in language-modeling tasks into multiple segments

        • Recurrent Transformers: the cached output of one segment is fed into the next segment's computation (see the sketch below)

          • Transformer-XL concatenates the cached states of the previous segment with the input of the next segment
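
A simplified sketch of segment-level recurrence in the spirit of Transformer-XL: cached states from the previous segment are concatenated to the keys/values of the current one. A real implementation caches per-layer hidden states (without gradients) and uses relative position encodings; this toy version caches only the raw segment inputs.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def segment_attention(h_curr, memory, Wq, Wk, Wv):
    """Queries come from the current segment; keys/values are computed over
    [memory; current segment]."""
    ctx = np.concatenate([memory, h_curr], axis=0) if memory is not None else h_curr
    Q, K, V = h_curr @ Wq, ctx @ Wk, ctx @ Wv
    A = softmax(Q @ K.T / np.sqrt(Q.shape[-1]))
    return A @ V

# Process a long sequence segment by segment, carrying memory forward.
d, seg_len = 16, 8
Wq, Wk, Wv = (np.random.randn(d, d) for _ in range(3))
segments = [np.random.randn(seg_len, d) for _ in range(3)]
memory = None
for seg in segments:
    out = segment_attention(seg, memory, Wq, Wk, Wv)
    memory = seg        # cache the current segment's states for the next step
```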

        • Hierarchical Transformers: aggregate the outputs of multiple lower-level models

          • Hierarchical for long sequence inputs

            • sentence Transformer and document Transformer
          • Hierarchical for richer representations

            • e.g. character-level plus word-level representations
      • Exploring Alternative Architecture

        • NAS
    • PRE-TRAINED TRANSFORMERS

    • APPLICATIONS OF TRANSFORMER

      • CV

        • [[ViT]]
    • CONCLUSION AND FUTURE DIRECTIONS

      • theoretical analysis

      • better mechanisms for global interaction

      • a unified framework for handling multiple kinds of data

[[Layer Normalization]]


@Transformers in Time Series: A Survey

[[Abstract]]

  • Transformers have achieved superior performance in many tasks in natural language processing and computer vision, which has also triggered great interest in the time series community.

  • Among multiple advantages of transformers, the ability to capture long-range dependencies and interactions is especially attractive for time series modeling, leading to exciting progress in various time series applications.

    • In this paper, we systematically review transformer schemes for time series modeling by highlighting their strengths as well as limitations. In particular, we examine the development of time series transformers in two perspectives.

      • From the perspective of network structure, we summarize the adaptations and modifications that have been made to the transformer in order to accommodate the challenges in time series analysis.

      • From the perspective of applications, we categorize time series transformers based on common tasks including forecasting, anomaly detection, and classification.

    • Empirically, we perform robust analysis, model size analysis, and seasonal-trend decomposition analysis to study how transformers perform in time series.

    • Finally, we discuss and suggest future directions to provide useful research guidance.

  • A corresponding resource list, which will be continuously updated, can be found in the GitHub repository.

[[Attachments]]

Input Encoding and Positional Encoding

  • Absolute Positional Encoding

  • Relative Positional Encoding

  • Hybrid positional encodings

Network Modifications for Time Series

  • [[LogTrans]] [Li et al., 2019] and [[Pyraformer]] explicitly introduce a sparsity bias

  • [[@Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting]] and [[FEDformer]] remove part of the entries of the self-attention matrix
  • Architecture Level

    • renovate the overall Transformer architecture

    • hierarchical architecture

Applications of Time Series Transformers

  • Forecasting

    • Time Series Forecasting

      • [[LogTrans]]

        • proposes convolutional self-attention, employing causal convolutions to generate the queries and keys in the self-attention layer (see the sketch below)

        • a LogSparse mask
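
A simplified sketch of the convolutional self-attention idea: queries and keys come from a causal 1-D convolution over the series rather than a pointwise projection, so each query/key is aware of its local shape. The single per-projection kernel here is a simplification; LogTrans learns multi-channel kernels.

```python
import numpy as np

def causal_conv1d(x, kernel):
    """Causal 1-D convolution over the time axis: the output at step t only
    sees x[t-k+1 .. t], implemented by left-padding with k-1 zeros."""
    T, d = x.shape
    k = len(kernel)
    x_pad = np.concatenate([np.zeros((k - 1, d)), x], axis=0)
    return np.stack([(kernel[:, None] * x_pad[t:t + k]).sum(axis=0)
                     for t in range(T)])

T, d, k = 32, 8, 3
x = np.random.randn(T, d)
kernel_q = np.random.randn(k)        # a real model learns one kernel per channel
kernel_k = np.random.randn(k)
Q = causal_conv1d(x, kernel_q)
K = causal_conv1d(x, kernel_k)
scores = Q @ K.T / np.sqrt(d)        # then apply the LogSparse mask as usual
```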

      • [[@Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting]]

      • AST [[Adversarial sparse transformer for time series forecasting]]

        • trains a sparse Transformer for time series forecasting using a generative-adversarial encoder-decoder architecture

        • adversarial training directly shapes the network's output to improve forecasts, avoiding the error accumulation of step-by-step prediction

          • directly shaping the output distribution of network to avoid the error accumulation through one-step ahead inference
      • [[Autoformer]]

        • a simple seasonal-trend decomposition architecture

        • an auto-correlation mechanism working as the attention module, with $$O(L\log L)$$ complexity (see the sketch below)

          • measures the time-delay similarity between input signals and aggregates the top-k similar sub-series to produce the output
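
A single-series caricature of the auto-correlation mechanism (all names and simplifications are mine): per-delay auto-correlation scores are computed with the FFT, the top-k delays are selected, and the correspondingly time-shifted series are aggregated with softmax weights.

```python
import numpy as np

def autocorrelation_scores(x):
    """Auto-correlation per time delay via the FFT (Wiener-Khinchin),
    which is what keeps the mechanism O(L log L)."""
    L = len(x)
    f = np.fft.rfft(x - x.mean())
    acf = np.fft.irfft(f * np.conj(f), n=L)
    return acf / acf[0]                      # normalize so lag 0 == 1

def autocorrelation_aggregate(x, k=3):
    """Pick the top-k delays by auto-correlation and aggregate the rolled
    (time-delayed) series, weighted by softmax over their scores."""
    acf = autocorrelation_scores(x)
    lags = np.argsort(acf[1:])[-k:] + 1      # ignore the trivial lag 0
    w = np.exp(acf[lags])
    w /= w.sum()
    return sum(wi * np.roll(x, lag) for wi, lag in zip(w, lags))

x = np.sin(np.linspace(0, 8 * np.pi, 256)) + 0.1 * np.random.randn(256)
y = autocorrelation_aggregate(x, k=3)
```
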
      • [[FEDformer]]

        • performs the attention operation in the frequency domain using the [[Fourier transform]] and [[Wavelet transform]]

          • linear complexity
      • [[@Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting]]

        • multi-horizon forecasting model with static covariate encoders, gating feature selection and temporal self-attention decoder
      • [[SSDNet]] [[ProTran]]

        • combine the Transformer with state space models to provide probabilistic forecasts
      • [[Pyraformer]]

        • hierarchical pyramidal attention module following a binary-tree path

      • [[Aliformer]]

        • Knowledge-guided attention
    • Spatio-Temporal Forecasting [[Traffic Flow Forecasting]]

      • Traffic transformer: Capturing the continuity and periodicity of time series for traffic forecasting

        • a self-attention module to capture temporal dependencies

        • a graph neural network module to capture spatial dependencies

      • Spatialtemporal Transformer

        • a spatial Transformer assists the graph convolutional network in capturing spatial dependencies
      • Spatio-temporal graph Transformer

        • an attention-based graph convolution mechanism
    • Event Forecasting

      • temporal point processes (TPP)
  • Anomaly Detection

  • Classification

    • [[GTN]]

Experimental Evaluation and Discussion

Robustness, model size, and the ability to capture seasonality and trend in time series are evaluated:

  • robustness analysis, model size analysis, and seasonal-trend decomposition analysis

  • seasonal-trend decomposition is an important component for Transformers to solve time series forecasting

  • adding the proposed moving-average trend decomposition architecture to each model improves performance over the original model (a minimal decomposition sketch follows)
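
A minimal sketch of the moving-average seasonal-trend decomposition block referred to above (kernel size and edge padding are my choices):

```python
import numpy as np

def series_decomposition(x, kernel_size=25):
    """Moving-average decomposition: the trend is a centered moving average
    (with edge padding) and the seasonal part is the residual."""
    pad = kernel_size // 2
    x_pad = np.concatenate([np.repeat(x[:1], pad), x, np.repeat(x[-1:], pad)])
    kernel = np.ones(kernel_size) / kernel_size
    trend = np.convolve(x_pad, kernel, mode="valid")
    seasonal = x - trend
    return seasonal, trend

t = np.arange(200)
x = 0.05 * t + np.sin(2 * np.pi * t / 24) + 0.1 * np.random.randn(200)
seasonal, trend = series_decomposition(x, kernel_size=25)
```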

Future Research Opportunities

  • [[inductive bias]] for Time Series Transformers

    • training Transformers requires large amounts of data to avoid overfitting.

    • time series data exhibit seasonal/periodic and trend patterns

    • introduce the understanding of time series models and task-specific characteristics into the Transformer as inductive biases

  • [[GNN]]

    • strengthen the ability to model spatial dependencies and relations across multiple dimensions
  • [[Pre-training]]

    • existing pre-trained Transformers for time series focus mainly on classification tasks
  • [[Neural architecture search]]

    • how to build efficient Transformer architectures automatically
