Tag: Survey - 算法花园

2024-10-05 / 2025-04-14, Notes, 8 min read (about 1,176 words)

A Survey of Transformers

[[PTM]] pre-train-model

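In 2017 Google published the paper "Attention Is All You Need", which proposed the Transformer architecture; a large body of research and applications has built on it since. Variants that improve on the original Transformer are called "X-formers".

X-formers improve on the original in three directions: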

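Model Efficiency
- Computation and parameter (memory) cost introduced by self-attention
  - sparse attention: lightweight attention mechanisms
  - divide-and-conquer methods
Model Generalization
- The architecture is flexible and imposes little structural bias on the data
- Training therefore needs a large amount of data
- structural bias or regularization, pre-training on large-scale unlabeled data
Model Adaptation
- Applying the Transformer to specific downstream tasks.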

Background

See [[Transformer]]

Usage modes of the model

Encoder-Decoder
Encoder only
- classification or sequence labeling
Decoder only
- sequence generation
  - language modeling

Taxonomy by how the original Transformer is improved: architecture modification, pre-training, and applications

architecture modification
- Module Level
  - [[Attention]]
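    - Challenges
      - Computational complexity, which grows with sequence length
      - Structural prior: attention carries no structural prior, so it overfits easily on small datasets
    - Sparse Attention
      - Compute attention only for token pairs i and j that are related, and store the result as a sparse matrix
      - How to define the relation
        
        position-based
        
        Compute attention only between specified positions
        
        atomic sparse attention
        
        Global
        
        Band
        
        Dilated
        
        Combine atomic attention patterns to obtain more complex attention rules
        
        content-based
        
        Routing Transformer
        
        Efficient Content-Based Sparse Attention with Routing Transformers
        
        Clustering
        
        [[Reformer]]
        
        Uses LSH; attention is computed among tokens that fall into the same bucket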
    - [[Linearized Attention]]
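      - Computing attention from Q, K, V costs $$O(T^2D)$$; introducing a kernel feature map makes the cost linear in the sequence length $$T$$ (roughly $$O(TD^2)$$ for a $$D$$-dimensional feature map)
      - key components
        
        kernel feature map
        
        Performer approximates the attention function with other functions
        
        FAVOR+: Fast Attention Via positive Orthogonal Random features approach
        
        aggregation rule
    - Query Prototyping and Memory Compression
      - Reduce the number of queries or key-value pairs
      - Query Prototyping: compute attention only for the key (prototype) queries; the remaining positions are filled in, for example with a uniform distribution
        
        [[@Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting]]
      - Memory Compression: reduce the number of key-value pairs
    - Low-rank Self-Attention
      - The rank of the attention matrix A (its number of linearly independent rows) is far smaller than the input length T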
      - Low-rank Parameterization
      - Low-rank Approximation
    - Attention with Prior
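      - Attention is split into generated attention and prior attention; the methods below are all attempts at producing the prior attention
      - Prior that Models locality
        
        Data such as text is position-sensitive; the prior is computed from the positions of i and j combined with a [[Normal Distribution]]
      - Prior from Lower Modules
        
        Use the attention distributions from previous layers
        
        Realformer applies a [[ResNet]]-style residual connection to the attention matrices
        
        Lazyformer computes the attention matrix once every two layers
      - Prior as Multi-task Adapters
        
        Multi-task adapters; this looks like a form of parameter sharing
      - Attention with Only Prior
        
        Use only the prior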
    - Improved Multi-Head Mechanism
      - Head Behavior Modeling
      - Multi-head with Restricted Spans
        
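        In the original Transformer, some heads are observed to attend locally and others globally
        
        Restrict the attention span (implemented via distance)
        
        The masked multi-head attention in the decoder follows this idea
      - Multi-head with Refined Aggregation
        
        How to merge the outputs of the heads
        
        routing methods
      - Other Modifications
        
        Shazeer: multi-query attention, where all heads share the same keys and values
        
        Bhojanapalli: flexible head size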
  - OTHER MODULE-LEVEL MODIFICATIONS
    - Position Representations
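      - The Transformer is permutation-equivariant (it ignores token order by itself), so it needs extra positional information
      - Absolute Position Representations
        
        Sinusoidal (sine/cosine) encoding
        
        Position vectors (learned embeddings)
      - Relative Position Representations.
        
        The relations between tokens matter more
        
        Add the relative position embedding to the keys in attention
        
        Transformer-XL
      - Other Representations
        
        TUPE
        
        Mixes relative and absolute positions
        
        Roformer
        
        Rotary position encoding
        
        Enables relative position encoding in linear attention
        
        Position Representations without Explicit Encoding
        
        R-Transformer: passes the input through an RNN first, then feeds the output to multi-head attention
        
        CPE: uses convolution
        
        Position Representation on Transformer Decoders
        
        Remove the positional encoding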
    - LayerNorm
      - Placement of Layer Normalization
        
        post-LN
        
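        pre-LN: keeps the skip-connection path free of any other operation
      - Substitutes of Layer Normalization
        
        The learnable parameters of LN do not work well
        
        AdaNorm
        
        scaled l2 normalization
        
        PowerNorm
      - Normalization-free Transformer
        
        ReZero: replaces LN with a learnable residual module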
    - FFN
      - Activation Function in FFN
        
        [[Swish]]
        
        [[GPT]] [[GELU]]
      - Adapting FFN for Larger Capacity
        
        product-key memory layers
        
        MoE
        
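        Select the top-k experts
        
        Select the single best expert
        
        Split experts into groups and take the top-1 of each group
      - Dropping FFN Layers
        
        Simplifies the network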
- Arch. Level
  - Adapting Transformer to Be Lightweight
    - Lite Transformer
    - Funnel Transformer
      - hidden sequence pooling and up-sampling
    - DeLighT
      - DeLighT block
  - Strengthening Cross-Block Connectivity
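    - Targets problems on the decoder side
    - Transparent Attention
    - Feedback Transformer
      - Uses information from all layers at the previous step
  - [[Adaptive Computation Time]]
    - Addresses the fixed number of layers in earlier models
    - Three methods
      - [[Universal Transformers]] dynamic halting
        
        Tokens that meet the halting condition are no longer updated
      - CCT
        
        Layer skipping
  - Transformers with Divide-and-Conquer Strategies
    - Split the long text of an LM task into multiple segments
    - Recurrent Transformers: the output of the previous Transformer pass is fed into the next input
      - Transformer-XL: the previous output is concatenated with the next input
    - Hierarchical Transformers: aggregate multiple results
      - Hierarchical for long sequence inputs
        
        sentence Transformer and document Transformer
      - Hierarchical for richer representations
        
        Character-level and word-level representations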
  - Exploring Alternative Architecture
    - NAS
- PRE-TRAINED TRANSFORMERS
  - Encoder only
    - [[BERT]]
  - Decoder only
- APPLICATIONS OF TRANSFORMER
  - CV
    - [[ViT]]
- CONCLUSION AND FUTURE DIRECTIONS
  - Theoretical analysis
  - Better global interaction mechanisms
  - A unified framework for multiple types of data
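
The Linearized Attention item in the outline above can be made concrete with a small NumPy sketch. It is not Performer's FAVOR+; it assumes a simple positive feature map (`elu(x) + 1`) chosen only for illustration, and the function names are hypothetical. The point it shows is that reassociating the matrix products avoids building the T x T attention matrix while giving the same result.

```python
import numpy as np

def feature_map(x):
    # elu(x) + 1: a simple positive feature map (an assumption for this sketch)
    return np.where(x > 0, x + 1.0, np.exp(x))

def quadratic_kernel_attention(Q, K, V):
    # Builds the full T x T similarity matrix: cost grows as O(T^2 D)
    S = feature_map(Q) @ feature_map(K).T
    return (S @ V) / S.sum(axis=1, keepdims=True)

def linearized_attention(Q, K, V):
    # Same quantity, reassociated so that no T x T matrix is formed
    phi_q, phi_k = feature_map(Q), feature_map(K)
    kv = phi_k.T @ V           # (D, D_v) summary of keys and values
    z = phi_k.sum(axis=0)      # (D,) normalizer
    return (phi_q @ kv) / (phi_q @ z)[:, None]

rng = np.random.default_rng(0)
T, D = 6, 4
Q, K, V = (rng.standard_normal((T, D)) for _ in range(3))
same = np.allclose(quadratic_kernel_attention(Q, K, V),
                   linearized_attention(Q, K, V))
```

Both functions compute the same kernelized attention; only the order of multiplication differs, which is where the saving in sequence length comes from.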

[[Layer Normalization]]
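
As a rough illustration of the LayerNorm placement notes in the outline (post-LN, pre-LN, ReZero), here is a minimal NumPy sketch. The layer norm omits the learnable gain and bias, `sublayer` stands for any attention or FFN function, and all names are hypothetical.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    # Normalize each row to zero mean and unit variance (no learnable parameters)
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def post_ln_block(x, sublayer):
    # post-LN: normalization is applied after the residual addition
    return layer_norm(x + sublayer(x))

def pre_ln_block(x, sublayer):
    # pre-LN: only the sublayer input is normalized, the skip path is the identity
    return x + sublayer(layer_norm(x))

def rezero_block(x, sublayer, alpha=0.0):
    # ReZero-style: no LN, the residual branch is scaled by a scalar that starts at 0
    return x + alpha * sublayer(x)

rng = np.random.default_rng(0)
W = rng.standard_normal((4, 4)) * 0.1
x = rng.standard_normal((3, 4))
y_post = post_ln_block(x, lambda h: h @ W)
y_pre = pre_ln_block(x, lambda h: h @ W)
```

In the pre-LN variant the input `x` reaches the output unchanged along the skip path, which is the property the note on pre-LN refers to.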

2024-10-05 / 2024-10-05, Notes, read in a few seconds (about 25 words)

Deep Learning for Click-Through Rate Estimation

Ref

How have CTR prediction models evolved? - Zhihu (zhihu.com)

Paper, CTR, Survey

2022-03-07 / 2024-10-05, Notes, 9 min read (about 1,416 words)

@Transformers in Time Series: A Survey

[[Abstract]]

Transformers have achieved superior performances in many tasks in natural language processing and computer vision, which also triggered great interests in the time series community.
Among multiple advantages of transformers, the ability to capture long-range dependencies and interactions is especially attractive for time series modeling, leading to exciting progress in various time series applications.
- In this paper, we systematically review transformer schemes for time series modeling by highlighting their strengths as well as limitations. In particular, we examine the development of time series transformers in two perspectives.
  - From the perspective of network structure, we summarize the adaptations and modification that have been made to transformer in order to accommodate the challenges in time series analysis.
  - From the perspective of applications, we categorize time series transformers based on common tasks including forecasting, anomaly detection, and classification.
- Empirically, we perform robust analysis, model size analysis, and seasonal-trend decomposition analysis to study how transformers perform in time series.
- Finally, we discuss and suggest future directions to provide useful research guidance.
A corresponding resource list which will be continuously updated can be found in the GitHub repository.

[[Attachments]]

Transformers in Time Series-2022.pdf

Input Encoding and Positional Encoding

Absolute Positional Encoding
Relative Positional Encoding
Hybrid positional encodings

Network Modifications for Time Series

[[Positional Encoding]]
- Vanilla Positional Encoding
- Learnable Positional Encoding
  - [[A transformer-based framework for multivariate time series representation learning]] introduce an embedding layer in Transformer that learn embedding vectors for each position index jointly with other model parameters.
  - [[Temporal Fusion Transformers]] encodes positions with an LSTM, which suits time series forecasting better
- Timestamp Encoding
  - calendar timestamps (hours, minutes, …) and special timestamps (holidays and events)
  - [[@Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting]] [[Autoformer]] [[FEDformer]] turn timestamp features into embeddings that are learned by the network
  - Producing a good timestamp encoding depends heavily on hand-crafted priors
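
A minimal sketch of the timestamp-encoding idea above: calendar and special-day features are extracted from each timestamp and mapped through embedding tables whose vectors are summed. This is not the code of Informer, Autoformer, or FEDformer; the feature set, the `HOLIDAYS` table, and the class name are assumptions, and the tables are random here where a real model would learn them.

```python
import numpy as np
from datetime import datetime

# Hypothetical set of special days, given as (month, day)
HOLIDAYS = {(1, 1), (12, 25)}

def calendar_features(ts):
    # hour of day, day of week, month index, holiday flag
    return np.array([ts.hour, ts.weekday(), ts.month - 1,
                     int((ts.month, ts.day) in HOLIDAYS)])

class TimestampEmbedding:
    def __init__(self, d_model, seed=0):
        rng = np.random.default_rng(seed)
        sizes = [24, 7, 12, 2]  # vocabulary size of each feature
        # Randomly initialized here; these tables would be trained in practice
        self.tables = [rng.standard_normal((n, d_model)) * 0.02 for n in sizes]

    def __call__(self, timestamps):
        feats = np.stack([calendar_features(t) for t in timestamps])  # (T, 4)
        # Look up one vector per feature and sum them: (T, d_model)
        return sum(tab[feats[:, i]] for i, tab in enumerate(self.tables))

stamps = [datetime(2024, 12, 25, 13, 0), datetime(2024, 12, 26, 0, 30)]
encoding = TimestampEmbedding(d_model=8)(stamps)
```

The resulting `encoding` would typically be added to the value embedding and the positional encoding of each time step.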
Attention Module
- Improving the efficiency of self-attention
  - [[LogTrans]] [Li et al., 2019] and [[Pyraformer]] explicitly introduce a sparsity bias
  - [[@Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting]] and [[FEDformer]] drop part of the entries of the self-attention matrix
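
To illustrate what an explicit sparsity bias can look like, the sketch below builds two boolean attention masks and applies them in a dense softmax. The log-sparse pattern (each position attends to itself and to earlier positions at exponentially growing distances) is a simplified reading of the LogTrans idea, not its actual implementation, and all function names are hypothetical.

```python
import numpy as np

def band_mask(T, w):
    # Each position attends to neighbours within distance w
    i, j = np.indices((T, T))
    return np.abs(i - j) <= w

def log_sparse_mask(T):
    # Causal pattern: position i attends to itself and to i - 1, i - 2, i - 4, ...
    mask = np.eye(T, dtype=bool)
    for i in range(T):
        step = 1
        while i - step >= 0:
            mask[i, i - step] = True
            step *= 2
    return mask

def masked_attention(Q, K, V, mask):
    # Dense computation for clarity; a real sparse kernel would skip masked pairs
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    scores = np.where(mask, scores, -np.inf)
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return (weights / weights.sum(axis=-1, keepdims=True)) @ V

rng = np.random.default_rng(0)
T, D = 16, 4
Q, K, V = (rng.standard_normal((T, D)) for _ in range(3))
out = masked_attention(Q, K, V, log_sparse_mask(T))
```

With the log-sparse mask each row keeps only about log2(T) entries, so the number of attended pairs grows as T log T instead of T squared.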

Architecture Level
- renovate transformer
- hierarchical architecture
  - Accounts for the multi-resolution nature of time series (multiple periods and trends superimposed)
    - [[@Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting]] max-pooling layer
    - [[Pyraformer]] C-ary tree base attention mechanism
      - nodes at the finest scale correspond to the original time series
      - nodes in the coarser scales represent series at lower resolutions
      - both intra-scale and inter-scale attentions in order to better capture temporal dependencies across different resolutions

Applications of Time Series Transformers

Forecasting
- Time Series Forecasting
  - [[LogTrans]]
    - proposes convolutional self-attention, which uses causal convolutions to generate the queries and keys in the self-attention layer
    - a Logsparse mask
  - [[@Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting]]
  - AST [[Adversarial sparse transformer for time series forecasting]]
    - Trains a sparse Transformer for time series forecasting with a generative adversarial encoder-decoder framework
    - Adversarial training directly shapes the network output to improve forecasts and avoids the error accumulation of step-by-step prediction
      - directly shaping the output distribution of the network to avoid error accumulation through one-step-ahead inference
  - [[Autoformer]]
    - simple seasonal-trend decomposition architecture
    - an auto-correlation mechanism working as an attention module, $O(L\log L)$
      - measures the time-delay similarity between input signals and aggregates the top-k similar sub-series to produce the output
  - [[FEDformer]]
    - Uses the [[Fourier transform]] and the [[Wavelet transform]] to perform the attention operation in the frequency domain
      - linear complexity
  - [[@Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting]]
    - multi-horizon forecasting model with static covariate encoders, gating feature selection and temporal self-attention decoder
  - [[SSDNet]] [[ProTran]]
    - combine the Transformer with state space models to provide probabilistic forecasts
  - [[Pyraformer]]
    - hierarchical pyramidal attention module with a binary tree following path
  - [[Aliformer]]
    - Knowledge-guided attention
- Spatio-Temporal Forecasting [[Traffic Flow Forecasting]]
  - Traffic transformer: Capturing the continuity and periodicity of time series for traffic forecasting
    - self attention module to capture temporal-temporal dependencies (temporal features)
    - Graph neural network module to capture spatial dependencies (spatial features)
  - Spatial-temporal Transformer
    - A spatial transformer assists the graph convolutional network in capturing spatial dependencies
  - Spatio-temporal graph Transformer
    - Attention-based graph convolution mechanism
- Event Forecasting
  - temporal point processes (TPP)
Anomaly Detection
Classification
- [[GTN]]
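
The auto-correlation mechanism noted under Autoformer above can be sketched for a single series with NumPy's FFT. This is a simplified illustration, not Autoformer's implementation: it treats the series as its own query, key, and value, uses circular autocorrelation, and the roll direction and function names are assumptions.

```python
import numpy as np

def autocorrelation(x):
    # Circular autocorrelation through the FFT, which costs O(L log L)
    f = np.fft.rfft(x)
    return np.fft.irfft(f * np.conj(f), n=len(x)) / len(x)

def top_k_delay_aggregation(x, k=2):
    r = autocorrelation(x)
    # Pick the k lags (excluding lag 0) with the highest autocorrelation
    delays = np.argsort(r[1:])[-k:] + 1
    # Softmax over the selected correlations gives the aggregation weights
    weights = np.exp(r[delays] - r[delays].max())
    weights = weights / weights.sum()
    # Aggregate the series shifted by each selected delay
    out = sum(w * np.roll(x, -d) for w, d in zip(weights, delays))
    return out, delays

t = np.arange(64)
x = np.sin(2 * np.pi * t / 8)  # toy series with period 8
aggregated, delays = top_k_delay_aggregation(x, k=2)
```

On the toy series the selected delays are multiples of the period, so the shifted copies line up with the original signal.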

Experimental Evaluation and Discussion

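Covers model robustness, model size, and the ability to capture seasonality and trend in time series:

robustness analysis, model size analysis, and seasonal-trend decomposition analysis

Seasonal-trend decomposition is an important component when transformers are used for time series forecasting.
After adding the moving average trend decomposition architecture, every model improves over its original version.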
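
A minimal sketch of a moving-average trend decomposition, assuming a 1-D series, an odd window, and edge padding by repeating the first and last values. The function name and the default window are hypothetical, and the models discussed in the survey apply this inside the network rather than as a preprocessing step.

```python
import numpy as np

def series_decomp(x, kernel_size=25):
    # Pad both ends with edge values so the trend has the same length as x
    pad = (kernel_size - 1) // 2
    padded = np.concatenate([np.repeat(x[0], pad), x,
                             np.repeat(x[-1], kernel_size - 1 - pad)])
    # Moving average gives the trend; the remainder is the seasonal part
    trend = np.convolve(padded, np.ones(kernel_size) / kernel_size, mode="valid")
    seasonal = x - trend
    return seasonal, trend

t = np.arange(200, dtype=float)
series = 0.05 * t + np.sin(2 * np.pi * t / 24)
seasonal, trend = series_decomp(series, kernel_size=25)
```

The two parts always sum back to the input, so a model can forecast them separately and add the results.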

Future Research Opportunities

[[inductive bias]] for Time Series Transformers
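- Avoids overfitting; training a transformer needs a large amount of data.
- Time series data has seasonal/periodic and trend patterns
- Introduce knowledge of time series models and task-specific characteristics into the transformer as inductive bias
[[GNN]]
- Strengthens the modeling of spatial dependencies and of relations among multiple dimensions
[[预训练]]
- Pre-trained transformers for time series currently focus on time series classification
[[Neural architecture search]]
- How to build efficient transformer architectures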

Ref

TODO [[A transformer-based framework for multivariate time series representation learning]]
TODO [[Adversarial sparse transformer for time series forecasting]]
DONE [[@Temporal Fusion Transformers for Interpretable Multi-horizon Time Series Forecasting]]
completed:: [[2022/11/08]]
TODO [[SSDNet]]
TODO [[ProTran]]
[[LogSparse Transformer]]
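A survey of Transformers applied to time series tasks [2022, by Alibaba DAMO Academy] - Zhihu (zhihu.com)
- Details that affect forecasting performance
  - Training
  - Feature engineering between encoders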

Paper, Transformer, Time Series Forecasting, Time Series Transformer, Computer Science - Artificial Intelligence, Computer Science - Machine Learning, Statistics - Machine Learning, Survey, DAMO Academy, Electrical Engineering and Systems Science - Signal Processing
