Tag: Paper - 算法花园

2025-04-19 (updated 2025-04-19) · 随手记 · read in a few seconds (about 106 characters)

QR hashing (Shi et al., 2019) offers a solution by decomposing large matrices into smaller ones using quotient and remainder techniques while preserving embedding uniqueness across IDs. #card
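
A minimal sketch of the quotient-remainder idea, to make the "smaller tables, still unique per ID" claim concrete. The class name `QREmbedding`, the random initialization and the element-wise product used to combine the two rows are assumptions for illustration, not the paper's reference implementation.

```python
import random


class QREmbedding:
    """Quotient-remainder embedding: two small tables stand in for one large table."""

    def __init__(self, num_ids, num_buckets, dim, seed=0):
        rng = random.Random(seed)
        self.num_buckets = num_buckets
        num_quotients = (num_ids + num_buckets - 1) // num_buckets  # ceiling division
        self.q_table = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_quotients)]
        self.r_table = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(num_buckets)]

    def codes(self, idx):
        # Every ID maps to a distinct (quotient, remainder) pair.
        return divmod(idx, self.num_buckets)

    def lookup(self, idx):
        q, r = self.codes(idx)
        # Combine the two rows; element-wise product is one possible choice.
        return [a * b for a, b in zip(self.q_table[q], self.r_table[r])]


emb = QREmbedding(num_ids=1000, num_buckets=32, dim=4)
vec = emb.lookup(123)
```

With 1000 IDs and 32 buckets, the two tables hold 64 rows in total instead of 1000, while every ID still gets its own pair of rows.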

Shrink the embedding vocabulary, and remove the need to maintain a vocabulary at all by using a collision-resistant hash function such as MurmurHash

Example of non static vocab hashing paradigm #card

Paper

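2025-04-14 (updated 2025-04-14) · 随手记 · 2 min read (about 364 characters)

Automatic feature crossing; addresses feature sparsity

FM compared with other models

FM can mimic a second-order polynomial kernel [[SVM]], MF and SVD++
- [[SVM]] training and prediction need the kernel matrix, whose size grows as N squared
- MF scales less well than FM: it only has two kinds of features, u and i

Compared with [[SVM]]

In a second-order polynomial kernel SVM, the pairwise interaction weights wij are independent of each other
- FM has n×k parameters versus n×n for the SVM, so FM suits large-scale sparse features better and generalizes better
Kernel methods need to compute the kernel matrix
Samples :-> FM prediction does not depend on the training samples; SVM prediction depends on the support vectors
FM is optimized in the primal form; a nonlinear SVM usually has to be solved in the dual form
Do the interaction terms need to be multiplied by the feature value?
Putting eta into xi and xj generalizes poorly

How to add index embeddings to FM?

What are the differences between FM and SVM?

Feature perspective :-> in a second-order polynomial kernel SVM, the pairwise interaction weights wij are independent of each other
Why optimize FM with FTRL? #card
FTRL is an SGD-style algorithm; hyperparameters are tuned offline, which reduces online risk
For sparse features, adaptive learning rates work best (the learning rate of feature i differs at each iteration t)
Different features learn at different speeds, and convergence is fast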

[[Ref]]

深入FFM原理与实践 - 美团技术团队

Paper, Algorithm

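2025-04-13 (updated 2025-04-19) · 随手记 · 3 min read (about 522 characters)

@Embedding-based Retrieval in Facebook Search

Besides the main text features, side information such as the location and social relations of the user and the doc is added to strengthen query-doc matching.
Training objective #card
- Defined on the distance between the two towers' output vectors: positive pairs should be as close as possible (similarity as high as possible) and negative pairs as far apart as possible (similarity as low as possible).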
- [[Triplet Loss]]
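
The objective above can be sketched as a triplet loss over cosine distances. This is only an illustration: the function names and the margin value `0.2` are assumptions, and a real two-tower model would compute this on batched tensors.

```python
import math


def cosine(a, b):
    """Cosine similarity between two vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def triplet_loss(query, pos_doc, neg_doc, margin=0.2):
    """max(0, D(q, d+) - D(q, d-) + margin) with D = 1 - cosine similarity."""
    d_pos = 1.0 - cosine(query, pos_doc)
    d_neg = 1.0 - cosine(query, neg_doc)
    return max(0.0, d_pos - d_neg + margin)


easy = triplet_loss([1.0, 0.0], [1.0, 0.0], [0.0, 1.0])   # positive already closer
wrong = triplet_loss([1.0, 0.0], [0.0, 1.0], [1.0, 0.0])  # negative closer than positive
```

The loss is zero once the positive is closer than the negative by at least the margin, so training effort goes to triplets that are still ordered wrongly.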

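The baseline's sample construction is also simple: clicked query-doc pairs are the positive pairs, and there are two choices of negatives: #card

Random negative sampling: for each query, sample a proportional number of negatives at random from the doc pool.
Impressed-but-not-clicked samples: for each query, randomly pick docs that were shown in the session but not clicked.
The paper's experiments show the former is clearly better than the latter. The reason is that the latter makes the training distribution differ markedly from the distribution seen at serving time, i.e. there is severe sample selection bias.

Problems with embedding-based retrieval

Pressure that the candidate set puts on offline training and online serving
The matching problem

[[新召回往往会存在后链路低估的问题，如何克服这个问题带来增量？]] (a new retrieval channel tends to be undervalued by the downstream stages; how to overcome this and deliver gains?) #card

Use the embeddings produced by retrieval as features in the ranking stage, either the embeddings themselves or various similarities between the query and doc embeddings; extensive experiments show that cosine similarity works well.
To address the low precision of embedding retrieval, the retrieved results are sent directly for human labeling and the model is then trained on those labels. This approach is brute-force and inefficient.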

Ref

Paper

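2025-03-14 (updated 2025-04-19) · 随手记 · 3 min read (about 375 characters)

@LiRank: Industrial Large Scale Ranking Models at LinkedIn

Thoughts

Mostly engineering experience, covering every aspect of search, ads and recommendation systems. If you have not run into the corresponding problems, it reads like gibberish. Treat it as a handbook to look things up in.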

Large Ranking Models

[[3.1 Feed Ranking Model]]
[[3.2 Ads CTR Model]]
[[Residual DCN]]
[[Isotonic Calibration Layer in DNN]]
[[Dense Gating and Large MLP]]
[[3.6.Incremental Training]]
[[3.7 Member History Modeling]]
[[3.8 Explore and Exploit]]
[[3.9 Wide Popularity Features]]
[[3.10 Multi-task Learning]]
[[3.11 Dwell Time Modeling]]
[[3.12 Model Dictionary Compression]]
[[3.13 Embedding Table Quantization]]

Training scalability

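[[4.1 4D Model Parallelism]] addresses the performance bottleneck of synchronizing embedding-table gradients during training
[[4.2 Avro Tensor Dataset Loader]] addresses the I/O bottleneck during training
[[4.3 Offload Last-mile Transformation to Asynchronous Data Pipeline]] optimizes the training process
[[4.4 Prefetch Dataset to GPU]] prefetches data to the GPU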

Experiments

[[5.1. Incremental Learning]]
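- Effect of incremental learning in two scenarios
[[5.2 Feed Ranking]]
- Evaluates three strategies with a replay metric
[[5.3 Jobs Recommendations]]
- [[Jobs Recommendations Ranking Model Architecture]]
- Verifies that the compression method of [[3.12 Model Dictionary Compression]] causes no performance loss
- [[Dense Gating and Large MLP]] brought no improvement
5.4 Ads CTR
- Results #card
  - Baseline: GDMix model

6 Deployment Lessons

6.1 Scaling up Feed Training Data Generation
- Did not fully follow this; it seems to be a performance optimization that makes training on 100% of sessions possible
6.2 Model Convergence
- [[DCNv2]] did not converge in initial training #card
  - With batch normalization applied to the numeric input features, the model underfits at the current number of training steps, but adding steps slows down experimentation.
  - Adding warm-up steps stabilizes training and allows a three times higher learning rate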

Paper

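2024-10-05 (updated 2025-04-14) · 随手记 · 8 min read (about 1176 characters)

A Survey of Transformers

[[PTM]] pre-trained model

In 2017 Google published "Attention is all you need", which proposed the Transformer architecture; a large body of research and applications built on it followed. Variants that improve the original Transformer are called "X-formers".

X-formers improve on the original in three directions:

Model Efficiency
- The compute and parameter (memory) cost of self-attention
  - sparse attention: lightweight attention schemes
  - divide-and-conquer methods
Model Generalization
- The architecture is flexible and imposes little structural bias on the data
- Training needs a large amount of data
- structural bias or regularization, pre-training on large-scale unlabeled data
Model Adaptation
- Adapting the Transformer to specific downstream tasks.

Background

See [[Transformer]]

Usage modes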

Encoder-Decoder
Encoder only
- classification or sequence labeling
Decoder only
- sequence generation
  - language modeling

Taxonomy by how the original Transformer is modified: architecture modification, pre-training, and applications

architecture modification
- Module Level
  - [[Attention]]
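    - Challenges
      - Computational complexity, which grows with the sequence length
      - Structural prior: there is none, so the model overfits easily on small datasets
    - Sparse Attention
      - Attention is computed only when tokens i and j are related, and is stored as a sparse matrix
      - How the relation is defined
        
        position-based
        
        Attention is computed only between designated positions
        
        atomic sparse attention
        
        Global
        
        Band
        
        Dilated
        
        Atomic attention patterns are combined into more complex attention rules
        
        content-based
        
        Routing Transformer
        
        Efficient Content-Based Sparse Attention with Routing Transformers
        
        Clustering
        
        [[Reformer]]
        
        Uses LSH; attention is computed among the tokens that fall in the same bucket
    - [[Linearized Attention]]
      - Computing attention from QKV costs $$O(T^2D)$$; introducing a kernel feature map reduces this to a cost that is linear in $$T$$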
      - key components
        
        kernel feature map
        
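        Performer approximates the attention function with other functions
        
        FAVOR+ Fast Attention Via positive Orthogonal Random features approach
        
        aggregation rule
    - Query Prototyping and Memory Compression
      - Reduce the number of queries or key-value pairs
      - Query Prototyping: compute attention only for the key queries; the rest are padded or filled with a uniform distribution
        
        [[@Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting]]
      - Memory Compression: reduce the number of key-value pairs
    - Low-rank Self-Attention
      - The rank of the attention matrix A (its number of linearly independent rows) is far smaller than the input length T
      - Low-rank Parameterization
      - Low-rank Approximation
    - Attention with Prior
      - Attention is split into generated attention and prior attention; the methods below are all attempts at producing the prior attention
      - Prior that Models locality
        
        Data such as text is position-sensitive; the prior is computed from the positions of i and j with a [[Normal Distribution]]
      - Prior from Lower Modules
        
        Use the attention distributions of earlier layers
        
        Realformer puts a [[ResNet]]-style residual structure onto the attention matrix
        
        Lazyformer computes the attention matrix once every two layers
      - Prior as Multi-task Adapters
        
        Multi-task adapters; this looks like parameter sharing
      - Attention with Only Prior
        
        Use only the prior
    - Improved Multi-Head Mechanism
      - Head Behavior Modeling
      - Multi-head with Restricted Spans
        
        Observation: in the original model some heads attend locally and others globally
        
        Restrict the attention span (implemented via distance)
        
        The masked multi-head attention in the decoder follows the same idea
      - Multi-head with Refined Aggregation
        
        How the outputs of the heads are merged
        
        routing methods
      - Other Modifications
        
        Shazeer: multi-query attention shares the keys and values across all heads
        
        Bhojanapalli: flexible head size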
  - OTHER MODULE-LEVEL MODIFICATIONS
    - Position Representations
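      - The Transformer is permutation invariant, so it needs additional position information
      - Absolute Position Representations
        
        Sinusoidal (sine and cosine) encoding
        
        Position vectors
      - Relative Position Representations.
        
        The relation between tokens matters more
        
        Add the embedding to the keys in attention
        
        Transformer-XL
      - Other Representations
        
        TUPE
        
        Mixes relative and absolute positions
        
        Roformer
        
        Rotary position encoding
        
        Implements relative position encoding in linear attention
        
        Position Representations without Explicit Encoding: no explicit encoding
        
        R-Transformer: pass the input through an RNN first, then feed the output into multi-head attention
        
        CPE uses convolution
        
        Position Representation on Transformer Decoders
        
        Remove the position encoding
    - LayerNorm
      - Placement of Layer Normalization
        
        post-LN
        
        pre-LN: guarantees that nothing else sits on the skip-connection path
      - Substitutes of Layer Normalization
        
        The learnable parameters turn out not to help
        
        AdaNorm
        
        scaled l2 normalization
        
        PowerNorm
      - Normalization-free Transformer
        
        ReZero: a learnable residual module replaces LN
    - FFN
      - Activation Function in FFN
        
        [[Swish]]
        
        [[GPT]] [[GELU]]
      - Adapting FFN for Larger Capacity
        
        product-key memory layers
        
        MoE
        
        Take the top-k experts
        
        Take the single best expert
        
        Group the experts and take the top-1 of each group
      - Dropping FFN Layers
        
        Simplifies the network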
- Arch. Level
  - Adapting Transformer to Be Lightweight
    - Lite Transformer
    - Funnel Transformer
      - hidden sequence pooling and up-sampling
    - DeLighT
      - DeLighT block
  - Strengthening Cross-Block Connectivity
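    - Addresses problems in the decoder
    - Transparent Attention
    - Feedback Transformer
      - Uses information from all layers of the previous step
  - [[Adaptive Computation Time]]
    - Addresses the fixed number of layers in earlier models
    - Three methods
      - [[Universal Transformers]] dynamic halting
        
        Tokens that meet the halting condition stop changing
      - CCT
        
        Layer skipping
  - Transformers with Divide-and-Conquer Strategies
    - Split the long text of an LM task into several segments
    - Recurrent Transformers: the output of the previous segment is fed into the next input
      - Transformer-XL: the previous output and the next input are concatenated
    - Hierarchical Transformers: aggregate multiple results
      - Hierarchical for long sequence inputs
        
        sentence Transformer and document Transformer
      - Hierarchical for richer representations
        
        Character-level and word-level representations
  - Exploring Alternative Architecture
    - NAS
- PRE-TRAINED TRANSFORMERS
  - Encoder only
    - [[BERT]]
  - Decoder only
- APPLICATIONS OF TRANSFORMER
  - CV
    - [[ViT]]
- CONCLUSION AND FUTURE DIRECTIONS
  - Theoretical analysis
  - Better global interaction mechanisms
  - A framework for handling multiple kinds of data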
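
The Linearized Attention item in the outline above can be illustrated with a small sketch. It assumes a non-causal setting and the feature map `elu(x) + 1` (one common choice, not the only one); the function names are hypothetical. The quadratic version builds every query-key weight, while the linear version first aggregates keys and values once and then reuses that summary for every query.

```python
import math


def phi(x):
    """Feature map elu(x) + 1, which keeps every entry positive."""
    return [v + 1.0 if v > 0 else math.exp(v) for v in x]


def quadratic_attention(Q, K, V):
    """Kernel attention computed pair by pair: cost grows with T * T."""
    out = []
    for q in Q:
        fq = phi(q)
        weights = [sum(a * b for a, b in zip(fq, phi(k))) for k in K]
        z = sum(weights)
        out.append([sum(w * v[d] for w, v in zip(weights, V)) / z for d in range(len(V[0]))])
    return out


def linear_attention(Q, K, V):
    """Same result, but keys/values are summarized once: cost grows with T."""
    dk, dv = len(K[0]), len(V[0])
    S = [[0.0] * dv for _ in range(dk)]  # sum_j phi(k_j) v_j^T
    z = [0.0] * dk                       # sum_j phi(k_j)
    for k, v in zip(K, V):
        fk = phi(k)
        for a in range(dk):
            z[a] += fk[a]
            for d in range(dv):
                S[a][d] += fk[a] * v[d]
    out = []
    for q in Q:
        fq = phi(q)
        denom = sum(fq[a] * z[a] for a in range(dk))
        out.append([sum(fq[a] * S[a][d] for a in range(dk)) / denom for d in range(dv)])
    return out


Q = [[0.1, -0.2], [0.3, 0.4]]
K = [[0.5, -0.1], [-0.3, 0.2], [0.0, 0.7]]
V = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
slow = quadratic_attention(Q, K, V)
fast = linear_attention(Q, K, V)
```

Both functions return the same values; only the order of the multiplications changes, which is where the saving comes from.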

[[Layer Normalization]]

Paper, Algorithm, Survey

2024-10-05 (updated 2024-10-05) · 随手记 · read in a few seconds (about 0 characters)

A Transformer-based Framework for Multivariate Time Series Representation Learning

Paper, representation, learning

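2024-10-05 (updated 2024-12-19) · 随手记 · 7 min read (about 1058 characters)

BERT

[[@BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding]]

Code: google-research/bert: TensorFlow code and pre-trained models for BERT

A large model plus fine-tuning improves results on small tasks

Input layer

Token embedding, position embedding and segment embedding
- The pre-training tasks include judging the relation between segment A and segment B
Architecture: 12 layers, each with 12 attention heads
CLS marks the start of the input; its final output embedding represents the whole sentence
- A symbol with no semantics of its own fuses the semantics of all the words more fairly, so it represents the meaning of the whole sentence better
SEP separates sentences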

BERT

L=12 H=768 A=12, Total Parameters=110M
L=24 H=1024 A=16, Total Parameters=340M

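Two kinds of NLP pre-training

1. Produce an artifact, e.g. word2vec embeddings
1. Serve as a backbone to which new structure is attached

[[ELMo]]

Uses an LSTM to predict the next word

[[GPT]]

Transformer
Unidirectional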


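Contributions

The importance of bidirectional information

Model input:

1. Token emb
1. Segment emb (A B), for QA or two-sentence tasks
1. Position emb

Training

[[Masked-Language Modeling]] :-> mask some of the words: 80% become [MASK], 10% become a wrong (random) word, 10% keep the correct word
- Purpose :-> train the model to memorize the relations between sentences.
  - Reduces the impact of the mismatch between the pre-training and fine-tuning objectives
[[Next Sentence Prediction]] :-> predict whether B is the next sentence
- Sentence B actually follows sentence A with 50% probability
- Which downstream problems it helps :-> QA and natural language inference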
  
  [[激活函数]] [[GELU]]
Same as [[GPT]]. Why?
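
The 80/10/10 masking rule from the training notes above can be sketched as follows. The function name `mask_tokens`, the toy vocabulary and the `select_prob` parameter (15% in the paper) are illustrative assumptions; real implementations work on token IDs and skip special tokens.

```python
import random


def mask_tokens(tokens, vocab, rng, select_prob=0.15):
    """Return (inputs, labels); labels are None where no prediction is required."""
    inputs = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if rng.random() >= select_prob:
            continue  # token not selected for prediction
        labels[i] = tok
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = "[MASK]"           # 80%: replace with the mask symbol
        elif roll < 0.9:
            inputs[i] = rng.choice(vocab)  # 10%: replace with a random word
        # remaining 10%: keep the original word
    return inputs, labels


tokens = ["the", "cat", "sat", "on", "the", "mat"]
vocab = ["dog", "ran", "blue"]
all_inputs, all_labels = mask_tokens(tokens, vocab, random.Random(0), select_prob=1.0)
no_inputs, no_labels = mask_tokens(tokens, vocab, random.Random(0), select_prob=0.0)
```

Keeping some selected tokens unchanged or randomly replaced means the model cannot rely on `[MASK]` being present, which is the mismatch with fine-tuning that the note mentions.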

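Optimizer

An incomplete version of Adam (without bias correction)
Fine-tuning with it can be unstable; switch to the standard Adam

fine tune

Adapt the input and add a prediction head for the task, then train on task data
Fine-tuning works better than feeding BERT's output into another model as features
1. Sentence-pair classification
1. Single-sentence classification
- softmax on top of CLS
1. Predict a start and an end embedding, take the softmax against each T, and pick the most probable positions as start and end
1. Entity tagging

A study of how different choices of embedding perform

Limitations

Weak at generation tasks (machine translation, text summarization)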

[[Ref]]

[[Multimodal BERT]]
The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) – Jay Alammar – Visualizing machine learning one concept at a time
如何评价 BERT 模型？ - 知乎
NLP 从语言模型看Bert的善变与GPT的坚守 - 知乎
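- Why does a bidirectional language model like BERT need masked LM? Why has [[GPT]] stuck with a unidirectional language model? ELMo also claims to be bidirectional, so why does it not need masking? Why does CBOW in [[Word2Vec]] not need masking either?
- indirectly see themselves
- GPT keeps the ability to generate the continuation from the preceding context
为什么 Bert 的三个 Embedding 可以进行相加？ - 知乎
- Adding versus concatenating the three embeddings
  - Connection :-> adding the three embeddings is equivalent to concatenating the three original one-hot vectors and passing them through one fully connected layer.
    - Advantage :-> compared with concatenation, addition saves model parameters.
    - Experiments show that concatenation is no better than addition; it increases the dimension, so another linear projection is needed to bring it down, which adds more parameters.
- Earlier intuition: waves of different wavelengths can be added and still be separated afterwards, so the model should be able to tell the components apart too.
- The space is very high-dimensional, so the model can separate the components
  - Size of the input space: 30k × 2 × 512
  - The model's representational capacity is at least 2^768
- Gradient view: (f + g + h)' = f' + g' + h'
BERT—容易被忽视的细节
- Detail 3: for task 1, 15% of the tokens are selected at random; of these, 80% are replaced by [mask], 10% are left unchanged and 10% are replaced by a random word. Why? #card
  - [mask] never appears in fine-tuning tasks, so the model would not know how to handle the input.
  - The unchanged and randomly replaced tokens mitigate this
  - Only 15% of the tokens are predicted, so more training steps are needed to converge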

Paper, Algorithm, Google

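2024-10-05 (updated 2024-12-08) · 随手记 · 2 min read (about 252 characters)

DART

Main idea :-> each newly added tree does not fit the negative gradient of the ensemble of all previous trees, but the negative gradient of the ensemble of a random subset of those trees.

Addresses the over-specialization problem of GBDT
- Symptom :-> the trees from early iterations contribute most of the prediction, while later trees concentrate on correcting the error of a small portion of the samples
- Usual remedy :-> Shrinkage
  Algorithm flow chart
S1 :-> the training dataset
T1 :-> the decision tree trained on S1
For decision trees 2 to N #card #incremental
- Randomly draw a set of trees D from M; $\hat{M}$ is the set difference of M and D
- Use $\hat{M}$ to compute the negative gradient of each sample, giving dataset St
- Train Tt on St
- Adjust the weight of Tt
  - The negative gradient comes from the trees in $\hat{M}$ only, so the missing part is in effect fitted jointly by Tt and the trees in D; Tt therefore has to be scaled down by a factor of |D|+1
- Adjust the weights of the trees in D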
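
A minimal sketch of the two weight adjustments at the end of the loop above. It assumes the normalization described in the DART paper: the new tree is scaled by 1/(|D|+1) and each dropped tree by |D|/(|D|+1). The function name `dart_rescale` and the plain list of weights are illustrative; libraries that combine this with a learning rate use slightly different factors.

```python
def dart_rescale(weights, dropped):
    """Rescale tree weights after fitting a new tree with the trees in `dropped` left out.

    weights: current weight of every existing tree
    dropped: indices of the trees in D
    Returns (updated weights, weight of the new tree).
    """
    k = len(dropped)
    updated = list(weights)
    for i in dropped:
        updated[i] *= k / (k + 1)   # dropped trees shrink by |D| / (|D| + 1)
    new_tree_weight = 1.0 / (k + 1)  # new tree shrinks by 1 / (|D| + 1)
    return updated, new_tree_weight


scaled, new_w = dart_rescale([1.0, 1.0, 1.0], dropped=[0, 2])
```

With two dropped trees, each of them keeps two thirds of its weight and the new tree enters with one third, so the group that fitted the same residual does not overshoot.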

[[lightgbm 使用记录]] Early stopping is not available in dart mode

Paper, LightGBM

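2024-10-05 (updated 2025-04-29) · 随手记 · 3 min read (about 453 characters)

DCN

Builds on Wide & Deep and improves the Wide part. LR cannot cross features, and FM is usually limited to second-order crosses for performance reasons; Cross can realize high-order crosses. Compared with a DNN, DCN needs fewer parameters for the same quality.

The Cross network can only handle fixed-length input, so it cannot be used in [[ETA]] v4 ...

Features are processed as in ordinary models: sparse features go through embeddings and are then concatenated with the dense features. Because every layer of the Cross network has the same size as the input vector, leaving sparse features unprocessed would make the input dimension, and with it the parameter count, very large.

The outputs of Cross and Deep are concatenated and passed through an LR layer to produce the final output.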

Cross Network

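Each layer crosses $x_0$ with the previous layer's output $x_l$ and learns a residual.
The ResNet-style residual connection lets the network go deeper.
Properties: bounded high order, automatic cross products, parameter sharing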
$\mathbf{x}_{l+1}=\mathbf{x}_{0} \mathbf{x}_{l}^{T} \mathbf{w}_{l}+\mathbf{b}_{l}+\mathbf{x}_{l}=f\left(\mathbf{x}_{l}, \mathbf{w}_{l}, \mathbf{b}_{l}\right)+\mathbf{x}_{l}$ #card
- The figure shows that the parameters w grow linearly with the number of layers.
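
A minimal sketch of one cross layer, following the formula above term by term. The function name and the toy numbers are illustrative; because $\mathbf{x}_l^T \mathbf{w}_l$ is a scalar, each layer only adds one weight vector and one bias vector of the input size.

```python
def cross_layer(x0, xl, w, b):
    """x_{l+1} = x_0 * (x_l^T w_l) + b_l + x_l for vectors given as lists."""
    s = sum(xi * wi for xi, wi in zip(xl, w))  # x_l^T w_l is a scalar
    return [x0_i * s + b_i + xl_i for x0_i, b_i, xl_i in zip(x0, b, xl)]


x0 = [1.0, 2.0]
x1 = cross_layer(x0, x0, w=[0.5, 0.5], b=[0.0, 0.0])
x2 = cross_layer(x0, x1, w=[0.5, 0.5], b=[0.0, 0.0])
```

Stacking the call, as done for `x2`, raises the highest interaction order by one per layer while every layer keeps the input dimension.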

Deep NetWork

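The Cross Network has too few parameters to learn high-dimensional nonlinear features.

Analysis

By design, Cross ends up containing feature combinations of every order, from first order up to high order, for each feature. Unlike FM, the parameters of the feature combinations are partly shared, which lowers the parameter count and gives better generalization and robustness than FM. For example, FM can handle the case where xi and xj never occurred together, whereas Cross can handle the case where neither xi nor xj has occurred ... #card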

Paper, Algorithm

2024-10-05 (updated 2024-10-05) · 随手记 · read in a few seconds (about 25 characters)

Deep Learning for Click-Through Rate Estimation

Ref

CTR预估模型有怎样的发展规律？ - 知乎 (zhihu.com)

Paper, CTR, Survey

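2024-10-05 (updated 2025-04-29) · 随手记 · 3 min read (about 520 characters)

FTRL

FTL (Follow The Leader): one approach to online learning #card

To dampen the random perturbation of individual samples, each step picks the parameters that minimize the sum of all previous loss functions.
$w=\operatorname{argmin}_{w} \sum_{i=1}^{t} f_{i}(w)$
FTRL: the FTL algorithm with a regularization term #card
$w=\operatorname{argmin}_{w} \sum_{i=1}^{t} f_{i}(w)+R(w)$
Solved via a surrogate loss function

[[稀疏性]] Benefits of a sparse model

Lower memory use and complexity at prediction time, since most parameters are zero
L1 regularization not only yields sparsity but also lowers the risk of overfitting
Sparse models are comparatively easier to interpret.

Why can SGD not guarantee model sparsity? #card

Unlike batch training, an online update does not descend along the global gradient but along the gradient of a single sample, so the whole search looks like a "random" walk (hence the Stochastic in SGD). As a result, online optimization rarely produces sparse solutions even with L1 regularization.

With a large dataset, computing the global gradient at every step becomes too expensive and training takes too long.

Online learning: samples are processed one at a time and discarded afterwards.

Properties #card

One learning rate per feature (also the case in [[Adam]])
Fast convergence
L1 regularization brings sparsity, L2 regularization brings smoothness [[弹性网络回归]]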
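
The three properties above show up directly in the per-coordinate FTRL-Proximal update. The sketch below follows the commonly published form of that update; the class name and the default values of `alpha`, `beta`, `l1` and `l2` are illustrative assumptions, and gradients are passed as a dict so that only active features are touched.

```python
import math


class FTRLProximal:
    """Per-coordinate FTRL-Proximal with L1 and L2 regularization."""

    def __init__(self, dim, alpha=0.1, beta=1.0, l1=1.0, l2=1.0):
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z = [0.0] * dim  # accumulated (shifted) gradients
        self.n = [0.0] * dim  # accumulated squared gradients

    def weight(self, i):
        z = self.z[i]
        if abs(z) <= self.l1:
            return 0.0  # L1 threshold keeps the weight exactly zero (sparsity)
        sign = 1.0 if z > 0 else -1.0
        # The step size depends on n[i], i.e. one learning rate per feature.
        return -(z - sign * self.l1) / ((self.beta + math.sqrt(self.n[i])) / self.alpha + self.l2)

    def update(self, grads):
        """grads maps feature index -> gradient for the current sample."""
        for i, g in grads.items():
            w = self.weight(i)
            sigma = (math.sqrt(self.n[i] + g * g) - math.sqrt(self.n[i])) / self.alpha
            self.z[i] += g - sigma * w
            self.n[i] += g * g


opt = FTRLProximal(dim=2)
opt.update({0: 0.5})
small = opt.weight(0)   # still below the L1 threshold
opt.update({0: 5.0})
large = opt.weight(0)   # threshold exceeded, weight becomes non-zero
```

A feature whose accumulated gradient stays under the L1 threshold keeps an exact zero weight, which is how the sparsity arises even though samples arrive one at a time.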

How they choose to center the additional strong convexity used to guarantee low regret: RDA centers this regularization at the origin, while FOBOS centers it at the current feasible point. FTRL combines the high accuracy of [[FOBOS]] with the better sparsity of RDA

How they handle an arbitrary non-smooth regularization function $\Psi$ . This includes the mechanism of projection onto a feasible set and how $L_1$ regularization is handled.

Ref

各大公司广泛使用的在线学习算法FTRL详解 - EE_NovRain - 博客园 (cnblogs.com) (includes some engineering details)

Paper

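2024-10-05 (updated 2024-10-05) · 随手记 · 8 min read (about 1160 characters)

MoCo

Summary

Closes the gap between supervised and unsupervised learning in CV

Abstract

dictionary look-up
- Proposes a queue plus momentum contrast for unsupervised representation learning.

Introduction

Novelty: represent the dictionary as a queue
- What kind of dictionary suits contrastive learning?
  - (i) large
    - More samples are drawn from the continuous high-dimensional space; the more keys the dictionary has, the richer the information it represents
    - A small dictionary with few keys makes the model generalize poorly
  - (ii) consistent
    - The keys in the dictionary should be produced by the same or a similar encoder
    - If the keys come from different encoders, a lookup may return the key produced by the same or a similar encoder as the query rather than the semantically similar key
Why unsupervised learning had not succeeded in CV
- The raw signal spaces differ
- NLP signals are discrete (words, roots, affixes), so tokenized dictionaries are easy to build for unsupervised learning
  - tokenized: map each word to a feature
  - Why do tokenized dictionaries help unsupervised learning?
  - Each dictionary key can be treated as a class, which supplies label-like information for learning
  - Unsupervised learning in NLP is easy to model, and the resulting models are easy to optimize
- CV signals are continuous and high-dimensional; unlike words they carry no condensed, concise semantics, so they do not lend themselves to a dictionary
- Without a dictionary, unsupervised learning is hard to model
The two main parts of unsupervised learning:
- pretext tasks
  - Learn a better feature representation
  - Common pretext tasks
    - denoising auto-encoders: reconstruct the whole image
    - context auto-encoders: reconstruct a patch
    - cross-channel auto-encoders (colorization): colorizing the image is the self-supervised signal
    - pseudo-labels: generate pseudo labels for images
    - exemplar image: different augmentations of the same image all belong to one class.
    - patch ordering: the 3×3 grid method; shuffle the patches and predict their order, or pick a random patch and predict its position among eight positions
    - Tracking, using the temporal order of video
    - Clustering methods: clustering features
- loss functions
  - Measure the difference between the model's prediction and a fixed target
  - L1 or L2 Loss
    - Auto-encoder
  - Discriminative networks
    - eight position
      - Split the image into a 3×3 grid and predict in which direction the chosen patch lies relative to the center patch.
  - Contrastive loss: the target is not fixed and keeps changing during training
  - Adversarial loss: measures the difference between two probability distributions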

Method

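Pretext task: instance discrimination

+ An image $$x_{i}$$ is flipped, cropped and so on to give $$x_{i1}$$ and $$x_{i2}$$; these two form a positive pair, and all other images are negatives.

+ $$x_{i1}$$ is the anchor

+ $$x_{i2}$$ is the positive

+ Encoders E11 and E12 may be the same or different.

+ How is contrastive learning done?

  + f11 and f12 are pulled together and pushed away from the other samples

    + matching key and dissimilar to others

    + Learning is formulated as minimizing a contrastive loss

  + f11 acts as the query and looks up the closest key in the dictionary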

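How to build a large and consistent dictionary
- Queue-based dictionary
  - Removes the batch-size limit
  - The queue size sets the dictionary size
- Momentum-based encoder
  - $\theta_{\mathrm{k}} \leftarrow m \theta_{\mathrm{k}}+(1-m) \theta_{\mathrm{q}}$
  - The momentum encoder is initialized from the current encoder
  - With a large momentum such as m=0.999, the momentum encoder updates slowly, which keeps the keys in the queue as close as possible to being produced by similar encoders
Comparison with earlier methods
- end-to-end sacrifices dictionary size
  - The number of negatives equals the batch size
- memory-bank sacrifices consistency
  - Negatives are obtained by sampling
- moco
  - The encoder is updated by gradients
  - The momentum encoder gets a momentum update from the encoder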
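
The queue and the momentum update described above fit in a few lines. This is a sketch on plain Python lists: the function name, the toy parameter vectors and the queue size of 3 are assumptions chosen so the behavior is easy to follow.

```python
from collections import deque


def momentum_update(theta_k, theta_q, m=0.999):
    """theta_k <- m * theta_k + (1 - m) * theta_q, applied element by element."""
    return [m * k + (1.0 - m) * q for k, q in zip(theta_k, theta_q)]


# The dictionary is a fixed-size FIFO queue: new keys push out the oldest ones.
key_queue = deque(maxlen=3)
for batch_keys in (["k1", "k2"], ["k3", "k4"], ["k5"]):
    key_queue.extend(batch_keys)

theta_k = momentum_update([1.0, 1.0], [0.0, 2.0], m=0.9)
```

Because the queue length is independent of the batch size, the dictionary can be much larger than a batch, and the slow update keeps old and new keys comparable.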

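Objective: [[InfoNCE]]
- $\mathcal{L}_{q}=-\log \frac{\exp \left(q \cdot k_{+} / \tau\right)}{\sum_{i=0}^{K} \exp \left(q \cdot k_{i} / \tau\right)}$
- The dot product of q and k measures the similarity between two samples
- $\tau$ controls the shape of the distribution; the smaller it is, the more the loss concentrates on hard samples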
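
A small sketch of the loss above for one query. The function names are illustrative, the default `tau=0.07` is only an assumed value, and the log-sum-exp is computed with a max shift for numerical stability.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def info_nce(q, k_pos, negatives, tau=0.07):
    """-log( exp(q.k+ / tau) / sum_i exp(q.k_i / tau) ), positive key included in the sum."""
    logits = [dot(q, k_pos) / tau] + [dot(q, k) / tau for k in negatives]
    mx = max(logits)
    log_denominator = mx + math.log(sum(math.exp(l - mx) for l in logits))
    return log_denominator - logits[0]


loss = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]], tau=1.0)
sharp = info_nce([1.0, 0.0], [1.0, 0.0], [[0.0, 1.0]], tau=0.1)
```

With a single negative this reduces to a two-class softmax cross-entropy where the positive key is the correct class; lowering `tau` sharpens the softmax.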

Experiments

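7 detection and segmentation tasks
linear protocol
- Pre-train the model; when applying it, only the final fully connected layer changes.
- The backbone serves as a feature extractor
- Compares how good the pre-training is

Conclusion

With 1000 times more data, MoCo's improvement is modest
Try other pretext tasks from NLP, such as masked auto-encoding
- [[Masked-Language Modeling]]
- See [[Masked Autoencoders Are Scalable Vision Learners]]
Summary
- Build a large dictionary so that positives and negatives can be contrasted more effectively, providing a stable self-supervised signal with which to train the model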

Paper

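2024-10-05 (updated 2024-10-05) · 随手记 · 3 min read (about 419 characters)

NFM

$\hat{y}_{N F M}(\mathbf{x})$ :-> $w_{0}+\sum_{i=1}^{n} w_{i} x_{i}+f(\mathbf{x})$

The first and second terms are linear regression
The third term adds a neural network that learns :-> high-order interactions in the data
- Network input :-> the second-order feature interactions of the FM model
- Compared with using a higher-order FM directly :-> lower training complexity and faster training.
  The neural network part of NFM has 4 layers: Embedding Layer, Bi-Interaction Layer, Hidden Layers and Prediction Score.

tags:: #[[Model Architecture]]

The Embedding Layer embeds the sparse input. The most common embedding operation is a lookup in a weight table; the authors stress that in this step they multiply the values in the Input Feature Vector by the embedding vectors.

The Bi-Interaction Layer is the paper's novelty: it takes the element-wise product of every pair of embedded features and sums the results into a k-dimensional vector (k is the embedding size). This amounts to second-order feature crossing; as with FM, the formula can be simplified: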

$f_{B I}\left(\mathcal{V}_{x}\right)=\sum_{i=1}^{n} \sum_{j=i+1}^{n} x_{i} \mathbf{v}_{i} \odot x_{j} \mathbf{v}_{j} =\frac{1}{2}\left[\left(\sum_{i=1}^{n} x_{i} \mathbf{v}_{i}\right)^{2}-\sum_{i=1}^{n}\left(x_{i} \mathbf{v}_{i}\right)^{2}\right]$
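
The simplification can be checked numerically. The sketch below computes both sides of the identity on a toy input; the function names and the numbers are illustrative.

```python
def bi_interaction_naive(x, v):
    """Sum over all pairs i < j of (x_i v_i) element-wise-times (x_j v_j)."""
    k = len(v[0])
    out = [0.0] * k
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            for f in range(k):
                out[f] += (x[i] * v[i][f]) * (x[j] * v[j][f])
    return out


def bi_interaction_fast(x, v):
    """0.5 * ((sum_i x_i v_i)^2 - sum_i (x_i v_i)^2), squared element-wise."""
    k = len(v[0])
    out = []
    for f in range(k):
        s = sum(x[i] * v[i][f] for i in range(len(x)))
        s2 = sum((x[i] * v[i][f]) ** 2 for i in range(len(x)))
        out.append(0.5 * (s * s - s2))
    return out


x = [1.0, 0.0, 2.0]
v = [[1.0, 2.0], [0.5, -1.0], [3.0, 0.0]]
naive = bi_interaction_naive(x, v)
fast = bi_interaction_fast(x, v)
```

The naive form loops over all feature pairs, while the simplified form needs one pass over the features per embedding dimension.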
The Hidden Layers use an ordinary DNN to learn high-order feature interactions

The Prediction Layer outputs the final result:

Results: ![](https://media.xiang578.com/15643059963915.jpg) tags:: #HOFM

Paper, Algorithm

2024-10-05 (updated 2024-10-05) · 随手记 · read in a few seconds (about 27 characters)

Neural Collaborative Filtering vs. Matrix Factorization Revisited

Ref

点积 vs. MLP：推荐模型到底用哪个更好？ - 知乎 (zhihu.com)

Paper, CTR

2024-10-05 (updated 2024-10-05) · 随手记 · read in a few seconds (about 16 characters)

Neural Machine Translation by Jointly Learning to Align and Translate

Mimics the mechanism of human vision

Paper, Attention

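2024-10-05 (updated 2025-04-29) · 随手记 · 6 min read (about 842 characters)

Parameter Server

What is PS?

Runs gradient descent in a distributed way to update the parameters until convergence
Like [[Spark MLib]], a parallel training scheme: data-parallel training produces local gradients, which are then aggregated to update the parameter weights

Challenges for a parameter server

The large bandwidth needed to access parameters
Algorithms need ordered parameter updates, and synchronization between servers adds latency
Fault tolerance during training

Key features of PS

Asynchronous communication
Flexible consistency models
Elastic scalability
Fault tolerance
Ease of use

Parallel gradient descent workflow

Task manager
- Distributes data to the workers
Worker
- Initialization
  - Load the training data
  - Pull parameters from the server nodes
- Iteration
  - Compute gradients on the local training data
  - Push the computed gradients to the server nodes
  - Pull the latest parameters that are needed
Servers
- Aggregate the gradients computed by the m workers into a total gradient
- Compute the new parameters from the total gradient and the gradient of the regularization term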
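
The worker and server roles above can be simulated in a single process. This is a sketch under simplifying assumptions: one server, synchronous rounds, a least-squares toy problem, and worker gradients that are averaged rather than summed; the class and function names are hypothetical and no real parameter-server library is used.

```python
class Server:
    """Holds the parameters and applies aggregated gradients."""

    def __init__(self, dim, lr=0.1, l2=0.0):
        self.w = [0.0] * dim
        self.lr = lr
        self.l2 = l2

    def pull(self):
        return list(self.w)

    def push(self, worker_grads):
        m = len(worker_grads)
        for i in range(len(self.w)):
            total = sum(g[i] for g in worker_grads) / m  # aggregate the m workers
            self.w[i] -= self.lr * (total + self.l2 * self.w[i])  # plus regularizer gradient


def worker_gradient(w, shard):
    """Least-squares gradient on this worker's local shard of (x, y) pairs."""
    grad = [0.0] * len(w)
    for x, y in shard:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for i, xi in enumerate(x):
            grad[i] += err * xi / len(shard)
    return grad


def train(shards, dim, rounds=50):
    server = Server(dim)
    for _ in range(rounds):
        # Each worker pulls the current parameters and computes a local gradient.
        grads = [worker_gradient(server.pull(), shard) for shard in shards]
        server.push(grads)
    return server.pull()


shards = [[([1.0], 2.0), ([2.0], 4.0)], [([3.0], 6.0)]]  # data generated by y = 2x
w = train(shards, dim=1)
```

Every round here waits for all workers, which corresponds to the synchronous mode; the asynchronous modes discussed in these notes relax exactly that wait.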

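Physical architecture

Servers are managed as a server group; each server maintains part of the parameters
- Supports custom gradient update rules
Range Push and Pull
- Update parameters in small batches

Trade-off between consistency and parallel efficiency

Synchronous, blocking
- Wait until the master has aggregated all gradients and recomputed the model parameters before starting the next round
Asynchronous, non-blocking
- The rounds of updates are independent of each other
Bounded delay
- When the new parameters have not arrived, compute gradients with the old ones
- After a specified X iterations, a worker must wait for the parameter update

Vector Clock

Records the parameter timestamp for each range of each worker

Coordination and efficiency across multiple server nodes

Servers are managed as a server group; each server maintains part of the parameters
- [[一致性哈希]] determines each parameter's position and the server assigned to it
- Server S1 computes its own parameters and also backs up the parameters of the next few servers
- Adding a node amounts to splitting a range
- When a node is removed, a neighboring node takes over

The server broadcasts after aggregating the results of multiple workers

Summary

Replace synchronous blocking distributed gradient descent with an asynchronous non-blocking strategy
Use a multi-server architecture to avoid the bandwidth and memory bottlenecks of a single master node
Use engineering techniques such as [[一致性哈希]], range pull and range push to transmit as little as possible, avoiding the global network congestion and wasted bandwidth caused by broadcasts

Questions

How do the workers synchronize with each other?
How to handle DNN weights
- AllReduce synchronizes the dnn weights across the worker nodes
- feature embedding uses the fully asynchronous ASP mode; dnn weights use the fully synchronous BSP mode

Parallel strategies between workers

BSP（Bulk Synchronous Parallel） #card

SSP（Stale Synchronous Parallel）#card

ASP（Asynchronous Parallel）#card

Implementations

Distributed algorithms built on ps-lite
Alibaba's XDL
Kuaishou's Persia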

Ref

Paper

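2024-10-05 (updated 2024-10-05) · 随手记 · 2 min read (about 280 characters)

ResNet

Simply stacking convolutional layers does not make the model perform better.

vanishing/exploding gradients

Layers close to the input suffer from vanishing gradients and are hard to train; connect them to layers closer to the output.
Use a Residual Block

Deep Residual Learning for Image Recognition

Learning the residual mapping is easier than learning the original unreferenced mapping
The identity mapping gives the model shortcuts; if a block's input and output sizes differ, a projection with parameters w converts between them
The second nonlinearity is applied after the addition

bottleneck architectures
Addresses the growth in parameters as layers are added. The bottleneck structure keeps the parameter count about the same as the block on the left, while the block becomes 3 layers deep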

[[Identity Mappings in Deep Residual Networks]]

[[Residual Networks Behave Like Ensembles of Relatively Shallow Network]]

[[ResNet/Question]]

[[Ref]]

残差网络解决了什么，为什么有效？ - 知乎
给妹纸的深度学习教学(4)——同Residual玩耍 - 知乎 (explains why the ReLU in ResNet is placed where it is)
对ResNet本质的一些思考 - 知乎 (zhihu.com)

Paper, Algorithm

2024-10-05 (updated 2024-10-05) · 随手记 · read in a few seconds (about 70 characters)

RealFormer

Layer Normalization

On Layer Normalization in the Transformer Architecture

PostLN

RealFormer：把残差转移到Attention矩阵上面去 - 科学空间|Scientific Spaces

Which Training Methods for GANs do actually Converge?
- Residuals accumulate at every step and the variance grows large, so $$x+f(x)$$ is changed to $$x+\alpha f(x)$$

Paper, Algorithm

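2024-10-05 (updated 2024-10-05) · 随手记 · 2 min read (about 273 characters)

TCN

In a TCN the input and output may have different widths; panel (c) of the figure shows a 1×1 convolution adjusting the input size
- Channels can also be added directly by zero padding

TCN = 1D FCN + causal convolutions

Properties

Causal convolutions are used, so no future information leaks.
- The paper stresses the comparison with RNN-style methods, so causality has to be respected.
It takes a sequence of any length and maps it to an output sequence of the same length.
Combining [[ResNet]] with dilated convolutions lets the network go deep and enlarges the receptive field.

Details

There are no pooling layers in TCN
The normalization method is weight norm, which suits sequence problems better

Ways to enlarge the receptive field

A larger kernel_size (more parameters; large kernels work poorly, and an overly large kernel degenerates into a fully connected layer)
[[空洞卷积]]

Sequence problems

1. Input and output have the same size
1. Information from time steps that have not happened yet must not be used: causal convolution

[[ETA 模型]] implementation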

tf.nn.conv1d(input, filters, stride, padding, data_format='NWC', dilations=None, name=None)
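
What the causal, dilated convolution computes can be shown without a framework. The sketch below handles a single channel as a plain list; the function name and the left zero-padding scheme are illustrative assumptions and not the `tf.nn.conv1d` implementation.

```python
def causal_conv1d(x, kernel, dilation=1):
    """1D causal convolution: output[t] only depends on x[t], x[t - d], x[t - 2d], ..."""
    k = len(kernel)
    pad = (k - 1) * dilation
    padded = [0.0] * pad + list(x)  # pad on the left only, so the length is preserved
    return [
        sum(kernel[j] * padded[t + j * dilation] for j in range(k))
        for t in range(len(x))
    ]


x = [1.0, 2.0, 3.0, 4.0]
y1 = causal_conv1d(x, [1.0, 1.0], dilation=1)  # x[t-1] + x[t]
y2 = causal_conv1d(x, [1.0, 1.0], dilation=2)  # x[t-2] + x[t]

# Changing the last input must not change earlier outputs (no future leakage).
y2_changed = causal_conv1d([1.0, 2.0, 3.0, 99.0], [1.0, 1.0], dilation=2)
```

Raising the dilation widens the receptive field without adding kernel weights, and the left-only padding keeps the output the same length as the input.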

Paper, Algorithm, Google, CNN

2024-10-05 (updated 2024-10-05) · 随手记 · read in a few seconds (about 49 characters)

Towards Understanding Ensemble, Knowledge Distillation, and Self-Distillation in Deep Learning

Ref

Paper

2023-07-19 (updated 2024-10-05) · 随手记 · 6 min read (about 967 characters)

@A Consumer Compensation System in Ride-hailing Service

[[Attachments]]

A Consumer Compensation System in Ride-hailing Service_2023_Yu.pdf

A subsidy system for designated-driver and freight services

Price elasticity modeling: a transfer learning enhanced uplift modeling is designed to measure the elasticity
Budget allocation: a model predictive control based optimization is formulated to control the budget accurately

System goal: maximize platform revenue through subsidies while staying within the budget.

Given a total compensation budget or an average compensation rate, find an optimal policy to subsidize queries so that the overall revenue is maximized.

Challenges

How to model user elasticity from historical data (Consumer elasticity)
The fairness principle under the Personal Information Protection Law: different users get the same subsidy for the same odt (Consumer fairness)
How to model the random order requests that arrive online (Randomness in queries)

Transfer Learning Enhanced Uplift Modeling

Training an uplift model conventionally needs a large amount of response data under randomized subsidies (expensive). This paper trains the model on a large amount of online observational data (biased, shaped by the online policy) plus a small amount of randomized-subsidy data.
DNN + GBDT: handles tabular input space and transfer learning
- Over 90% of the features are dense numerical features, which calls for GBDT, but GBDT is hard to fine-tune on new data and handles sparse features poorly.
- Train an s-learner model
  - Two XGB models are trained on observational data and on randomized (RCT) data respectively, each for binary classification (whether the user places an order).
  - The data passes through both XGB models to get leaf information, then through an embedding layer; the two embeddings are concatenated and fed to the inner layers.
    - First the whole network is trained on observational data: Massive observational data is first fed into both inputs to pre-train the model
    - RCT data trains a separate output layer: RCT data is used to fine-tune using a different output layer
    - Early stopping is used during fine-tuning

Optimization Formulation

Orders are clustered into OD grid cells
- The mean elasticity of the historical orders in a cell serves as the cell's elasticity: use the mean of the historical query-wise elasticity to forecast the class-wise elasticity
model predictive control (MPC) technology
The allocation is formulated and solved as an optimization problem

Online system: a subsidy dictionary is generated offline for online use

Offline experiments

Uplift model
- Features
  - The features include query information (e.g., the origin, destination, time, weekday, and distance), spatial features (e.g., point of interest information, and order statistics in the same cells), subsidy information, and trading features (e.g., historical order placement rate).
- Model details
  - ob data xgb: 35 trees, 1120 leaf nodes
  - rct xgb: 51 trees, 1314 leaf nodes
  - embedding size 8
  - The size of the common inner layers and output layer is set to 128, 64, and 32
- Result analysis
  - T-XGB+DNN beats S-XGB+DNN on AUUC; does this mean two trees are needed to extract the features?
    - S-XGB+DNN: a single GBDT distiller DNN
    - T-XGB+DNN: two-distiller GBDT distiller DNN
Evaluation of the optimization results
- Assume the uplift model's output is the ground truth and evaluate the impact of different allocation strategies.
- No Cluster Oracle: orders are not clustered, and user features are taken into account.
- Open Loop: use the previous 14 days of data to predict the following 7 days
- The new system has a lower subsidy rate but higher revenue: Compared with the baseline, our system obtains a lower subsidy rate and higher revenue, for its accurate compensation, to achieve a higher ROI.

Open questions

Why not the conventional way of building an uplift model (treatment group + blank control group)?
How exactly are T-XGB and S-XGB trained?
Why does the rct model have more trees than the ob model? In terms of samples, the ob model has more
No pure-xgb uplift baseline is reported

Paper, SIGIR/2023, Dynamic Pricing, 2024, 已读

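2023-03-18 (updated 2023-03-19) · 智能路 · 18 min read (about 2691 characters)

[Time Series Forecasting] Are Transformers Effective for Time Series Forecasting?

A paper by Ailing Zeng of the Chinese University of Hong Kong: on long-term time series forecasting, linear models beat Transformer-based models, and the paper analyzes the capabilities of existing models experimentally (seven probing questions; strongly recommended reading).

Paper, Transformer, Time Series Forecasting

2023-03-12 (updated 2023-03-12) · 智能路 · 12 min read (about 1819 characters)

[DiDi HierETA] Interpreting Trajectories from Multiple Views A Hierarchical Self-Attention Network for Estimating the Time of Arrival

An ETA paper published by DiDi and South China University of Technology at KDD 2022. It interprets trajectories from multiple views and models them with a Hierarchical Self-Attention Network, improving the metrics on DiDi's internal dataset.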

Paper, eta, didi, self-attention, KDD

2022-10-17 (updated 2025-03-15) · 随手记 · 20 min read (about 2956 characters)

@DuETA: Traffic Congestion Propagation Pattern Modeling via Efficient Graph Learning for ETA Prediction at Baidu Maps

[[Attachments]]

DuETA_2022_Huang.pdf

Core contributions

Novelty: a route-aware graph transformer captures long-distance correlations in a congestion-sensitive graph and models congestion propagation patterns
- The design of DuETA is driven by the novel ideas that directly capture the long-distance correlations through a congestion-sensitive graph, and that model traffic congestion propagation patterns via a route-aware graph transformer.
- Captures the interactions between any two segments that are far apart but highly correlated in traffic conditions
  - These designs enable DuETA to capture the interactions between any two road segment pairs that are spatially distant but highly correlated with traffic conditions.
Learning traffic congestion propagation patterns effectively improves ETA prediction
- traffic congestion propagation patterns

Core problem

Business need
- Inconsistency between the predicted future traffic condition and the real one makes ETA errors propagate: we observed that a propagation of ETA errors arises from the sharp inconsistency between the predicted traffic condition in the future and ground truth.
- Modeling the traffic congestion propagation pattern
  - Traffic congestion propagation pattern modeling is challenging, and it requires accounting for impact regions over time and cumulative effect of delay variations over time caused by traffic events on the road network.
  - A currently congested segment affects the capacity of adjacent roads on the network: As illustrated in it, the impact regions and cumulative delays over time caused by traffic congestion (the road segments in red) would inevitably affect all the interdependent segments on the road network.
    - When the user requests the ETA, only the 3-hop segment is congested; because congestion propagates, by the time the user reaches the target the 2-hop segment is congested and part of the 1-hop segment is slow
    - Earlier work used [[STGNN]]-style methods to model directly adjacent segments: existing studies have applied spatial-temporal graph neural networks (STGNNs) [7, 8, 21, 34, 35, 38] to model traffic conditions
      There are two problems
      - Long-distance correlations between segments that are not adjacent on the road network are not modeled directly, so information is lost as it propagates through the network: The long-distance correlations of indirectly connected road segments are not explicitly modeled, which inevitably suffer from information loss during the multi-step message passing.
      - Because of the computational complexity of STGNN methods, they usually run very few message-passing steps, so traffic-condition features do not travel well between two distant segments: Traffic conditions are not sufficiently transmitted between two road segments that are spatially distant, because they typically execute only a few steps of message passing (one step in most cases), due to the computational complexity of STGNNs.
Challenges
- The ETA task needs to model contextual and predictive factors, such as spatial-temporal interaction, driving behavior, and traffic congestion propagation inference
- New segments and unseen regions in the road network: we plan to investigate the transferability of our model to deal with unseen road segments or regions.
- Influence of the POIs along the route: Second, given the observation that the travel times of some routes have a considerable correlation with the POIs distributed along the roads.
  - Specific places at specific times
  - Effect of POI-dense areas on ETA prediction: To address this issue, we plan to utilize the POI retrieval system [5, 11, 13] as an auxiliary tool to forecast which POIs would be densely populated and how extensively they would affect the ETA prediction.
  - TODO: look up the POI-related work
    - MetaLearned Spatial-Temporal POI Auto-Completion for the Search Engine at Baidu Maps.
    - Personalized Prefix Embedding for POI Auto-Completion in the Search Engine of Baidu Maps
    - HGAMN: Heterogeneous Graph Attention Matching Network for Multilingual POI Retrieval at Baidu Maps

Related work

ETA methods
- segment-based methods
  - computationally efficient and scalable
  - do not account for the information of the travel route
- end-to-end methods
  - Earlier methods do not model congestion propagation well enough: most existing methods are inefficient for modeling the traffic congestion propagation patterns along the route.
STGNN [[Traffic Flow Forecasting]]
- Adding GNN layers enlarges the receptive field too quickly: increasing the depth of a GNN often means exponential expansion of the neighbor scope
- Subgraph: properly extracted subgraph
[[@ConSTGAT: Contextual Spatial-Temporal Graph Attention Network for Travel Time Estimation at Baidu Maps]] models spatial-temporal relations
[[@SSML: Self-Supervised Meta-Learner for En Route Travel Time Estimation at Baidu Maps]] models driver behavior

Solution

traffic conditions are dynamic features
- Traffic features of the past hour, one bucket per 5 minutes, 12 in total: The traffic conditions of the past one hour are collected as features, which are divided into 12 time slots (5 minutes per time slot)
- median speed, max speed, min speed, mean speed, and record counts as features
Congestion-sensitive Graph $\mathcal{G}^{C S}=\left(\mathcal{L},\left.\left\{\mathcal{E}_r^f\right\}\right|_{r=1} ^5, \mathcal{E}^h\right)$
- we construct a congestion-sensitive graph based on the correlations of traffic patterns.
- For a given link, find its first-order neighbor links and its high-order neighbor links (whose traffic conditions may be related to the current link)
  - we take advantage of the first-order neighbor links, as well as the high-order neighbor links whose traffic patterns are highly correlated to that of link 𝑙
- First-order Neighbors
  - [[ConSTGAT]]: different neighbor links influence the current link differently
    - The traffic condition of the current link may be influenced more by downstream links than by upstream links: the traffic congestion is more likely to propagate from downstream links to upstream links.
  - Procedure
    - Define several types of relations between links and take them into account when building the graph: define multiple types of link relations and incorporate these relations into the construction of the congestion-sensitive graph
    - Handle each relation with its own attention to capture the influence: use attention mechanism separately for each relation to capture the impact of neighbor links,
    - An edge describes the relation between two links; there are 5 types in total
      - 2 is the upstream link
      - 3 is the downstream link
      - The remaining three types of link are not on the route, but their traffic conditions may affect the target link (vehicles blocking the intersection)
- High-order Neighbors
  - Indirectly connected links matter too: the long-distance associations between indirectly connected links are also crucial for ETA prediction
  - How to sample from the high-order neighbors?
    - For each link, take its 2-hop to 5-hop neighbor links from the historical routes
    - Compute the Pearson correlation between the link and each neighbor link
      $c^r_{i,j} = \frac{\operatorname{cov}\left(T_1, T_2\right)}{\sigma_{T_1} \sigma_{T_2}}$
      - For each link, take the series of mean traversal times per 5 minutes over the past 2 hours: $T_1=\left[t_1^0, t_1^1, \cdots, t_1^{23}\right]$ and $T_2=\left[t_2^0, t_2^1, \cdots, t_2^{23}\right]$
    - Sum the correlation coefficients of the same link pair over different routes to get $c^{final}_{i,j}$
    - For each link, keep the top-5 neighbor links by correlation coefficient
  - A high-order edge connects a link and its high-order neighbor links: high-order edge is defined as an edge that connects a link and one of its high-order neighbor links.
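
The high-order neighbor sampling above can be sketched as a Pearson correlation followed by a top-k selection. The sketch simplifies the procedure in the notes: it uses a single series per link and skips the accumulation over routes; the function names and the toy series are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation: cov(a, b) / (std(a) * std(b)); 0.0 if a series is constant."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    std_a = math.sqrt(sum((x - mean_a) ** 2 for x in a))
    std_b = math.sqrt(sum((y - mean_b) ** 2 for y in b))
    if std_a == 0.0 or std_b == 0.0:
        return 0.0
    return cov / (std_a * std_b)


def top_k_neighbors(link, candidates, series, k=5):
    """Keep the k candidate links whose travel-time series correlate most with `link`."""
    scored = sorted(((pearson(series[link], series[c]), c) for c in candidates), reverse=True)
    return [c for _, c in scored[:k]]


# Toy travel-time series (4 slots instead of the 24 five-minute slots in the notes).
series = {
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],   # moves together with "a"
    "c": [4.0, 3.0, 2.0, 1.0],   # moves opposite to "a"
    "d": [1.0, 1.0, 1.0, 1.0],   # constant
}
neighbors = top_k_neighbors("a", ["b", "c", "d"], series, k=2)
```

Links whose travel times rise and fall together rank first, even when they are several hops apart on the road network.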
[[Graph Transformer]] Masked Label Prediction: Unified Message Passing Model for Semi-Supervised Classification
- Multi-head attention learns the edge weights: It adopts the multi-head attention mechanism [23] to learn edge weights.
  - Compute an attention score for each edge
    - $\begin{gathered}\mathbf{q}_{c, i}=\mathbf{W}_c^Q \mathbf{x}_i+\mathbf{b}_c^Q, \\ \mathbf{k}_{c, j}=\mathbf{W}_c^K \mathbf{x}_j+\mathbf{b}_c^K, \\ \mathbf{v}_{c, j}=\mathbf{W}_c^V \mathbf{x}_j+\mathbf{b}_c^V, \\ \alpha_{c, i, j}=\frac{\left\langle\mathbf{q}_{c, i}, \mathbf{k}_{c, j}\right\rangle}{\sum_{k \in \mathcal{N}(i)}\left\langle\mathbf{q}_{c, i}, \mathbf{k}_{c, k}\right\rangle},\end{gathered}$
  - Compute the representation of link i
    - $\mathbf{h}_i=\mathbf{x}_i+\frac{1}{C} \sum_{c=1}^C \sum_{j \in \mathcal{N}(i)} \alpha_{c, i, j} \mathbf{v}_{c, j}$
- Residual connections address the oversmoothing problem of [[GNN]]: It addresses the oversmoothing problem in vanilla GNNs by residual connections.
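
A single-head sketch of the aggregation above (C = 1). It assumes that $\langle\mathbf{q}, \mathbf{k}\rangle$ denotes the exponential scaled dot product $\exp(\mathbf{q}^T\mathbf{k}/\sqrt{d})$, as in the graph transformer paper cited above, and it takes already projected queries, keys and values as input; the function name and toy vectors are illustrative.

```python
import math


def attend(x_i, q_i, neighbor_keys, neighbor_values):
    """Return (h_i, alphas) with h_i = x_i + sum_j alpha_ij * v_j over the neighbors of i."""
    d = len(q_i)
    scores = [
        math.exp(sum(a * b for a, b in zip(q_i, k)) / math.sqrt(d))
        for k in neighbor_keys
    ]
    total = sum(scores)
    alphas = [s / total for s in scores]  # normalized over the neighbors of i
    aggregated = [
        sum(a * v[t] for a, v in zip(alphas, neighbor_values))
        for t in range(len(x_i))
    ]
    h_i = [x + m for x, m in zip(x_i, aggregated)]  # residual connection
    return h_i, alphas


h, alphas = attend(
    x_i=[1.0, 1.0],
    q_i=[1.0, 0.0],
    neighbor_keys=[[0.0, 0.0], [0.0, 0.0]],
    neighbor_values=[[2.0, 0.0], [0.0, 2.0]],
)
```

With identical keys the two neighbors get equal weight, so the link representation is its own feature plus the mean of the neighbor values.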
route-aware graph transformer

tags:: #[[Model Architecture]] #[[Graph Transformer]]

+ The rebuilt graph $\mathcal{G}^{C S}=\left(\mathcal{L},\left.\left\{\mathcal{E}_r\right\}\right|_{r=1} ^6\right)$ has six types of edges; it is split into six subgraphs, each handled by its own transformer

  + $\mathbf{h}_i=\mathbf{x}_i+\frac{1}{6 C} \sum_{r=1}^6 \sum_{c=1}^C \sum_{j \in \mathcal{N}_r(i)} \alpha_{c, i, j}^{(r)} \mathbf{v}_{c, j}^{(r)}$

+ The earlier graph transformer cannot tell whether a link is on the route, so it cannot produce different representations for the two cases

  + the graph transformer is unable to identify whether a link belongs to a given route or not