Tag: Google - 算法花园

2024-10-05 2024-12-19 Quick notes 7 min read (about 1058 characters)

BERT

[[@BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding]]

Code: google-research/bert: TensorFlow code and pre-trained models for BERT

A large pre-trained model plus fine-tuning improves results on small downstream tasks

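Input layer

Token embedding, position embedding, and segment embedding
- The pre-training tasks include judging the relationship between segment A and segment B
Model structure: 12 layers, each with 12 attention heads
[CLS] marks the start of the input; its final output embedding represents the whole sentence
- A symbol with no semantic content of its own blends the semantics of every word in the text more evenly, so it represents the sentence as a whole better
[SEP] separates sentences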
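
The packing described above can be sketched as follows. This is a minimal illustration that assumes pre-split tokens rather than real WordPiece output; `pack_pair` is a hypothetical helper name, not a function from google-research/bert.

```python
def pack_pair(tokens_a, tokens_b=None):
    """Pack one or two token lists into BERT-style inputs (illustrative only)."""
    # [CLS] opens the input; the first [SEP] closes segment A.
    tokens = ["[CLS]"] + list(tokens_a) + ["[SEP]"]
    segment_ids = [0] * len(tokens)  # segment A covers [CLS] and the first [SEP]
    if tokens_b is not None:
        # Segment B gets id 1, including its closing [SEP].
        tokens += list(tokens_b) + ["[SEP]"]
        segment_ids += [1] * (len(tokens_b) + 1)
    # Position ids simply count tokens from left to right.
    position_ids = list(range(len(tokens)))
    return tokens, segment_ids, position_ids


tokens, segment_ids, position_ids = pack_pair(["my", "dog"], ["he", "barks"])
print(tokens)
print(segment_ids)
print(position_ids)
```

Each of the three id lists indexes its own embedding table, and the three looked-up vectors are added per position.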

BERT

BERT-base: L=12, H=768, A=12, Total Parameters=110M
BERT-large: L=24, H=1024, A=16, Total Parameters=340M
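
The two totals above can be roughly reproduced from L and H alone. The sketch below is a back-of-the-envelope count, not the official accounting: it assumes a vocabulary of 30522 WordPieces, 512 positions, 2 segment types, a feed-forward width of 4H, and a pooler layer, and it leaves out the pre-training heads. `bert_param_count` is a hypothetical helper.

```python
def bert_param_count(L, H, vocab_size=30522, max_positions=512, type_vocab=2):
    """Approximate encoder parameter count (all sizes other than L and H are assumptions)."""
    ffn = 4 * H  # assumed feed-forward width
    # Three embedding tables plus one LayerNorm (scale and bias).
    embeddings = (vocab_size + max_positions + type_vocab) * H + 2 * H
    # Query, key, value, and output projections, each with a bias.
    attention = 4 * (H * H + H)
    # Two dense layers: H -> 4H and 4H -> H, each with a bias.
    feed_forward = (H * ffn + ffn) + (ffn * H + H)
    # Two LayerNorms per layer.
    layer_norms = 2 * (2 * H)
    # Dense layer applied to the [CLS] output.
    pooler = H * H + H
    return embeddings + L * (attention + feed_forward + layer_norms) + pooler


base = bert_param_count(L=12, H=768)
large = bert_param_count(L=24, H=1024)
print(f"{base / 1e6:.1f}M, {large / 1e6:.1f}M")
```

Under these assumptions the counts come out near 109.5M and 335.1M, consistent with the rounded 110M and 340M. The number of heads A does not appear because splitting H across heads leaves the projection sizes unchanged.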

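Two kinds of NLP pre-training

1. Produce an artifact that is used directly, e.g. word2vec embeddings
1. Serve as a backbone to which new structures are attached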

[[ELMo]]

Uses LSTMs to predict the next word

[[GPT]]

Transformer
Unidirectional


Contributions

The importance of bidirectional information

Model input:

1. Token emb
1. Segment emb (A, B), for QA or other two-sentence tasks
1. Position emb

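Training method

[[Masked-Language Modeling]] :-> mask some of the tokens: of the selected tokens, 80% become [MASK], 10% are replaced with a wrong (random) token, and 10% keep the correct token
- Purpose :-> train the model to learn the relationships between the words of a sentence from context on both sides.
  - The 80/10/10 split lessens the impact of the mismatch between the pre-training and fine-tuning objectives
[[Next Sentence Prediction]] :-> predict whether sentence B is the sentence that follows sentence A
- Sentence B is the true continuation of sentence A with 50% probability
- Which downstream problems does this help with :-> QA and natural language inference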
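
The masking rule for masked language modeling can be sketched as below. This is a simplified illustration, not the reference implementation: it selects each token independently with probability 15% (an assumption; the 15% figure comes from the notes further down), and `mask_tokens` is a hypothetical helper.

```python
import random


def mask_tokens(tokens, vocab, mask_prob=0.15, seed=0):
    """Return (masked_tokens, labels); labels[i] is None where nothing is predicted."""
    rng = random.Random(seed)
    masked = list(tokens)
    labels = [None] * len(tokens)
    for i, tok in enumerate(tokens):
        # Never mask the special symbols; otherwise select with probability mask_prob.
        if tok in ("[CLS]", "[SEP]") or rng.random() >= mask_prob:
            continue
        labels[i] = tok  # the model must recover the original token here
        r = rng.random()
        if r < 0.8:
            masked[i] = "[MASK]"           # 80%: replace with [MASK]
        elif r < 0.9:
            masked[i] = rng.choice(vocab)  # 10%: replace with a random token
        # remaining 10%: keep the original token unchanged
    return masked, labels


sentence = ["[CLS]", "the", "cat", "sat", "on", "the", "mat", "[SEP]"]
masked, labels = mask_tokens(sentence, vocab=["dog", "ran", "hat"], seed=1)
print(masked)
print(labels)
```

Because a selected position may still show the original or a random token, the model cannot rely on [MASK] being present, which is what narrows the gap to fine-tuning.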
  
  [[Activation function]] [[GELU]]
Same as [[GPT]]; why?

Optimizer

An incomplete version of Adam (no bias correction)
Fine-tuning can be unstable with it; switch to the standard version of Adam
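
To make the difference concrete, here is a minimal sketch of one Adam step with and without bias correction. It is an illustration under assumed hyperparameters (lr, beta1, beta2, eps are not taken from the notes), it omits weight decay, and `adam_step` is a hypothetical helper.

```python
import numpy as np


def adam_step(param, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
              eps=1e-6, bias_correction=True):
    """One Adam update; t is the 1-based step index."""
    # Exponential moving averages of the gradient and its square.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad ** 2
    if bias_correction:
        # Standard Adam: undo the bias from initialising m and v at zero.
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
    else:
        # "Incomplete" Adam: use the raw moving averages.
        m_hat, v_hat = m, v
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)
    return param, m, v


p0, g, zeros = np.array([1.0]), np.array([0.5]), np.zeros(1)
p_full, _, _ = adam_step(p0, g, zeros, zeros, t=1, bias_correction=True)
p_raw, _, _ = adam_step(p0, g, zeros, zeros, t=1, bias_correction=False)
step_full = (p0 - p_full)[0]
step_raw = (p0 - p_raw)[0]
print(step_full, step_raw, step_raw / step_full)
```

With these defaults the uncorrected first step is roughly 3.2 times the corrected one, since the ratio at t=1 is (1 - beta1) / sqrt(1 - beta2). Early updates that are larger than intended are one plausible source of the instability noted above when fine-tuning runs for only a few steps.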

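Fine-tuning

Adjust the input and add a prediction head according to the task, then train on task-specific data
Fine-tuning works better than feeding BERT's outputs as features into another model
1. Sentence-pair classification
1. Single-sentence classification
- Softmax on top of [CLS]
1. Span extraction (e.g. QA): learn a start and an end embedding, take the softmax of their dot products with the token outputs T, and pick the highest-probability positions as the start and the end
1. Entity tagging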
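
The start/end prediction in the third item can be sketched as follows. This is a minimal illustration with assumed shapes (T is `seq_len x hidden`, the start and end vectors have length `hidden`); it takes the two argmax positions independently, as the note describes, and does not enforce end >= start. `predict_span` is a hypothetical helper.

```python
import numpy as np


def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()


def predict_span(T, start_vec, end_vec):
    """Score every token output against the learned start/end vectors."""
    p_start = softmax(T @ start_vec)  # probability of each position being the start
    p_end = softmax(T @ end_vec)      # probability of each position being the end
    return int(p_start.argmax()), int(p_end.argmax()), p_start, p_end


rng = np.random.default_rng(0)
T = rng.normal(size=(6, 4))      # 6 token outputs with hidden size 4 (toy sizes)
start_vec = rng.normal(size=4)   # stands in for the learned start embedding
end_vec = rng.normal(size=4)     # stands in for the learned end embedding
start, end, p_start, p_end = predict_span(T, start_vec, end_vec)
print(start, end)
```

Only the two vectors are new parameters; everything else is the pre-trained encoder.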

Studies the effect of taking different embeddings

Limitations

Not good at generation tasks (machine translation, text summarization)

[[Ref]]

[[Multimodal BERT]]
The Illustrated BERT, ELMo, and co. (How NLP Cracked Transfer Learning) – Jay Alammar – Visualizing machine learning one concept at a time
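How do you evaluate the BERT model? - Zhihu
NLP: BERT's adaptability and GPT's persistence, seen from the language-model perspective - Zhihu
- Why does a bidirectional language model like BERT need masked LM? Why has [[GPT]] always stuck with a unidirectional language model? ELMo also claims to be bidirectional, so why does it not need masking? Why does CBOW in [[Word2Vec]] not need masking either?
- indirectly see themselves
- GPT keeps the ability to generate the following text from the preceding text
Why can BERT's three embeddings be added together? - Zhihu
- Adding versus concatenating the three embeddings
  - Relationship :-> adding the three embeddings is equivalent to concatenating the three original one-hot vectors and passing them through one fully connected layer.
    - Advantage :-> compared with concatenation, addition saves model parameters.
    - Experiments show concatenation does not work as well as addition; concatenation increases the dimension, so another linear transform is needed to reduce it, which adds more parameters.
- My earlier understanding: it is like adding several waves of different wavelengths, which can still be separated afterwards, so the model should be able to tell the components apart too.
- The space is very high-dimensional, so the model can tell the components apart
  - Size of the parameter space: 30k × 2 × 512
  - The model's expressive capacity is at least 2^768
- From the gradient perspective, (f + g + h)' = f' + g' + h'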
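
The claim that adding the three embeddings equals concatenating the one-hot vectors and applying one fully connected layer can be checked numerically. The sketch below uses toy sizes and random tables (all assumptions for illustration); the fully connected weight is simply the three embedding tables stacked vertically.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab_size, n_segments, n_positions, hidden = 10, 2, 8, 4  # toy sizes, not BERT's

# Three separate embedding tables, as in the input layer.
E_tok = rng.normal(size=(vocab_size, hidden))
E_seg = rng.normal(size=(n_segments, hidden))
E_pos = rng.normal(size=(n_positions, hidden))


def one_hot(i, n):
    """Length-n vector with a single 1 at index i."""
    v = np.zeros(n)
    v[i] = 1.0
    return v


tok, seg, pos = 7, 1, 3

# View 1: look up each table and add the vectors.
summed = E_tok[tok] + E_seg[seg] + E_pos[pos]

# View 2: concatenate the one-hot vectors and apply one linear layer
# whose weight matrix is the three tables stacked on top of each other.
x = np.concatenate([one_hot(tok, vocab_size),
                    one_hot(seg, n_segments),
                    one_hot(pos, n_positions)])
W = np.vstack([E_tok, E_seg, E_pos])
projected = x @ W

print(np.allclose(summed, projected))
```

Both views use the same number of parameters; concatenating the looked-up vectors instead would triple the width and need an extra projection to get back to the hidden size.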
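BERT: easily overlooked details
- Detail 3: for task 1, 15% of the tokens in the data are selected at random; of these, 80% are replaced with [mask], 10% are left unchanged, and 10% are replaced with a random other word. Why? #card
  - [mask] never appears in fine-tuning tasks, so the model would not know how to handle inputs without it.
  - The split mitigates that problem
  - Only 15% of the tokens are predicted, so more training steps are needed to converge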

2024-10-05 2024-10-05 Quick notes 2 min read (about 273 characters)

TCN

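In a TCN the input and output of a block may have different widths; figure (c) shows a 1×1 convolution used to adjust the input size
- Channels can also be increased directly with zero padding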
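
Both ways of matching widths on the residual path can be sketched in a few lines. This is an illustration with assumed shapes (`x` is `time x channels`), and `residual_add` and `zero_pad_channels` are hypothetical helper names.

```python
import numpy as np


def residual_add(x, block_out, W_1x1=None):
    """Add the block input x (T, C_in) to the block output (T, C_out)."""
    if x.shape[1] != block_out.shape[1]:
        # A 1x1 convolution is the same linear map applied at every time step.
        x = x @ W_1x1
    return x + block_out


def zero_pad_channels(x, c_out):
    """Alternative: widen x to c_out channels by appending zero channels."""
    return np.pad(x, ((0, 0), (0, c_out - x.shape[1])))


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 3))          # 5 time steps, 3 input channels
block_out = rng.normal(size=(5, 8))  # the block produced 8 channels
W = rng.normal(size=(3, 8))          # stands in for learned 1x1 conv weights

print(residual_add(x, block_out, W).shape)
print((zero_pad_channels(x, 8) + block_out).shape)
```

Zero padding is the special case of the 1×1 convolution whose weight matrix is an identity block followed by zeros, so it adds no parameters.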

TCN = 1D FCN + causal convolutions

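Features

Uses causal convolutions, so no future information leaks.
- The paper emphasizes comparison with RNN-style methods, so causality has to be respected.
Can take a sequence of any length and map it to an output sequence of the same length.
Combining [[ResNet]]-style residual connections with dilated convolutions lets the network go deeper and enlarges the receptive field.

Details

TCN has no pooling layers
Normalization uses weight norm, which suits sequence problems better

Ways to enlarge the receptive field

A larger kernel_size (adds parameters; large kernels perform worse, and an overly large kernel degenerates into a fully connected layer)
[[Dilated convolution]]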
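
The trade-off between the two options can be made concrete with a small receptive-field calculation. The sketch assumes stride 1 and the same kernel size at every level; `receptive_field` is a hypothetical helper, and `convs_per_level` is an assumed knob for blocks that stack more than one convolution per dilation level.

```python
def receptive_field(kernel_size, dilations, convs_per_level=1):
    """Receptive field of stacked stride-1 dilated convolutions."""
    # Each convolution with dilation d widens the field by (kernel_size - 1) * d.
    return 1 + convs_per_level * (kernel_size - 1) * sum(dilations)


# Four levels with exponentially growing dilation and kernel size 3.
print(receptive_field(3, [1, 2, 4, 8]))
# The same four levels without dilation.
print(receptive_field(3, [1, 1, 1, 1]))
# One undilated layer needs a kernel as wide as the whole field.
print(receptive_field(31, [1]))
```

With dilations 1, 2, 4, 8 the field reaches 31 steps using kernels of size 3, while four undilated layers reach only 9, and a single layer would need a kernel of size 31.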

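Time-series problems

1. The input and output have the same size
1. Information from time steps that have not happened yet must not be used, hence causal convolution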
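
Both requirements follow from padding only on the left. The sketch below is a single-channel NumPy illustration written with explicit loops for clarity, not an efficient or official implementation; `causal_conv1d` is a hypothetical helper.

```python
import numpy as np


def causal_conv1d(x, w, dilation=1):
    """Causal dilated convolution: y[t] = sum_j w[j] * x[t - (k - 1 - j) * dilation]."""
    k = len(w)
    pad = (k - 1) * dilation
    # Pad only on the left so y[t] never reads x[t + 1] or later.
    x_padded = np.concatenate([np.zeros(pad), x])
    y = np.zeros(len(x))  # output has the same length as the input
    for t in range(len(x)):
        for j in range(k):
            y[t] += w[j] * x_padded[t + j * dilation]
    return y


x = np.array([1.0, 2.0, 3.0, 4.0])
w = np.array([1.0, 1.0])
print(causal_conv1d(x, w, dilation=1))
print(causal_conv1d(x, w, dilation=2))
```

With `w = [1, 1]`, each output is the current value plus the value `dilation` steps earlier, with zeros standing in for the time before the sequence starts.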

[[ETA model]] implementation

tf.nn.conv1d(input, filters, stride, padding, data_format='NWC', dilations=None, name=None)

Paper, Algorithm, Google, CNN

2021-10-26 2024-10-05 Quick notes 1 min read (about 205 characters)

@ETA Prediction with Graph Neural Networks in Google Maps

[[Abstract]]

Travel-time prediction constitutes a task of high importance in transportation networks, with web mapping services like Google Maps regularly serving vast quantities of travel time queries from users and enterprises alike. Further, such a task requires accounting for complex spatiotemporal interactions (modelling both the topological properties of the road network and anticipating events—such as rush hours—that may occur in the future). Hence, it is an ideal target for graph representation learning at scale. Here we present a graph neural network estimator for estimated time of arrival (ETA) which we have deployed in production at Google Maps. While our main architecture consists of standard GNN building blocks, we further detail the usage of training schedule methods such as MetaGradients in order to make our model robust and production-ready. We also provide prescriptive studies: ablating on various architectural decisions and training regimes, and qualitative analyses on real-world situations where our model provides a competitive edge. Our GNN proved powerful when deployed, significantly reducing negative ETA outcomes in several regions compared to the previous production baseline (40+% in cities like Sydney).

[[Attachments]]

ETA Prediction with Graph Neural Networks in Google Maps_2021_Derrow-Pinion.pdf

Paper, Google

2019-08-20 2024-10-05 Quick notes 1 min read (about 121 characters)

@TabNet: Attentive Interpretable Tabular Learning

[[Abstract]]

We propose a novel high-performance and interpretable canonical deep tabular data learning architecture, TabNet. TabNet uses sequential attention to choose which features to reason from at each decision step, enabling interpretability and more efficient learning as the learning capacity is used for the most salient features. We demonstrate that TabNet outperforms other neural network and decision tree variants on a wide range of non-performance-saturated tabular datasets and yields interpretable feature attributions plus insights into the global model behavior. Finally, for the first time to our knowledge, we demonstrate self-supervised learning for tabular data, significantly improving performance with unsupervised representation learning when unlabeled data is abundant.

[[Attachments]]

TabNet_2019_Arik_Pfister.pdf

Paper, Google, To read, Tabular Data
