BERT

[[@BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding]]

Code: google-research/bert: TensorFlow code and pre-trained models for BERT

Large pre-trained model + fine-tuning improves performance on small downstream tasks

Input layer

  • Token embedding, position embedding, and segment embedding

    • The pre-training tasks include judging the relationship between segment A and segment B
  • Model structure: 12 layers, each with 12 attention heads (multi-head)

  • [CLS] is placed at the start of the sentence; its final output embedding represents the whole sentence

    • A symbol with no semantics of its own blends the meanings of all the tokens in the text more evenly, and therefore represents the whole sentence better
  • [SEP] separates the two sentences

BERT model sizes

  • BERT-base: L=12, H=768, A=12, Total Parameters=110M (rough parameter accounting in the sketch below)

  • BERT-large: L=24, H=1024, A=16, Total Parameters=340M
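A rough, back-of-the-envelope check of where the ~110M figure for BERT-base comes from. This is a sketch only: biases, LayerNorm, and the pooler are ignored, and 30522 is the released English WordPiece vocabulary size.

```python
# Rough parameter count for BERT-base (L=12, H=768, A=12).
# Biases, LayerNorm, and the pooler are ignored, so the total is approximate.
V, P, S, L, H, FF = 30522, 512, 2, 12, 768, 4 * 768   # vocab, positions, segments, layers, hidden, FFN

embeddings = (V + P + S) * H                  # token + position + segment tables
per_layer = 4 * H * H + 2 * H * FF            # Q/K/V/output projections + feed-forward
total = embeddings + L * per_layer
print(f"{total / 1e6:.0f}M parameters")       # ~109M, close to the reported 110M
```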

Two ways of using NLP pre-training

    1. Feature-based: produce a standalone artifact, e.g. word2vec embeddings
    2. Fine-tuning: use the pre-trained model as a backbone and attach new task-specific structure

[[ELMo]]

  • Uses an LSTM to predict the next word

[[GPT]]

  • Transformer

  • Unidirectional (left-to-right only)


Contributions

  • The importance of bidirectional information

Model input (the three embeddings are summed element-wise; see the sketch after this list):

    1. Token embedding
    2. Segment embedding (A/B), for QA or other two-sentence tasks
    3. Position embedding
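A minimal sketch of how the three embeddings combine (hypothetical code, not the repo implementation): BERT simply sums them element-wise at each position.

```python
import tensorflow as tf

# Sizes matching BERT-base; the real tables live in google-research/bert.
VOCAB_SIZE, TYPE_VOCAB_SIZE, MAX_POSITIONS, HIDDEN = 30522, 2, 512, 768

token_emb = tf.keras.layers.Embedding(VOCAB_SIZE, HIDDEN)
segment_emb = tf.keras.layers.Embedding(TYPE_VOCAB_SIZE, HIDDEN)
position_emb = tf.keras.layers.Embedding(MAX_POSITIONS, HIDDEN)

def bert_input_embeddings(token_ids, segment_ids):
    """token_ids, segment_ids: int tensors of shape [batch, seq_len]."""
    positions = tf.range(tf.shape(token_ids)[1])               # [seq_len]
    # The three embeddings are summed element-wise (positions broadcast over the batch).
    return token_emb(token_ids) + segment_emb(segment_ids) + position_emb(positions)
```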

Training objectives

  • [[Masked-Language Modeling]] :-> mask 15% of the tokens; of those, 80% are replaced with [MASK], 10% with a random (wrong) token, and 10% are left unchanged (see the masking sketch at the end of this section)

    • Purpose :-> train the model to learn bidirectional context, i.e. the relationships among the words in a sentence.
      • The 80/10/10 split reduces the impact of the mismatch between the pre-training objective and fine-tuning, where [MASK] never appears
  • [[Next Sentence Prediction]] :-> predict whether sentence B actually follows sentence A

    • Sentence B is the true next sentence of A 50% of the time (and a random sentence otherwise)

    • Which downstream problems does it address :-> QA and natural language inference
      [[Activation function]] [[GELU]]

  • The same choice as [[GPT]]. Why?
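A minimal sketch of the masking recipe described above (toy token-level version; the real implementation works on WordPiece ids and also skips special tokens):

```python
import random

MASK_TOKEN = "[MASK]"
TOY_VOCAB = ["the", "cat", "sat", "on", "mat"]       # toy vocabulary, illustration only

def mlm_mask(tokens, mask_prob=0.15):
    """Pick ~15% of tokens as prediction targets; 80% -> [MASK], 10% -> random, 10% unchanged."""
    inputs, labels = list(tokens), [None] * len(tokens)
    for i, tok in enumerate(tokens):
        if random.random() < mask_prob:
            labels[i] = tok                          # the model must recover the original token
            r = random.random()
            if r < 0.8:
                inputs[i] = MASK_TOKEN               # 80%: replace with [MASK]
            elif r < 0.9:
                inputs[i] = random.choice(TOY_VOCAB) # 10%: replace with a random ("wrong") token
            # else: 10% keep the token unchanged, easing the pre-train / fine-tune mismatch
    return inputs, labels
```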

Optimizer

  • A stripped-down version of Adam (the BERT implementation omits Adam's bias-correction step)

  • Fine-tuning with it can be unstable; switch to the full, standard Adam (a minimal example follows)
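A minimal example, assuming a Keras fine-tuning setup rather than the original repo's optimizer: the standard tf.keras.optimizers.Adam includes bias correction, and 2e-5 is within the paper's recommended fine-tuning learning-rate range.

```python
import tensorflow as tf

# Full Adam (with bias correction) for fine-tuning; 2e-5 is a typical BERT fine-tuning LR.
optimizer = tf.keras.optimizers.Adam(learning_rate=2e-5, epsilon=1e-6)
```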

Fine-tuning

  • Adapt the input format and add a prediction head for the task, then train on task-specific data

  • Fine-tuning works better than plugging BERT into another model as fixed features

    1. Sentence-pair classification
    2. Single-sentence classification
    • A softmax on top of the final [CLS] output
    3. Span prediction (QA): learn a start and an end vector, compute a softmax of their dot products with each token representation T, and take the highest-probability positions as the answer span (see the head sketch after this list)
    4. Token-level tagging (e.g. named entity recognition)
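A minimal sketch of the prediction heads listed above. Shapes are assumptions: a pooled [CLS] vector of size HIDDEN and a sequence output of shape [batch, seq_len, HIDDEN]; NUM_LABELS is task-specific.

```python
import tensorflow as tf

HIDDEN, NUM_LABELS = 768, 3                        # NUM_LABELS is task-specific

# (1)/(2) Classification: a softmax on top of the [CLS] representation.
cls_head = tf.keras.layers.Dense(NUM_LABELS)
def classify(cls_output):                          # cls_output: [batch, HIDDEN]
    return tf.nn.softmax(cls_head(cls_output), axis=-1)

# (3) Span prediction: project every token representation T_i to a start and an
# end logit, then softmax over the sequence dimension.
span_head = tf.keras.layers.Dense(2)
def span_probs(sequence_output):                   # [batch, seq_len, HIDDEN]
    logits = span_head(sequence_output)            # [batch, seq_len, 2]
    start_logits, end_logits = tf.unstack(logits, axis=-1)
    start_probs = tf.nn.softmax(start_logits, axis=-1)
    end_probs = tf.nn.softmax(end_logits, axis=-1)
    return start_probs, end_probs                  # argmax of each gives the answer span
```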

The paper also studies the effect of taking different embeddings (e.g. outputs of different layers) as features

Limitations

  • Not well suited to generation tasks (machine translation, text summarization)

[[Ref]]


TCN

  • In a TCN, the residual input and output can have different widths (channel counts); figure (c) in the paper uses a 1×1 convolution to match them (see the residual-block sketch under "Characteristics" below)

    • Alternatively, the channels can be matched directly by zero padding

TCN = 1D FCN + causal convolutions

Characteristics

  • Uses causal convolutions, so no information from the future leaks into the prediction.

    • The paper emphasizes comparison against RNN-style methods, so causality must be respected.
  • Can take a sequence of arbitrary length and map it to an output sequence of the same length.

  • Combining [[ResNet]]-style residual connections with dilated convolutions lets the network go deep and enlarges the receptive field (a minimal residual-block sketch follows this list).
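A minimal residual-block sketch, assuming tf.keras layers (the paper's block additionally uses weight normalization and dropout, omitted here for brevity):

```python
import tensorflow as tf

def tcn_residual_block(x, filters, kernel_size=3, dilation=1):
    """Two dilated causal convolutions plus a skip connection."""
    y = tf.keras.layers.Conv1D(filters, kernel_size, padding='causal',
                               dilation_rate=dilation, activation='relu')(x)
    y = tf.keras.layers.Conv1D(filters, kernel_size, padding='causal',
                               dilation_rate=dilation, activation='relu')(y)
    # If the residual input and output widths (channel counts) differ, a 1x1
    # convolution matches them; zero-padding the channels would also work.
    if x.shape[-1] != filters:
        x = tf.keras.layers.Conv1D(filters, 1)(x)
    return tf.keras.layers.Activation('relu')(x + y)

# Stacking blocks with dilations 1, 2, 4, ... deepens the network and grows the
# receptive field exponentially.
inp = tf.keras.Input(shape=(None, 16))
out = tcn_residual_block(inp, filters=32, kernel_size=3, dilation=2)
model = tf.keras.Model(inp, out)
```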

Details

  • A TCN has no pooling layers

  • Normalization is done with weight normalization, which is better suited to sequence problems

Ways to increase the receptive field

  • A larger kernel_size (adds parameters; large kernels perform worse, and an overly large kernel degenerates into a fully connected layer)

  • [[Dilated convolution]] (receptive-field calculation sketched below)
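A quick receptive-field calculation for the dilated case, assuming dilations of 1, 2, 4, ... and the standard two causal convolutions per residual block:

```python
def tcn_receptive_field(kernel_size, num_levels, convs_per_block=2):
    """Receptive field of a TCN whose level i uses dilation 2**i.

    With one convolution per level this reduces to
    1 + (kernel_size - 1) * (2**num_levels - 1).
    """
    return 1 + convs_per_block * (kernel_size - 1) * (2 ** num_levels - 1)

print(tcn_receptive_field(kernel_size=3, num_levels=4))   # 61 past time steps visible
```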

Requirements of the sequence-modeling setting

    1. The input and output matrices have the same size
    2. Information from time steps that have not yet occurred must not be used (causal convolution)

[[ETA model]] implementation

  • tf.nn.conv1d(input, filters, stride, padding, data_format='NWC', dilations=None, name=None)
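A minimal sketch of how the tf.nn.conv1d call above can be made causal: left-pad the time axis by (kernel_size - 1) * dilation and convolve with 'VALID' padding. Toy shapes and random weights, not the actual ETA model code.

```python
import tensorflow as tf

def causal_dilated_conv1d(x, filters, kernel_size=3, dilation=1):
    """x: [batch, time, in_channels] ('NWC'); filters: [kernel_size, in_channels, out_channels]."""
    pad = (kernel_size - 1) * dilation
    x = tf.pad(x, [[0, 0], [pad, 0], [0, 0]])        # pad only on the left, i.e. the past
    return tf.nn.conv1d(x, filters, stride=1, padding='VALID', dilations=dilation)

# Toy usage.
x = tf.random.normal([8, 100, 16])                   # batch=8, 100 time steps, 16 channels
w = tf.random.normal([3, 16, 32])                    # kernel_size=3, 16 -> 32 channels
y = causal_dilated_conv1d(x, w, kernel_size=3, dilation=2)
print(y.shape)                                       # (8, 100, 32): same length, only past context
```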

@ETA Prediction with Graph Neural Networks in Google Maps

[[Abstract]]

  • Travel-time prediction constitutes a task of high importance in transportation networks, with web mapping services like Google Maps regularly serving vast quantities of travel time queries from users and enterprises alike. Further, such a task requires accounting for complex spatiotemporal interactions (modelling both the topological properties of the road network and anticipating events—such as rush hours—that may occur in the future). Hence, it is an ideal target for graph representation learning at scale. Here we present a graph neural network estimator for estimated time of arrival (ETA) which we have deployed in production at Google Maps. While our main architecture consists of standard GNN building blocks, we further detail the usage of training schedule methods such as MetaGradients in order to make our model robust and production-ready. We also provide prescriptive studies: ablating on various architectural decisions and training regimes, and qualitative analyses on real-world situations where our model provides a competitive edge. Our GNN proved powerful when deployed, significantly reducing negative ETA outcomes in several regions compared to the previous production baseline (40+% in cities like Sydney).

[[Attachments]]


@TabNet: Attentive Interpretable Tabular Learning

[[Abstract]]

  • We propose a novel high-performance and interpretable canonical deep tabular data learning architecture, TabNet. TabNet uses sequential attention to choose which features to reason from at each decision step, enabling interpretability and more efficient learning as the learning capacity is used for the most salient features. We demonstrate that TabNet outperforms other neural network and decision tree variants on a wide range of non-performance-saturated tabular datasets and yields interpretable feature attributions plus insights into the global model behavior. Finally, for the first time to our knowledge, we demonstrate self-supervised learning for tabular data, significantly improving performance with unsupervised representation learning when unlabeled data is abundant.

[[Attachments]]