ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks


Contents

ViLBERT: Extending BERT to Jointly Represent Images and Text
Experimental Settings
References

ViLBERT: Vision-and-Language BERT

ViLBERT: Extending BERT to Jointly Represent Images and Text

Two-stream Architecture: ViLBERT adopts a two-stream architecture, in which two parallel BERT-style models separately model the image region features $v_1, \ldots, v_{\mathcal{T}}$ and the text input $w_0, \ldots, w_T$ (the text stream's parameters can be initialized from a pretrained BERT). Each stream consists of a series of transformer blocks (TRM) and co-attentional transformer layers (Co-TRM), where the Co-TRM layers are used to facilitate information exchange between the modalities. The model finally outputs $(h_{v_0}, \ldots, h_{v_{\mathcal{T}}})$ and $(h_{w_0}, \ldots, h_{w_T})$.

Note that information exchange between the two streams is restricted to specific layers. Moreover, since the input image region features are already high-level features produced by a CNN, the text stream goes through additional processing before interacting with the visual features (this structure allows for variable depths for each modality and enables sparse interaction through co-attention). Co-Attentional Transformer Layers (Co-TRM). In a Co-TRM layer, each stream computes standard multi-headed attention, except that the queries come from one modality while the keys and values come from the other, so the visual stream attends over linguistic features and the linguistic stream attends over visual features.
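As an illustration, here is a minimal PyTorch sketch of one such co-attentional block, assuming a shared hidden size across the two streams and omitting the feed-forward sublayers; the class name, hidden size, and head count are illustrative choices, not the paper's exact configuration.

```python
import torch
import torch.nn as nn

class CoAttentionBlock(nn.Module):
    """Sketch of a co-attentional transformer (Co-TRM) block: each stream's
    queries attend over the other stream's keys/values."""

    def __init__(self, dim: int = 768, num_heads: int = 12):
        super().__init__()
        # visual queries attend over linguistic keys/values, and vice versa
        self.vis_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.txt_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.vis_norm = nn.LayerNorm(dim)
        self.txt_norm = nn.LayerNorm(dim)

    def forward(self, vis: torch.Tensor, txt: torch.Tensor):
        # vis: (B, T_v, dim) image-region features; txt: (B, T_w, dim) token features
        vis_out, _ = self.vis_attn(query=vis, key=txt, value=txt)
        txt_out, _ = self.txt_attn(query=txt, key=vis, value=vis)
        # residual connection + layer norm, as in a standard transformer sublayer
        return self.vis_norm(vis + vis_out), self.txt_norm(txt + txt_out)

if __name__ == "__main__":
    block = CoAttentionBlock()
    v = torch.randn(2, 36, 768)   # up to 36 region features per image
    w = torch.randn(2, 20, 768)   # 20 text tokens
    hv, hw = block(v, w)
    print(hv.shape, hw.shape)     # (2, 36, 768) and (2, 20, 768)
```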

Image Representations. The image region features are the visual features of the bounding boxes extracted by a pretrained Faster R-CNN; selected boxes must exceed a confidence threshold, and only 10 to 36 high-scoring boxes are kept per image. Because image regions lack a natural ordering, the spatial location of each region is instead encoded as a 5-d vector containing the region position (normalized top-left and bottom-right coordinates) and the fraction of image area covered. This vector is projected to the same dimension as the visual features and added to them, yielding the final image representations. Finally, a special [IMG] token is prepended to the image feature sequence to represent the entire image (i.e. mean-pooled visual features with a spatial encoding corresponding to the entire image).

Training Tasks and Objectives. (The pretraining dataset is Conceptual Captions.) (1) Masked multi-modal modelling: analogous to BERT's MLM, 15% of the words and image regions are randomly masked (a masked image region is zeroed out with 90% probability; words are handled exactly as in BERT), and the model must reconstruct the masked words or predict the semantic classes of the masked image regions (minimizing a KL divergence). (2) Multi-modal alignment prediction: the model must predict whether an image and a text are matched. $h_{\text{IMG}}$ and $h_{\text{CLS}}$ are taken as holistic representations of the visual and linguistic inputs; their element-wise product is fed into a linear layer to obtain the final prediction (negative samples are obtained by randomly replacing either the image or the text of a matched pair).
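As a rough illustration of the details above, here is a minimal PyTorch sketch of the 5-d spatial encoding for image regions, the alignment-prediction head, and a KL-divergence loss for masked regions. The module names, hidden sizes, and projection layers are assumptions for illustration (including the assumption that both streams share one hidden size), not the authors' released implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def region_spatial_encoding(boxes: torch.Tensor, img_w: float, img_h: float) -> torch.Tensor:
    """Encode each region as a 5-d vector: normalized (x1, y1, x2, y2) plus
    the fraction of image area covered. boxes: (..., 4) in pixel coordinates."""
    x1, y1, x2, y2 = boxes.unbind(dim=-1)
    area_frac = ((x2 - x1) * (y2 - y1)) / (img_w * img_h)
    return torch.stack([x1 / img_w, y1 / img_h, x2 / img_w, y2 / img_h, area_frac], dim=-1)

class ImageEmbedding(nn.Module):
    """Project region visual features and their 5-d spatial encodings to a
    common dimension and sum them (dimensions here are illustrative)."""

    def __init__(self, feat_dim: int = 2048, hidden_dim: int = 1024):
        super().__init__()
        self.feat_proj = nn.Linear(feat_dim, hidden_dim)
        self.loc_proj = nn.Linear(5, hidden_dim)

    def forward(self, feats, boxes, img_w: float, img_h: float):
        loc = region_spatial_encoding(boxes, img_w, img_h)
        return self.feat_proj(feats) + self.loc_proj(loc)

class AlignmentHead(nn.Module):
    """Multi-modal alignment prediction: element-wise product of h_IMG and
    h_CLS, followed by a linear layer scoring whether image and text match."""

    def __init__(self, hidden_dim: int = 1024):
        super().__init__()
        self.classifier = nn.Linear(hidden_dim, 2)

    def forward(self, h_img: torch.Tensor, h_cls: torch.Tensor):
        return self.classifier(h_img * h_cls)   # (B, 2) logits: aligned / not aligned

def masked_region_loss(pred_logits, target_dist, mask):
    """KL divergence between the predicted class distribution for masked regions
    and the detector's (soft) class distribution; mask marks masked regions."""
    log_pred = F.log_softmax(pred_logits, dim=-1)
    kl = F.kl_div(log_pred, target_dist, reduction="none").sum(-1)
    return (kl * mask).sum() / mask.sum().clamp(min=1)
```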

Experimental Settings

We apply our pretrained model as a base for four established vision-and-language tasks – Visual Question Answering (VQA), Visual Commonsense Reasoning (VCR) (Q $\rightarrow$ A, QA $\rightarrow$ R), Grounding Referring Expressions (localize an image region given a natural language reference), and Caption-Based Image Retrieval – setting state-of-the-art on all four tasks.

References

Lu, J., Batra, D., Parikh, D., & Lee, S. (2019). ViLBERT: Pretraining Task-Agnostic Visiolinguistic Representations for Vision-and-Language Tasks. NeurIPS 2019.
