
Paper: "Visual Question Answering: A Tutorial" (key excerpts and annotations)

Date: 2021-10-21 11:00:25


Abstract:

Tremendous advances have been seen in the field of computer vision due to the success of deep learning, in particular on low- and mid-level tasks, such as image segmentation or object recognition. These advances have fueled researchers' confidence for tackling more complex tasks that combine vision with language and high-level reasoning.

This article presents the ongoing work in the field and the current approaches to VQA based on deep learning.

While the field of VQA has seen recent successes, it remains a largely unsolved task.

Introduction:

Deep visual understanding can be defined as the ability of an algorithm to extract high-level information from images and to perform reasoning based on that information. In this regard, VQA is an alternative to other tasks proposed to evaluate this capability. Examples include the visual Turing test [23], the task of image captioning [20], [73], and recent works on visual dialogs [18].

A second parallel motivation for the study of VQA is its utility in its own right.

Note, however, that current VQA data sets do not directly address this setting, because questions are typically collected in a nongoal-oriented setting. (Translator's note: presumably meaning the questions are not asked with a specific goal in mind, so the collected data lack that targeted character.)

Realistic, motivated questions would likely require information not present in the image and involve rare words and concepts.

Historically, one of the earliest integrations of computer vision with language was the SHRDLU system dating back to 1972 [78].

Note:

SHRDLU, built by T. Winograd at MIT in 1972, was the first system to combine images with NLP: it let users direct a robot through a series of actions using natural-language commands. By combining syntactic parsing, semantic analysis, and logical reasoning, the system greatly strengthened its language-analysis capabilities. For example:

Human: PICK UP A BIG RED BLOCK.

Machine: OK. (picks up the big red block)

However, these early works were often limited to specific domains and/or simple language.

Deep learning has now been applied to virtually every problem imaginable in computer vision, and convolutional neural networks (CNNs) are approaching human performance in tasks such as image segmentation [39] or object recognition [19], [24].

Task definition and data sets:

The task for the machine is to determine the correct answer, which is, in current data sets, typically a few words or a short phrase.

Two practical variants are usually considered, an open-ended and a multiple-choice setting [5], [92]. In the latter, a set of candidate answers are proposed. This makes the evaluation of a generated answer easier than in the open-ended setting, where the comparison between the machine's output and a ground truth (i.e., human provided) answer faces issues with synonyms.

VQA is also related to the task of textual question answering [10], [14], [88], in which the answer is to be found in a textual narrative (i.e., reading comprehension) or in large knowledge bases (KBs) (i.e., information retrieval).

The additional challenge of a visual input is significant because images are simply much higher dimensional than text. Images capture the richness of the real world in a noisy manner, whereas natural language already represents a certain level of abstraction.

While, to some extent, the processing of language is possible with discrete and rule-based approaches, such as syntactic parsers and regular expression matching, the complexity of images renders such engineered methods intractable.

Modern computer vision is based on statistical learning, and recent works combining vision and language (including image captioning and VQA) similarly evolved from machine-learning techniques.

The two tasks are complementary as they evaluate different capabilities. Captioning requires mostly descriptive capabilities that involve almost purely visual information. VQA, in comparison, often requires reasoning with common sense and with other information not present in the given image.

Data sets for training and evaluating VQA:

We now examine data sets that have been specifically compiled for research on VQA. These data sets contain, at a minimum, triples each made of an image, a question, and its correct answer.

Those data sets are designed for both evaluating and training VQA systems in a supervised setting, and the latter demands large amounts of data. As will be discussed in the section "Directions of Current and Future Research," this very need for large amounts of data is a significant limit of current approaches.

For the purpose of standardized comparisons and benchmarking of different algorithms, data sets are split into predetermined sets of instances for training, validation, and testing.

Existing data sets vary mainly along three dimensions: 1) their size, i.e., the number and variety of concepts represented in the images and questions; 2) the amount of required reasoning, e.g., whether the detection of a single object is sufficient or whether inference is required over multiple facts or concepts; and 3) how much information beyond what is present in the input image is necessary to infer an answer, e.g., common sense or subject-specific information.

Most data sets lean toward visual-level questions and require little external knowledge beyond common sense. These characteristics reflect the fact that current state-of-the-art methods still struggle with simple visual questions.

The first VQA data set designed as a benchmark was the DAtaset for QUestion Answering on Real-world images (DAQUAR) [45].

Note:

The first important VQA data set was DAQUAR. It contains 6,794 training and 5,674 test question/answer pairs, with images taken from the NYU-Depth V2 data set, i.e., roughly nine QA pairs per image on average. The data set's quality is rather poor: some images are cluttered and low-resolution, and the questions and answers contain obvious grammatical errors.

The most popular modern data sets [5], [35], [92] use images sourced from Microsoft Common Objects in Context (COCO) [40], a data set initially devised for image recognition, which is itself composed of images from Flickr.

Note:

1. COCO is a large, rich data set for object detection, segmentation, and captioning. It targets scene understanding, with images drawn mostly from complex everyday scenes and objects localized by precise segmentation. The images cover 91 object categories, with 328,000 images and 2,500,000 labels. It is so far the largest data set with semantic segmentation annotations, providing 80 categories, over 330,000 images (200,000 of them labeled), and more than 1.5 million object instances in total.

2. Flickr is an image- and video-hosting website. Ludicorp built the web-services suite and online community that became Flickr, and Yahoo later acquired it. Besides storing users' personal photos, Flickr also lets them share pictures to blogs and social media.

VQA-real:

The most widely used data set is currently the one proposed by a team of researchers from Virginia Tech and is commonly referred to as VQA [5]. It comprises two parts, one using natural images named VQA-real, and a second one with clipart images named VQA-abstract (discussed at the end of this section).

Visual Genome and Visual7W:

The Visual Genome QA data set [35] is currently the largest one designed for VQA, with 1.7 million question/answer pairs.

Note:

Visual Genome: its images come from COCO and YFCC100M, 108,249 images in total, with 1.7 million QA pairs; as of this paper's publication it was the largest VQA data set. Questions follow six Ws: What, Where, How, When, Who, and Why. Its answers are markedly more diverse than in other data sets and tend to contain more words, and there are no yes/no questions.

The Visual7w [92] data set is a subset of the Visual Genome that allows evaluation in a multiple-choice setting, as each question is provided with four plausible but incorrect candidate answers.

Note:

Visual7W: an extension of the data set above, where the seven Ws are What, Where, How, When, Who, Why, and Which. It contains 47,300 images. To support precise answers, bounding boxes are used to delimit the four candidate answers.

Zero-shot VQA:

A special version of the Visual7W data set was proposed in [70]. The authors redefined the training and test splits such that every test instance includes one or several words that were not present in any training example.

Clipart images:

Data sets for VQA have also been proposed with synthetic clipart images (referred to as abstract scenes in [5]). These images were created manually with cartoon representations of characters and objects from a predefined set.

That data set contains only binary (yes/no) questions and each question appears twice in the data set, with two different images that give rise to opposite answers.

Despite undeniable advantages, VQA data sets of clipart images have seen little use [5], [69], [90] compared to their counterparts of real images.

Video-based QA:

Zhu et al. [91] assembled a data set of over 100,000 videos and 400,000 questions, using existing collections of videos from different domains, from cooking scenarios to movies and web videos.

Evaluation:

VQA systems are evaluated by inferring the answers on the test split of a given data set. Recent data sets [92] recommend the multiple-choice setting, since there is only one correct answer among the multiple choices. The evaluation is thus straightforward, as one can simply measure the mean accuracy over test questions. In an open-ended setting, several answers could be equally valid, because of synonyms and paraphrasing.

The usual workaround is to restrict answers, at the time of the creation of the data sets, to short phrases, typically one to three words.
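To make the protocol concrete, here is a minimal sketch of mean-accuracy evaluation in the open-ended setting. The `normalize` helper and its rules are illustrative assumptions only; real benchmarks use more elaborate normalization (and VQA [5] additionally scores against multiple human answers).

```python
# Minimal sketch of open-ended VQA evaluation by mean accuracy.
# `normalize` is a hypothetical helper that softens the most trivial
# phrasing differences; it is not any benchmark's official metric.
def normalize(answer: str) -> str:
    # Lowercase, strip trailing punctuation, and drop articles so that
    # trivially different phrasings of the same short answer match.
    answer = answer.lower().strip().rstrip(".")
    words = [w for w in answer.split() if w not in {"a", "an", "the"}]
    return " ".join(words)

def mean_accuracy(predictions, ground_truths):
    # Exact string match after normalization, averaged over questions.
    correct = sum(normalize(p) == normalize(g)
                  for p, g in zip(predictions, ground_truths))
    return correct / len(ground_truths)

print(mean_accuracy(["Two dogs.", "red"], ["two dogs", "blue"]))  # 0.5
```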

Deep neural networks for VQA:

The common approach to VQA is to train a deep neural network with supervision, which maps the given image and question to a relative scoring of candidate answers. The main idea is to learn a joint embedding of the visual and textual inputs.

Image encoding:

On the computer vision side, the input image x_I is processed with a deep convolutional neural network (CNN) to extract image features described as a vector y_I.

In comparison to classical handcrafted image features such as scale-invariant feature transform (commonly known as SIFT) [41] or histogram of oriented gradients (commonly known as HOG) [16], CNN features provide higher-level representations of the contents of the image, and are naturally produced as a fixed-size vector. The size of this vector is typically in the order of 1,024 or 2,048.

Note:

Scale-invariant feature transform (SIFT) is a computer-vision algorithm for detecting and describing local features in images: it searches for extrema in scale space and extracts their position, scale, and rotation-invariant descriptors. The algorithm was published by David Lowe in 1999 and later refined. In computer vision and digital image processing, the histogram of oriented gradients (HOG) is a shape- and edge-based descriptor for object detection; its basic idea is that gradient information reflects the edges of image targets well, and that local appearance and shape can be characterized by the distribution of local gradient magnitudes.
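As an illustration, the following sketch extracts such a fixed-size feature vector with a pretrained CNN. It assumes PyTorch and torchvision are available; the choice of ResNet-50 and the placeholder image are ours, not prescribed by the paper.

```python
# A minimal sketch of extracting a fixed-size image feature vector y_I.
# The 2,048-d vector is the pooled output of a ResNet-50 with its
# classification head removed.
import torch
import torchvision.models as models
import torchvision.transforms as T
from PIL import Image

cnn = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
cnn.fc = torch.nn.Identity()    # drop the classifier, keep pooled features
cnn.eval()

preprocess = T.Compose([
    T.Resize(256), T.CenterCrop(224), T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

image = Image.new("RGB", (640, 480))        # placeholder for a real photo
with torch.no_grad():
    x_I = preprocess(image).unsqueeze(0)    # (1, 3, 224, 224)
    y_I = cnn(x_I)                          # (1, 2048)
```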

Question encoding:

Initially, the ith word of the question is represented by an index x_i^Q in the input vocabulary. Each word is then turned into a vector.

This uses a mapping implemented as a lookup table W[·] that associates the index of any word of the input vocabulary to a learned vector.

An alternative implementation initially represents each word with a one-hot vector (a vector of all zeros, except for a one at the location of the word index in the vocabulary), which is then multiplied with a dense weight matrix that contains the embeddings of all words.

Note:

One-hot encoding uses an N-bit state register to encode N states: each state has its own register bit, and only one bit is active at any time.
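The equivalence between the lookup-table view and the one-hot view can be shown in a few lines. This is a toy sketch assuming PyTorch, with illustrative vocabulary and embedding sizes.

```python
# Two equivalent implementations of the word-embedding step.
import torch
import torch.nn as nn

vocab_size, embed_dim = 10_000, 300
embedding = nn.Embedding(vocab_size, embed_dim)  # the lookup table W[.]

word_index = torch.tensor([42])          # index x_i^Q of the ith word
v_lookup = embedding(word_index)         # direct table lookup

# Equivalent formulation: one-hot vector times the dense weight matrix.
one_hot = torch.zeros(1, vocab_size)
one_hot[0, 42] = 1.0
v_matmul = one_hot @ embedding.weight    # same vector as v_lookup

assert torch.allclose(v_lookup, v_matmul)
```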

A simple option for this purpose is to make a bag-of-words (BoW), which corresponds to simply averaging the word vectors.

Note:

The bag-of-words model was first used in text classification to represent a document as a feature vector. Its basic idea is to ignore word order, grammar, and syntax for a given text and treat it simply as a collection of independent words. In short, every document is viewed as a bag (it holds words, hence "bag of words"), and classification looks only at which words the bag contains. If a document has many words like pig, horse, cow, sheep, valley, land, and tractor, and few like bank, building, car, and park, we lean toward judging it a document describing the countryside rather than a town.

Another popular option is to feed the word vectors into a recurrent neural network (RNN) such as a long short-term memory (LSTM). An RNN processes words sequentially and can capture the sequential relationships between them. In comparison, a BoW does not account for word order.
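A minimal sketch contrasting the two question encoders, assuming PyTorch; the vocabulary, hidden sizes, and the toy question are all illustrative.

```python
# Bag-of-words versus LSTM encoding of the same question.
import torch
import torch.nn as nn

embedding = nn.Embedding(10_000, 300)            # word lookup table
question = torch.tensor([[12, 7, 42, 3]])        # word indices, (batch=1, T=4)
word_vecs = embedding(question)                  # (1, 4, 300)

# Bag-of-words: average the word vectors; word order is lost.
q_bow = word_vecs.mean(dim=1)                    # (1, 300)

# LSTM: process the words sequentially; the final hidden state
# retains information about word order.
lstm = nn.LSTM(input_size=300, hidden_size=512, batch_first=True)
outputs, (h_n, c_n) = lstm(word_vecs)
q_lstm = h_n[-1]                                 # (1, 512)
```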

Combination of image and question features:

They are each passed through a learned function before being combined. The intuition here is to map the features to a joint space, in which distances between both modalities become comparable.
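For instance, the mapping into a joint space might look as follows. This is a hedged sketch assuming PyTorch, using an element-wise product as the combination (one of several options listed under "Variations" below); the projection sizes and random inputs are illustrative.

```python
# Projecting each modality into a common space before combining.
import torch
import torch.nn as nn

y_I = torch.randn(1, 2048)     # image features from the CNN
q = torch.randn(1, 512)        # question encoding from the LSTM

proj_image = nn.Sequential(nn.Linear(2048, 1024), nn.Tanh())
proj_quest = nn.Sequential(nn.Linear(512, 1024), nn.Tanh())

z = proj_image(y_I) * proj_quest(q)   # joint embedding, (1, 1024)
```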

Output:

The output stage of a VQA system can be seen either as a generation or as a classification task.

The generation of a free-form answer has the advantage of being able to compose complex sentences. In practice, however, such a model is difficult to learn [22], [46], [80]. Current data sets are limited to short answers, and a practical alternative is to rather learn a classifier over candidate answers [22], [44], [46], [57].

For training the model, the classifier is followed by a cross-entropy loss, and the whole network is trained end-to-end by backpropagation to minimize this loss over the set of training examples.
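A minimal sketch of this training objective, assuming PyTorch; the candidate-answer count, embedding size, and target index are illustrative.

```python
# Answer classifier over a fixed candidate set, trained with
# cross-entropy by backpropagation.
import torch
import torch.nn as nn

z = torch.randn(1, 1024, requires_grad=True)  # joint image/question embedding
num_answers = 2000                  # candidate answers seen during training
classifier = nn.Linear(1024, num_answers)
criterion = nn.CrossEntropyLoss()

logits = classifier(z)              # relative scores of candidate answers
target = torch.tensor([137])        # index of the ground-truth answer
loss = criterion(logits, target)
loss.backward()                     # end-to-end training by backpropagation
```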

Variations:

Encoding the question and the image with a single recurrent neural network (an LSTM) by passing the image features together with each word embedding [22] or only once prior to the question words [46], [57].

Encoding the question with a bidirectional RNN.

Adding additional multiplicative interactions within the network and between the features of the image and of the question.

Alternative schemes for combining image and question representations, such as element-wise sums and products [33], bilinear operations [30] such as multimodal compact bilinear pooling (MCB) [21], etc.

Note:

Fukui et al. proposed a pooling method, multimodal compact bilinear pooling (MCB), to fuse the two feature vectors: it randomly projects the image features and text features into a higher-dimensional space, where the convolution of the two vectors can be computed as a multiplication in the Fourier domain.
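Below is a hedged sketch of MCB-style fusion, assuming PyTorch: each vector is compressed by a count sketch (fixed random signs and target indices), and the circular convolution of the two sketches is computed as an element-wise product in the Fourier domain. Dimensions and seeding are illustrative, and details differ from the full method of Fukui et al. [21] (e.g., spatial pooling, normalization).

```python
# Sketch of multimodal compact bilinear pooling (MCB).
import torch

def count_sketch(v, h, s, out_dim):
    # Randomly project v into out_dim dims: sketch[h[i]] += s[i] * v[i].
    sketch = torch.zeros(out_dim)
    sketch.index_add_(0, h, s * v)
    return sketch

d_img, d_txt, D = 2048, 512, 8000
torch.manual_seed(0)
# One fixed random target index and sign per input dimension.
h1, s1 = torch.randint(D, (d_img,)), torch.randint(2, (d_img,)) * 2.0 - 1
h2, s2 = torch.randint(D, (d_txt,)), torch.randint(2, (d_txt,)) * 2.0 - 1

x, y = torch.randn(d_img), torch.randn(d_txt)   # image / question features
# Circular convolution of the sketches == element-wise product in the
# Fourier domain.
fused = torch.fft.irfft(torch.fft.rfft(count_sketch(x, h1, s1, D)) *
                        torch.fft.rfft(count_sketch(y, h2, s2, D)), n=D)
```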

Advanced techniques:

Attention mechanisms:

One of the most effective improvements to the joint embedding model is to use visual attention. Humans have the ability to quickly understand visual representations by attending to regions of the image instead of processing the entire scene at once [58].

The main idea behind attention mechanisms is to allow the model to focus on certain regions of the image. The technique involves 1) using region-specific image features and 2) including multiplicative interactions within the neural network.

The attention weights computed for a given question/image can be visualized in the form of "attention maps" for purposes of introspection into the VQA model.
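A minimal sketch of question-guided attention over region features, assuming PyTorch. The region count (a 14x14 grid), feature sizes, and the scoring network are illustrative choices, not the specific architecture of any cited paper.

```python
# Question-guided soft attention over image regions.
import torch
import torch.nn as nn
import torch.nn.functional as F

regions = torch.randn(1, 196, 2048)      # (batch, num_regions, feat_dim)
q = torch.randn(1, 512)                  # question embedding

proj_v = nn.Linear(2048, 512)
score = nn.Linear(512, 1)

# Multiplicative interaction between each region and the question,
# scored and normalized into attention weights.
joint = torch.tanh(proj_v(regions) * q.unsqueeze(1))   # (1, 196, 512)
alpha = F.softmax(score(joint).squeeze(-1), dim=1)     # (1, 196)

# Attention-weighted sum of region features; `alpha` reshaped to a
# 14x14 grid is the "attention map" used for introspection.
attended = (alpha.unsqueeze(-1) * regions).sum(dim=1)  # (1, 2048)
```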

Pretraining language representations:

Each word of the input vocabulary (i.e., any word appearing in the training set) is associated with its own embedding, and those embeddings are normally learned alongside the other parameters of the network via backpropagation.

A solution to these issues is to pretrain word embeddings on a larger auxiliary data set. This practice is well known in the field of natural language processing and has shown benefit in many tasks besides VQA.

Popular methods for pretraining word embeddings include Global Vectors for Word Representation (GloVe) [53] and word2vec [48], which we outline next.

Note:

1. The main idea of GloVe is that ratios of co-occurrence probabilities discriminate between words better than the probabilities themselves. Suppose we want to represent the words "ice" and "steam." For a word related to ice but not steam, such as "solid," we expect P(solid|ice)/P(solid|steam) to be large; likewise, for a word related to steam but not ice, such as "gas," we expect P(gas|ice)/P(gas|steam) to be small. Conversely, for words like "water" that relate to both, or "fashion" that relates to neither, we expect the ratio to be close to 1.

2. The word2vec network structure is simple: an input layer, a hidden layer, and an output layer. The input layer is the one-hot vector of a (context) word over a vocabulary of V words; the output layer gives, for every word in the vocabulary, the probability that it co-occurs with the input word. The hidden layer has N neurons. There are two main variants, Skip-Gram and CBOW: intuitively, Skip-Gram predicts the context given the input word, while CBOW predicts the input word given the context.
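A minimal sketch of initializing the embedding table from pretrained GloVe vectors, assuming PyTorch and a plain-text vector file in the usual "word v1 v2 ..." format; the file name and toy vocabulary are illustrative.

```python
# Initializing the embedding table with pretrained word vectors.
import torch
import torch.nn as nn

vocab = {"what": 0, "color": 1, "is": 2}         # toy input vocabulary
weights = torch.randn(len(vocab), 300) * 0.1     # fallback: random init

with open("glove.6B.300d.txt", encoding="utf-8") as f:
    for line in f:
        word, *values = line.rstrip().split(" ")
        if word in vocab:                        # copy in pretrained vectors
            weights[vocab[word]] = torch.tensor([float(v) for v in values])

# freeze=False lets the embeddings keep training with the rest of the net.
embedding = nn.Embedding.from_pretrained(weights, freeze=False)
```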

Memory-augmented neural networks:

The variant proposed in [37] and [83], named dynamic memory networks (DMNs), was successfully applied to VQA. It is built around four modules (see Figure 5).

The input module transforms the input data into a set of discrete vectors called facts. A question module computes a vector representation of the question, using a gated recurrent unit (GRU, a variant of the LSTM). An episodic memory module retrieves the facts required to answer the question. Finally, the answer module uses the final state of the memory and the question to predict the final output, using a classic classifier over candidate answers.
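The following is a schematic, heavily simplified sketch of how the four modules fit together, assuming PyTorch; the attention gate, the number of memory passes, and all sizes are illustrative stand-ins for the actual DMN formulation in [37] and [83].

```python
# Schematic data flow through the four DMN modules.
import torch
import torch.nn as nn
import torch.nn.functional as F

d = 512
facts = torch.randn(1, 20, d)            # input module output: 20 "facts"
q = torch.randn(1, d)                    # question module output (GRU state)

attend = nn.Linear(d, 1)
memory_cell = nn.GRUCell(d, d)
answer_head = nn.Linear(2 * d, 2000)     # classifier over candidate answers

m = q                                    # initialize the episodic memory
for _ in range(3):                       # a few memory-update passes
    # Gate each fact by its interaction with the question and memory.
    g = F.softmax(attend(facts * (q + m).unsqueeze(1)).squeeze(-1), dim=1)
    context = (g.unsqueeze(-1) * facts).sum(dim=1)    # retrieved facts
    m = memory_cell(context, m)          # update the episodic memory

logits = answer_head(torch.cat([m, q], dim=-1))       # answer module
```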

Run-time retrieval of additional information:

One limitation of the basic joint embedding approach is to attempt to capture all of the information of training examples within the parameters of a neural network. This cannot scale arbitrarily, however. On one hand, any network has a finite capacity and, on the other hand, training examples also provide finite information.

Several works explored the idea of connecting a VQA system with external sources of information that can be virtually infinite (e.g., web searches) or extensible without needing to retrain the VQA model (e.g., structured KBs).

In [75] and [82], the authors train a model to interface with a KB. Such KBs, like DBpedia [7] and Freebase [12], are databases compiled with facts ranging from common sense to encyclopedic knowledge.

Directions of current and future research:

State-of-the-art methods have consistently improved performance on this data set over the past few years, from an accuracy of about 58% to over 70% today.

Issues of data set biases:

The text questions alone often provide strong cues that can be sufficient to answer them correctly, with no regard to the contents of the input image.

Zhang et al. [90] first proposed a data set of clipart images where each binary question is accompanied by two different images that elicit "yes" and "no" answers, respectively.

Issues with unknown and novel words:

The current paradigm of training VQA systems with supervision, i.e., with data sets of questions and their ground-truth answers, can only cover a limited set of objects and concepts. Although VQA data sets have grown in size, no finite set of exemplars will ever cover the diversity of objects, actions, relations, etc.

These benchmarks do not encourage addressing rare words and concepts, but rather focus on the concepts most frequent in the data set.

We expect that VQA will ultimately require similar principled approaches, such as differentiable computing [26], [50], rather than brute-force learning from limited sets of examples.

External knowledge:

This requires the system not only to capture actual information from training examples, but to learn to retrieve and use novel information, i.e., to learn to learn.

Modular approaches:

The VQA model then only needs to reason over an explicit, high-level representation of the image contents.

Compositional models:

Compositional models were proposed by Hendricks et al. on the task of image captioning [27]. Andreas et al. [4], [3], [29] were the first to propose a compositional architecture for VQA, named neural module networks.

An alternative approach that addresses compositionality is the relational network.

Conclusions:

We reviewed popular approaches based on deep learning, which treat the task as a classification problem over a set of candidate answers. We described the common joint embedding model, and additional improvements that build up on this concept, such as attention mechanisms.
