Visual Question Answering with Memory-Augmented Networks

Date: 2019-11-29

2018-05-15 20:15:03

Motivation:

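Although VQA has made great progress, existing methods still perform poorly on fully general, freeform VQA. The authors attribute this to the following two reasons:

    1. deep models trained with gradient based methods learn to respond to the majority of training data rather than specific scarce exemplars;

    In other words, deep models trained with gradient descent respond well to the bulk of the training data, but not to specific, scarce examples.

    2. existing VQA systems learn about the properties of objects from question-answer pairs, sometimes independently of the image.

    Selectively attending to certain regions of the image is therefore an important strategy.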

Inspired by recent memory-augmented neural networks and the co-attention mechanism, this paper uses a memory network to remember rare events, and then uses memory-augmented networks with attention to rare answers for VQA.

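The Proposed Algorithm:

As the overview figure in the original post shows, the pipeline is: first, features of the question and the image are extracted by embedding; next, co-attention is learned; the two attended features are then combined and fed into the memory network; finally, the answer is selected.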

Image Embedding: features are extracted with a pre-trained model.

Question Embedding: language features are learned with a bidirectional LSTM.

Sequential Co-attention:

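The co-attention mechanism here lets the image features and the text features influence each other, so that attention is computed jointly over both. We take the element-wise product of the averaged visual features and the averaged language features to obtain a base vector m0.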
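To make the base vector concrete, here is a minimal NumPy sketch. The function name `base_vector` and the toy shapes are my own, both feature sets are assumed to share the same dimension d, and any non-linearity the original formulation may apply to the averages is left out.

```python
import numpy as np

def base_vector(V, Q):
    """Element-wise product of the mean visual and mean question features.

    V: (num_regions, d) image region features (assumed shape)
    Q: (num_words, d) question word features (assumed shape)
    """
    v_mean = V.mean(axis=0)   # average over image regions
    q_mean = Q.mean(axis=0)   # average over question words
    return v_mean * q_mean    # element-wise product, shape (d,)

V = np.array([[1.0, 2.0], [3.0, 4.0]])   # 2 regions, d = 2
Q = np.array([[0.5, 1.0], [1.5, 3.0]])   # 2 words, d = 2
m0 = base_vector(V, Q)                   # means are [2, 3] and [1, 2]
```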

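We use a two-layer neural network to compute the soft attention. For visual attention, the network produces the soft attention weights over the image regions and, from them, the weighted visual feature vector.

Here Wv, Wm and Wh denote the weight matrices of the hidden layers. Similarly, we compute the weighted question feature vector.

We combine the weighted v and q to represent the input image-question pair. Figure 4 illustrates the whole co-attention process.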
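The attention step can be sketched as follows. Since the equations are not reproduced in this text, the exact layer form is an assumption: a tanh hidden layer that mixes each region feature with m0, followed by a linear scoring layer and a softmax. The names `soft_attention`, `k` and the random toy inputs are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))   # subtract the max for numerical stability
    return e / e.sum()

def soft_attention(V, m0, Wv, Wm, Wh):
    """Assumed two-layer soft attention.

    V: (N, d) features, m0: (d,) base vector,
    Wv, Wm: (k, d) first-layer weights, Wh: (k,) scoring weights.
    """
    H = np.tanh(V @ Wv.T) * np.tanh(Wm @ m0)   # (N, k) hidden layer, conditioned on m0
    alpha = softmax(H @ Wh)                    # (N,) attention weights, summing to 1
    v_att = alpha @ V                          # (d,) attention-weighted feature vector
    return alpha, v_att

rng = np.random.default_rng(0)
N, d, k = 5, 4, 3
V = rng.normal(size=(N, d))
m0 = rng.normal(size=d)
Wv, Wm, Wh = rng.normal(size=(k, d)), rng.normal(size=(k, d)), rng.normal(size=k)
alpha, v_att = soft_attention(V, m0, Wv, Wm, Wh)
```

The same function can be reused on the question word features to get the weighted question vector; what that second attention is conditioned on in the sequential scheme is not spelled out in this text, so it is left open here.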

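Memory Augmented Network:

RNNs lack external memory to maintain a long-term memory of scarce training data. This paper uses a memory-augmented neural network for VQA.

Specifically, we use a standard LSTM as the controller, whose role is to receive the input data and interact with the external memory module. The external memory, Mt, consists of a set of row vectors that serve as memory slots.

xt denotes the combination of the visual and textual features; yt is the corresponding one-hot encoded answer vector. xt is then fed into the LSTM controller, which produces the hidden state ht.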

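To read from the external memory, we use the hidden state ht as the query to Mt. First, we compute the cosine distance between the query vector ht and each row of the memory.

Then we compute a read weight vector wr by applying a softmax over the cosine distances.

With these read weights, a new retrieved memory rt is obtained as the read-weighted sum of the memory rows.

Finally, we combine the new memory vector rt with the controller hidden state ht to produce the output vector ot for learning the classifier.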
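The read path can be sketched in a few lines of NumPy. The sketch uses the normalized dot product (cosine similarity), on the assumption that this is what the "cosine distance" above refers to, so that the closest slot receives the largest weight. Combining ht and rt by concatenation, the name `read_memory` and the `eps` guard are also assumptions.

```python
import numpy as np

def read_memory(h, M, eps=1e-8):
    """h: (d,) controller hidden state; M: (slots, d) memory rows."""
    # cosine similarity between the query and every memory row
    sim = (M @ h) / (np.linalg.norm(M, axis=1) * np.linalg.norm(h) + eps)
    e = np.exp(sim - sim.max())      # stable softmax over the slots
    w_r = e / e.sum()                # read weights, shape (slots,)
    r = w_r @ M                      # retrieved memory, shape (d,)
    o = np.concatenate([h, r])       # output vector, shape (2d,)
    return w_r, r, o

M = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])   # 3 slots, d = 2
h = np.array([2.0, 0.0])                              # points the same way as slot 0
w_r, r, o = read_memory(h, M)
```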

We use the usage weights wu to control writing to memory. The usage weights are updated by decaying their previous state.
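A small sketch of the usage update. The text only states that the previous state is decayed; adding the current read and write weights on top, as done below, follows the usual memory-augmented network recipe and is an assumption here, as are the name `update_usage` and the decay parameter `gamma`.

```python
import numpy as np

def update_usage(w_u_prev, w_r, w_w, gamma=0.9):
    """Decay the previous usage, then add the current read and write weights (assumed form)."""
    return gamma * w_u_prev + w_r + w_w

w_u = update_usage(
    w_u_prev=np.array([1.0, 0.0, 0.5]),
    w_r=np.array([0.2, 0.8, 0.0]),
    w_w=np.array([0.0, 0.0, 1.0]),
    gamma=0.5,
)
```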

To compute the write weights, we introduce a truncation scheme to update the least-used positions. Here, m(v, n) denotes the n-th smallest element of a vector v. We use a learnable sigmoid gate parameter to compute a convex combination of the previous read weights and usage weights.

A larger n results in maintaining a longer term of memory of scarce training data. Compared with the internal memory cell of an LSTM, both parameters here can be used to adjust the rate of writing to external memory, which gives us more freedom to regulate how the model updates. The output hidden state ht in Eq. (12) is then written to memory according to the write weights.
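Here is a hedged sketch of the write path. It assumes the usage weights enter the convex combination through a 0/1 indicator of the n least-used slots (the truncation), and that writing adds ht to each slot in proportion to its write weight. The names `nth_smallest`, `write_weights`, `write_memory` and `gate_param` are hypothetical.

```python
import numpy as np

def nth_smallest(v, n):
    """m(v, n): the n-th smallest element of v (1-indexed)."""
    return np.sort(v)[n - 1]

def write_weights(w_r_prev, w_u_prev, gate_param, n):
    gate = 1.0 / (1.0 + np.exp(-gate_param))   # sigmoid gate in (0, 1)
    # truncation: 1 for the n least-used slots, 0 elsewhere
    least_used = (w_u_prev <= nth_smallest(w_u_prev, n)).astype(float)
    # convex combination of previous read weights and least-used indicator
    return gate * w_r_prev + (1.0 - gate) * least_used

def write_memory(M, w_w, h):
    """Add h to every memory row, scaled by that row's write weight (assumed additive write)."""
    return M + np.outer(w_w, h)

w_r_prev = np.array([0.7, 0.2, 0.1])
w_u_prev = np.array([0.9, 0.1, 0.5])     # slot 1 is the least used
w_w = write_weights(w_r_prev, w_u_prev, gate_param=0.0, n=1)   # gate = 0.5
M_new = write_memory(np.zeros((3, 2)), w_w, np.array([1.0, 2.0]))
```

With a larger n, more slots pass the truncation test and share the write, which is one way to read the remark that n controls how long scarce examples stay in memory.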

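Answer Reasoning:

Given the hidden state ht and the reading memory rt retrieved from the external memory, we combine the two as the representation of the current question and image, feed it into the classification network, and obtain the distribution over answers.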
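As a rough illustration of this last step, the sketch below uses a single linear layer followed by a softmax. The actual classification network is not described in this text, so the layer, the weights `W` and `b`, and the name `answer_distribution` are all assumptions.

```python
import numpy as np

def answer_distribution(h, r, W, b):
    """Softmax over answers from the concatenated [h; r] representation."""
    logits = W @ np.concatenate([h, r]) + b
    e = np.exp(logits - logits.max())   # stable softmax
    return e / e.sum()

h = np.array([1.0, 0.0])                # controller hidden state
r = np.array([0.0, 1.0])                # retrieved memory
W = np.array([[1.0, 0.0, 0.0, 0.0],     # 3 candidate answers, input size 4
              [0.0, 0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0, 3.0]])
b = np.zeros(3)
p = answer_distribution(h, r, W, b)     # logits are [1, 0, 3]
```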

--- Done!
