[Paper Reading] Show, Attend and Tell: Neural Image Caption Generation with Visual Attention

Date: 2019-12-09

Paper link: https://arxiv.org/pdf/1502.03044.pdf

Code links: https://github.com/kelvinxu/arctic-captions & https://github.com/yunjey/show-attend-and-tell & https://github.com/jazzsaxmafia/show_attend_and_tell.tensorflow

Main contributions

In this paper, the authors bring the attention mechanism to neural image captioning and propose two variants: the 'Soft' Deterministic Attention Mechanism and the 'Hard' Stochastic Attention Mechanism.

[Figure: overall framework of the "Show, Attend and Tell" model]

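The key question for the attention mechanism is how to compute the context vector z_t from the image feature vectors a_i. For each location i, the mechanism produces a score e_ti, which is normalized with a softmax over all locations to give a positive weight α_ti. In hard attention, α_ti is the probability that the image region vector a_i is selected at time t as the information passed to the decoder. Exactly one region is selected, so an indicator variable s_t,i is introduced that equals 1 when region i is selected and 0 otherwise. In soft attention, α_ti is the share that the image region vector a_i contributes to the information fed to the decoder at time t. (See also the posts "Attention機制論文閱讀——Soft和Hard Attention" and "Multimodal —— 看圖說話（Image Caption）任務的論文筆記（二）引入attention機制".)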
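The difference between the two mechanisms can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' code. In the paper the scores e_ti come from an attention network conditioned on the previous decoder state; here they are random stand-ins, and the function names `soft_context` and `hard_context` are hypothetical.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    x = x - np.max(x)
    exp_x = np.exp(x)
    return exp_x / exp_x.sum()

def soft_context(a, e):
    """Soft attention: z_t is the alpha-weighted sum of all region vectors.

    a: (L, D) region vectors a_i; e: (L,) unnormalized scores e_ti.
    """
    alpha = softmax(e)
    return alpha @ a, alpha

def hard_context(a, e, rng):
    """Hard attention: sample one region with probability alpha_ti.

    Returns z_t (the selected a_i) and the one-hot indicator s_t.
    """
    alpha = softmax(e)
    i = rng.choice(len(alpha), p=alpha)
    s = np.zeros_like(alpha)
    s[i] = 1.0
    return s @ a, s

rng = np.random.default_rng(0)
L, D = 196, 512                      # 196 regions, 512-dim features
a = rng.standard_normal((L, D))      # stand-in for the image region vectors
e = rng.standard_normal(L)           # stand-in scores (assumption, see above)

z_soft, alpha = soft_context(a, e)
z_hard, s = hard_context(a, e, rng)
print(z_soft.shape, z_hard.shape, int(s.sum()))
```

In both cases z_t has dimension D. Soft attention blends every region, while hard attention copies exactly one region vector.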

Experimental details

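The authors extract image features with a VGGNet that is pretrained on ImageNet and not fine-tuned. The feature map from block5_conv4 (Conv2D), of shape 14×14×512, is reshaped to 196×512 (L×D with L=196 and D=512, i.e. 196 image regions, each described by a 512-dimensional feature vector) to form the image region vectors a_i.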

To create the annotations a_i used by our decoder, we used the Oxford VGGnet pretrained on ImageNet without finetuning.

In our experiments we use the 14×14×512 feature map of the fourth convolutional layer before max pooling. This means our decoder operates on the flattened 196×512 (i.e L × D) encoding.
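The flattening step can be illustrated as follows. This is a minimal sketch in which a random array stands in for the real VGG output, so no pretrained network is loaded. The variable names are hypothetical.

```python
import numpy as np

# Stand-in for the 14x14x512 feature map (random values, an assumption;
# a real pipeline would take this from the pretrained VGG network).
rng = np.random.default_rng(0)
feature_map = rng.standard_normal((14, 14, 512)).astype(np.float32)

# Flatten the two spatial axes: each of the 14*14 = 196 grid cells
# becomes one 512-dimensional region vector a_i.
annotations = feature_map.reshape(-1, feature_map.shape[-1])

print(annotations.shape)
```

With NumPy's default row-major order, the region at grid position (row, col) ends up at index row * 14 + col of the flattened array.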

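The authors state that the initial cell state (init_c) and hidden state (init_h) of the decoder LSTM are determined by the feature vectors extracted from the image, passed through two separate multi-layer perceptrons (Multi-Layer Perceptron, MLP).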

The initial memory state and hidden state of the LSTM are predicted by an average of the annotation vectors fed through two separate MLPs (init,c and init,h).
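A rough sketch of this initialization is given below. The quoted passage only says that the mean annotation vector is fed through two separate MLPs. The layer count, the tanh activation, the weight scale, and the state size H = 1024 are all assumptions made for illustration, and the helpers `make_mlp` and `mlp` are hypothetical.

```python
import numpy as np

def make_mlp(rng, sizes):
    """Create (W, b) pairs for consecutive layer sizes (small random weights)."""
    return [(rng.standard_normal((m, n)) * 0.01, np.zeros(n))
            for m, n in zip(sizes[:-1], sizes[1:])]

def mlp(x, params):
    """Apply each layer followed by tanh (the activation is an assumed choice)."""
    for W, b in params:
        x = np.tanh(x @ W + b)
    return x

rng = np.random.default_rng(0)
L, D, H = 196, 512, 1024             # H is a hypothetical LSTM state size
a = rng.standard_normal((L, D))      # stand-in for the region vectors a_i

f_init_c = make_mlp(rng, [D, H])     # MLP for the initial cell state
f_init_h = make_mlp(rng, [D, H])     # separate MLP for the initial hidden state

a_mean = a.mean(axis=0)              # average over the 196 regions
init_c = mlp(a_mean, f_init_c)
init_h = mlp(a_mean, f_init_h)
print(init_c.shape, init_h.shape)
```

Because the two MLPs have their own parameters, init_c and init_h differ even though both start from the same averaged vector.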
