[Paper Reading] Show and Tell: A Neural Image Caption Generator

Paper link: https://arxiv.org/pdf/1411.4555.pdf

Code links: https://github.com/karpathy/neuraltalk, https://github.com/karpathy/neuraltalk2 & https://github.com/zsdonghao/Image-Captioning


Main Contributions

In this paper, the authors borrow from the field of Neural Machine Translation, bringing the Encoder-Decoder model into Neural Image Captioning and proposing an end-to-end model for the image captioning problem. Two figures taken from the paper are shown below: the first gives an overview of the NIC model, and the second shows the network in detail. The NIC network uses a convolutional neural network (CNN) as the encoder and a Long Short-Term Memory network (LSTM) as the decoder.

      Figure 1: Overview of the NIC model      Figure 2: Details of the NIC network
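To make the architecture concrete, here is a minimal PyTorch sketch of the encoder-decoder just described. It is not the authors' implementation (neuraltalk2 is written in Lua/Torch, and the paper uses a GoogLeNet encoder); `resnet18`, the class name `NIC`, and the default dimensions are stand-ins chosen to match the settings listed later in this post.

```python
# Minimal PyTorch sketch of the NIC encoder-decoder (illustrative only; the
# paper uses GoogLeNet as the encoder, here resnet18 stands in for it).
import torch
import torch.nn as nn
import torchvision.models as models

class NIC(nn.Module):
    def __init__(self, embed_dim=512, hidden_dim=512, vocab_size=12000):
        super().__init__()
        # Encoder: a CNN pre-trained on image classification; the classifier
        # head is dropped, and the pooled feature vector plays the role of the
        # "last hidden layer" that is fed to the decoder.
        cnn = models.resnet18(weights="IMAGENET1K_V1")
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])
        self.img_proj = nn.Linear(cnn.fc.in_features, embed_dim)
        # Decoder: word embedding + LSTM + projection back to the vocabulary.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # As in the paper, the image feature is fed to the LSTM only once,
        # as a "step -1" input that precedes the caption words.
        feats = self.img_proj(self.encoder(images).flatten(1)).unsqueeze(1)
        seq = torch.cat([feats, self.embed(captions)], dim=1)
        hidden, _ = self.lstm(seq)
        return self.out(hidden)  # per-step logits over the vocabulary
```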


Experimental Details

Hence, it is natural to use a CNN as an image "encoder", by first pre-training it for an image classification task and using the last hidden layer as an input to the RNN decoder that generates sentences.

An "encoder" RNN reads the source sentence and transforms it into a rich fixed-length vector representation, which in turn is used as the initial hidden state of a "decoder" RNN that generates the target sentence.

  •  In the paper, the authors propose training the network with Stochastic Gradient Descent. The official neuraltalk2 source provides several optimizers and their parameters for training (rmsprop, adagrad, sgd, ...; see neuraltalk2/misc/optim_updates.lua). zsdonghao/Image-Captioning trains with SGD, using an initial learning rate of 2.0 that decays by a factor of 0.5 every 8 epochs (a training sketch follows this list).

It is a neural net which is fully trainable using stochastic gradient descent.

The model is trained to maximize the likelihood of the target description sentence given the training image.

  • In neuraltalk2, the dimensionality of the LSTM input vector (the output of the embedding layer) and of the LSTM hidden state are both set to 512. zsdonghao/Image-Captioning uses the same settings.
  • In zsdonghao/Image-Captioning, the author sets vocabulary_size to 12000.
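Below is a training sketch matching the hyperparameters above (SGD, initial learning rate 2.0, halved every 8 epochs; embedding and hidden size 512; vocabulary size 12000). Again this is illustrative: zsdonghao/Image-Captioning is a TensorFlow project, and the dummy data loader and the epoch count used here are placeholders, not values from the repository.

```python
# Training sketch matching the settings above (illustrative; the dummy data,
# epoch count, and use of PyTorch are assumptions, not taken from the repos).
import torch

model = NIC(embed_dim=512, hidden_dim=512, vocab_size=12000)  # sketch above
optimizer = torch.optim.SGD(model.parameters(), lr=2.0)
# Halve the learning rate every 8 epochs (decay factor 0.5).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=8, gamma=0.5)
# Cross-entropy on the per-step word predictions maximizes the likelihood of
# the target description sentence given the training image.
criterion = torch.nn.CrossEntropyLoss()

# Stand-in for a real DataLoader: batches of (images, integer caption ids).
loader = [(torch.randn(4, 3, 224, 224), torch.randint(0, 12000, (4, 20)))]

for epoch in range(24):  # placeholder epoch count
    for images, captions in loader:
        # Teacher forcing: step 0 sees the image and predicts word 0; step t
        # sees word t-1 and predicts word t, so the target is the full caption.
        logits = model(images, captions[:, :-1])
        loss = criterion(logits.reshape(-1, logits.size(-1)),
                         captions.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```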

Copyright notice: This is an original post by the blog author. You are welcome to repost it, but please credit the author and link to the original.
