[Paper Reading] Show and Tell: A Neural Image Caption Generator

Paper link: https://arxiv.org/pdf/1411.4555.pdf

Code links: https://github.com/karpathy/neuraltalk, https://github.com/karpathy/neuraltalk2 & https://github.com/zsdonghao/Image-Captioning


Main Contributions

In this paper, the authors borrow from the field of Neural Machine Translation, bringing the Encoder-Decoder model into Neural Image Captioning and proposing an end-to-end model for the image captioning problem. Two figures taken from the paper are shown below: the first gives an overview of the NIC model, and the second shows the network in detail. The NIC network uses a convolutional neural network (CNN) as the encoder and a Long Short-Term Memory network (LSTM) as the decoder.

      Figure 1: Overview of the NIC model      Figure 2: Details of the NIC network
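To make the architecture concrete, here is a minimal PyTorch sketch of the encoder-decoder just described. It is not the authors' implementation (neuraltalk2 is written in Lua/Torch, and the paper uses a GoogLeNet encoder); `resnet18`, the class name `NIC`, and the default dimensions are stand-ins chosen to match the settings listed later in this post.

```python
# Minimal PyTorch sketch of the NIC encoder-decoder (illustrative only; the
# paper uses GoogLeNet as the encoder, here resnet18 stands in for it).
import torch
import torch.nn as nn
import torchvision.models as models

class NIC(nn.Module):
    def __init__(self, embed_dim=512, hidden_dim=512, vocab_size=12000):
        super().__init__()
        # Encoder: a CNN pre-trained on image classification; the classifier
        # head is dropped, and the pooled feature vector plays the role of the
        # "last hidden layer" that is fed to the decoder.
        cnn = models.resnet18(weights="IMAGENET1K_V1")
        self.encoder = nn.Sequential(*list(cnn.children())[:-1])
        self.img_proj = nn.Linear(cnn.fc.in_features, embed_dim)
        # Decoder: word embedding + LSTM + projection back to the vocabulary.
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, images, captions):
        # As in the paper, the image feature is fed to the LSTM only once,
        # as a "step -1" input that precedes the caption words.
        feats = self.img_proj(self.encoder(images).flatten(1)).unsqueeze(1)
        seq = torch.cat([feats, self.embed(captions)], dim=1)
        hidden, _ = self.lstm(seq)
        return self.out(hidden)  # per-step logits over the vocabulary
```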


Experimental Details

Hence, it is natural to use a CNN as an image "encoder", by first pre-training it for an image classification task and using the last hidden layer as an input to the RNN decoder that generates sentences.

An "encoder" RNN reads the source sentence and transforms it into a rich fixed-length vector representation, which in turn is used as the initial hidden state of a "decoder" RNN that generates the target sentence.

  •  In the paper, the authors propose training the network with Stochastic Gradient Descent. The official neuraltalk2 source provides several optimizers and their parameters for training (rmsprop, adagrad, sgd, ...; see neuraltalk2/misc/optim_updates.lua). zsdonghao/Image-Captioning trains with SGD, using an initial learning rate of 2.0 that decays by a factor of 0.5 every 8 epochs (a training sketch follows this list).

It is a neural net which is fully trainable using stochastic gradient descent.

The model is trained to maximize the likelihood of the target description sentence given the training image.

  • In neuraltalk2, the dimensionality of the LSTM input vector (the output of the embedding layer) and of the LSTM hidden state are both set to 512. zsdonghao/Image-Captioning uses the same settings.
  • In zsdonghao/Image-Captioning, the author sets vocabulary_size to 12000.
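Below is a training sketch matching the hyperparameters above (SGD, initial learning rate 2.0, halved every 8 epochs; embedding and hidden size 512; vocabulary size 12000). Again this is illustrative: zsdonghao/Image-Captioning is a TensorFlow project, and the dummy data loader and the epoch count used here are placeholders, not values from the repository.

```python
# Training sketch matching the settings above (illustrative; the dummy data,
# epoch count, and use of PyTorch are assumptions, not taken from the repos).
import torch

model = NIC(embed_dim=512, hidden_dim=512, vocab_size=12000)  # sketch above
optimizer = torch.optim.SGD(model.parameters(), lr=2.0)
# Halve the learning rate every 8 epochs (decay factor 0.5).
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=8, gamma=0.5)
# Cross-entropy on the per-step word predictions maximizes the likelihood of
# the target description sentence given the training image.
criterion = torch.nn.CrossEntropyLoss()

# Stand-in for a real DataLoader: batches of (images, integer caption ids).
loader = [(torch.randn(4, 3, 224, 224), torch.randint(0, 12000, (4, 20)))]

for epoch in range(24):  # placeholder epoch count
    for images, captions in loader:
        # Teacher forcing: step 0 sees the image and predicts word 0; step t
        # sees word t-1 and predicts word t, so the target is the full caption.
        logits = model(images, captions[:, :-1])
        loss = criterion(logits.reshape(-1, logits.size(-1)),
                         captions.reshape(-1))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    scheduler.step()
```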

Copyright notice: This is an original post by the blog author. You are welcome to repost it, but please credit the author and link to the original.
