Unpaired/Partially/Unsupervised Image Captioning

This post covers the following three papers:

Unpaired Image Captioning by Language Pivoting (ECCV 2018)

Show, Tell and Discriminate: Image Captioning by Self-retrieval with Partially Labeled Data (ECCV 2018)

Unsupervised Image Captioning (CVPR 2019)

 

1. Unpaired Image Captioning by Language Pivoting (ECCV 2018)

Abstract

The authors propose a language-pivoting approach to address the unpaired image captioning problem, i.e. image captioning without paired images and captions.

"Our method can effectively capture the characteristics of an image captioner from the pivot language (Chinese) and align it to the target language (English) using another pivot-target (Chinese-English) sentence parallel corpus."

Introduction

Because the encoder-decoder architecture needs a large number of image-caption pairs for training, and such large-scale labeled data are usually hard to obtain, researchers have started to explore unpaired data and semi-supervised methods that exploit paired labeled data from other domains. In this paper, the authors bridge the gap between the input image and the target-language (English) caption by using the source language, Chinese, as a pivot language. This requires two paired datasets, image-Chinese captions and Chinese-English sentences, so that image-to-English caption generation can be learned without any paired image-English data.

The authors trace this idea to pivot-based machine translation, which usually proceeds in two steps: first translate the source language into the pivot language, then translate the pivot language into the target language. However, image captioning differs from machine translation in several ways: 1. the sentence style and vocabulary distribution of the image-Chinese captions and of the Chinese-English corpus differ considerably; 2. errors made in the source-to-pivot step propagate into the pivot-to-target step.

Use AIC-ICC and AIC-MT as the training datasets and two datasets (MSCOCO and Flickr30K) as the validation datasets

i: source image, x: pivot-language sentence, y: target-language sentence, y_hat: ground-truth captions in the target language (here y_hat consists of captions randomly sampled from the MSCOCO training set, used to train the autoencoder)
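To make the notation concrete, one standard way to write the pivot factorization (an assumption for illustration, not copied from the paper) is to train an image-to-pivot captioner p(x|i) on the image-Chinese pairs and a pivot-to-target translator p(y|x) on the Chinese-English corpus, then compose them:

$$
p(y \mid i) \;=\; \sum_{x} p(y \mid x)\, p(x \mid i) \;\approx\; p(y \mid \hat{x}), \qquad \hat{x} = \arg\max_{x}\, p(x \mid i)
$$

The approximation on the right is the usual two-step pipeline: decode the most likely pivot sentence, then translate it into the target language.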

 

The idea of this paper is easy to grasp; the difficulty lies in tying Image-to-Pivot and Pivot-to-Target together and in overcoming the mismatch in language style and vocabulary distribution between the two datasets. A minimal sketch of the chained inference pipeline follows.
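The sketch below only illustrates the two-step inference path; the function names (image_to_pivot, pivot_to_target) are hypothetical stand-ins for the paper's trained models, not its actual interface.

```python
# Minimal sketch of pivot-based caption inference (hypothetical interfaces, not the paper's code).
from typing import Callable

def pivot_caption(
    image_feature,                                 # precomputed feature of the input image i
    image_to_pivot: Callable[[object], str],       # p(x | i): generates a Chinese caption
    pivot_to_target: Callable[[str], str],         # p(y | x): translates Chinese into English
) -> str:
    """Generate an English caption without any image-English training pairs."""
    pivot_sentence = image_to_pivot(image_feature)      # step 1: image -> pivot language
    target_sentence = pivot_to_target(pivot_sentence)   # step 2: pivot -> target language
    return target_sentence
```

Note that any mistake made in step 1 is passed on unchanged to step 2, which is exactly the error-propagation issue mentioned above.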

2. Show, Tell and Discriminate: Image Captioning by Self-retrieval with Partially Labeled Data (ECCV 2018)

The authors point out that existing captioning models tend to copy sentences or phrases from the training set, so the generated captions are usually generic and templated, lacking the ability to produce discriminative descriptions.

GAN-based captioning models can improve sentence diversity, but they tend to perform worse on the standard evaluation metrics.

The authors propose to couple a Self-retrieval Module with the Captioning Module in order to generate discriminative captions (see the sketch below).
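The paper's exact formulation is not reproduced here; a common way to realize such a self-retrieval signal is to embed captions and images into a shared space and reward the captioner when a caption retrieves its own image within the batch. The sketch below follows that idea; the class and attribute names (SelfRetrievalModule, caption_embed, image_embed) and the temperature are illustrative, not the paper's code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SelfRetrievalModule(nn.Module):
    """Illustrative self-retrieval head: embeds captions and images into a shared space
    and checks whether each caption can retrieve its own image within the batch."""

    def __init__(self, text_dim: int, image_dim: int, joint_dim: int = 512):
        super().__init__()
        self.caption_embed = nn.Linear(text_dim, joint_dim)
        self.image_embed = nn.Linear(image_dim, joint_dim)

    def forward(self, caption_feats: torch.Tensor, image_feats: torch.Tensor) -> torch.Tensor:
        c = F.normalize(self.caption_embed(caption_feats), dim=-1)  # (B, joint_dim)
        v = F.normalize(self.image_embed(image_feats), dim=-1)      # (B, joint_dim)
        sim = c @ v.t()                                             # (B, B) caption-to-image scores
        # Retrieval loss: each caption should score highest with its own image.
        targets = torch.arange(sim.size(0), device=sim.device)
        return F.cross_entropy(sim / 0.07, targets)                 # 0.07: illustrative temperature
```

The negative of this retrieval loss can then act as a reward for the captioning module (e.g. through REINFORCE), pushing the generator toward captions that are discriminative for their own image.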

 

3. Unsupervised Image Captioning (CVPR 2019)

This is a genuinely unsupervised approach to image captioning: it does not rely on any labeled image-sentence pairs.

Compared with unsupervised machine translation, unsupervised image captioning is more challenging because images and text are two different modalities with a large gap between them.

The model consists of an image encoder, a sentence generator, and a sentence discriminator.

Encoder:

An ordinary image encoder suffices; the authors use Inception-V4.

Generator:

An LSTM-based decoder.

Discriminator:

Also implemented with an LSTM; it is used to distinguish whether a partial sentence is a real sentence from the corpus or one generated by the model. A minimal sketch of the three components follows.
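A minimal PyTorch sketch of the three components, assuming image features are precomputed by a CNN such as Inception-V4 (whose pooled feature is 1536-dimensional); the class names and sizes are illustrative, not the paper's code.

```python
import torch
import torch.nn as nn

class ImageEncoder(nn.Module):
    """Projects a precomputed CNN feature (e.g. Inception-V4 pooled output) into the model space."""
    def __init__(self, feat_dim: int = 1536, hidden_dim: int = 512):
        super().__init__()
        self.fc = nn.Linear(feat_dim, hidden_dim)

    def forward(self, feats):
        return self.fc(feats)                       # (B, hidden_dim) image code

class SentenceGenerator(nn.Module):
    """LSTM decoder that emits a word distribution at every step, conditioned on the image code."""
    def __init__(self, vocab_size: int, hidden_dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.out = nn.Linear(hidden_dim, vocab_size)

    def forward(self, image_code, tokens):
        h0 = image_code.unsqueeze(0)                # condition via the initial hidden state
        c0 = torch.zeros_like(h0)
        hidden, _ = self.lstm(self.embed(tokens), (h0, c0))
        return self.out(hidden)                     # (B, T, vocab_size) word logits

class SentenceDiscriminator(nn.Module):
    """LSTM that scores, at each step, whether the partial sentence so far looks like a real corpus sentence."""
    def __init__(self, vocab_size: int, hidden_dim: int = 512):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_dim)
        self.lstm = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.score = nn.Linear(hidden_dim, 1)

    def forward(self, tokens):
        hidden, _ = self.lstm(self.embed(tokens))
        return torch.sigmoid(self.score(hidden))    # (B, T, 1) per-step real/fake probability
```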

 

Training:

Since there are no paired image-sentence data, the model cannot be trained in a supervised way, so the authors design three objectives to realize unsupervised image captioning:

Adversarial Caption Generation:

Visual Concept Distillation:

Bi-directional Image-Sentence Reconstruction:

Image Reconstruction: reconstruct the image features instead of the full image

Sentence Reconstruction: the discriminator can encode a sentence and project it into the common latent space, which can be viewed as an image representation related to the given sentence; the generator can then reconstruct the sentence from the obtained representation (see the sketch below).
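A rough sketch of the two reconstruction terms, reusing the components sketched above; the projection head and the exact loss forms are assumptions for illustration, not the paper's equations.

```python
import torch.nn.functional as F

def image_reconstruction_loss(image_code, sentence_latent, project):
    """Reconstruct the image *features* (not the full image): a projection head maps the
    latent code of the generated caption back to the image feature space."""
    return F.mse_loss(project(sentence_latent), image_code)

def sentence_reconstruction_loss(word_logits, target_tokens, pad_id: int = 0):
    """The discriminator encodes a corpus sentence into the common latent space and the
    generator decodes it back; cross-entropy against the original words."""
    return F.cross_entropy(
        word_logits.reshape(-1, word_logits.size(-1)),
        target_tokens.reshape(-1),
        ignore_index=pad_id,
    )
```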

Integration:

Generator:

Discriminator:

 

Initialization

It is challenging to adequately train the image captioning model from scratch with only unpaired data, so an initialization pipeline is needed to pre-train the generator and the discriminator.

For the generator:

First, build a concept dictionary consisting of the object classes in the OpenImages dataset.

Second, train a concept-to-sentence (con2sen) model using the sentence corpus only.

Third, detect the visual concepts in each image using an existing visual concept detector, and use the detected concepts and the concept-to-sentence model to generate a pseudo caption for each image.

Fourth, train the generator with the pseudo image-caption pairs (a sketch of this pipeline follows).
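A hypothetical sketch of steps three and four; the detector, the con2sen model and the training routine are assumed to exist as callables and are not the paper's code.

```python
from typing import Callable, Dict, List

def build_pseudo_pairs(
    images: Dict[str, object],                        # image_id -> precomputed image feature
    concept_detector: Callable[[object], List[str]],  # returns detected concept words per image
    con2sen: Callable[[List[str]], str],              # concept-to-sentence model trained on the corpus
) -> Dict[str, str]:
    """Detect visual concepts and turn them into a pseudo caption for every image."""
    pseudo_captions = {}
    for image_id, feature in images.items():
        concepts = concept_detector(feature)          # e.g. ["dog", "frisbee", "grass"]
        pseudo_captions[image_id] = con2sen(concepts)
    return pseudo_captions

# The resulting (image, pseudo caption) pairs are then used to train the generator
# exactly as if they were ordinary supervised image-caption pairs.
```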

 

The discriminator is initialized by training an adversarial sentence generation model on the sentence corpus; a sketch of the corresponding discriminator objective is given below.
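Under the assumption that this pretraining is a standard text GAN on the corpus (the paper's exact setup is not reproduced here), the discriminator side of the objective could look as follows, reusing the SentenceDiscriminator sketched earlier:

```python
import torch
import torch.nn.functional as F

def discriminator_pretrain_loss(disc, real_tokens, fake_tokens):
    """Adversarial sentence-generation pretraining (sketch): corpus sentences should be
    scored as real, sentences sampled from the text generator as fake."""
    real_scores = disc(real_tokens)   # (B, T, 1) per-step real/fake probabilities
    fake_scores = disc(fake_tokens)
    real_loss = F.binary_cross_entropy(real_scores, torch.ones_like(real_scores))
    fake_loss = F.binary_cross_entropy(fake_scores, torch.zeros_like(fake_scores))
    return real_loss + fake_loss
```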
