[CVPR2017] Visual Translation Embedding Network for Visual Relation Detection (Paper Notes)

 http://www.ee.columbia.edu/ln/dvmm/publications/17/zhang2017visual.pdf

Visual Translation Embedding Network for Visual Relation Detection. Hanwang Zhang†, Zawlin Kyaw‡, Shih-Fu Chang†, Tat-Seng Chua‡. †Columbia University, ‡National University of Singapore

Highlights

  • Analysis and simplification of the visual relation prediction problem: a visual relation is interpreted as a translation from the subject to the object in feature space, which is effective and intuitive
  • Well-designed experiments, with analysis and comparisons from multiple angles: partitioning of the language space, the gain multi-task training brings to object detection, zero-shot learning, etc.

Existing Work

  • Mature visual detection [16, 35] 
  • Burgeoning visual captioning and question answering [2, 4]
    • directly bridge the visual model (e.g., CNN) and the language model (e.g., RNN), but fall short in modeling and understanding the relationships between objects. 
    • poor generalization ability
  • Visual Relation Detection: a visual relation is represented as a subject-predicate-object triplet
    • Joint models: a relation triplet is treated as a unique class [3, 9, 33, 37]
      • The long-tailed distribution is an inherent defect for scalability.
    • Separate models: the predicate is classified separately from the objects
      • Modeling the large visual variance of predicates is challenging.
    • Language priors to boost relation detection

Main Ideas

Translation Embedding. The main difficulty of visual relation prediction is combinatorial explosion: with N object classes and R predicates there are on the order of N^2·R possible relations (for example, 100 object classes and 70 predicates already give 700,000 possible triplets). A common way to cope with this is:

  • Predict only the predicate rather than the whole relation. Drawback: the visual appearance of the same predicate varies enormously across different subjects and objects.

Inspired by Translation Embedding (TransE), the paper treats a visual relation as a translation from subject to object in feature space: in a low-dimensional relation space, a relation triplet behaves like a vector translation, e.g., person + ride ≈ bike.
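In symbols (my notation, following the paper's description): project the subject and object features x_s, x_o into an r-dimensional relation space with matrices W_s, W_o, and learn one translation vector t_p per predicate; the relation is then modeled roughly as

```latex
W_s\,\mathbf{x}_s + \mathbf{t}_p \;\approx\; W_o\,\mathbf{x}_o,
\qquad \mathbf{x}_s,\mathbf{x}_o \in \mathbb{R}^{M},\;
W_s, W_o \in \mathbb{R}^{r \times M},\;
\mathbf{t}_p \in \mathbb{R}^{r}
```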

 

Knowledge Transfer in Relation. Recognizing objects and recognizing predicates are mutually beneficial. By feeding three kinds of features (class name, location, visual feature) and training the network end to end, the implicit relationships between objects and predicates can be learned by the network.

 

Algorithm
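As a rough sketch of the relation-prediction step (assuming the softmax scoring t_p^T (W_o x_o − W_s x_s) over predicates; all names and sizes below are illustrative, not the authors' code):

```python
import numpy as np

def predicate_scores(x_s, x_o, W_s, W_o, T):
    """Softmax scores over predicates for one (subject, object) pair.

    A sketch of VTransE-style scoring; parameter names and shapes are assumptions.
      x_s, x_o : (M,)   subject / object features
      W_s, W_o : (r, M) projections into the relation space
      T        : (R, r) one translation vector t_p per predicate
    """
    diff = W_o @ x_o - W_s @ x_s       # the translation that maps subject to object
    logits = T @ diff                  # how well each t_p explains that translation
    e = np.exp(logits - logits.max())  # numerically stable softmax over predicates
    return e / e.sum()

# Hypothetical sizes: M-dim object features, r-dim relation space, R predicates
M, r, R = 4200, 500, 70
rng = np.random.default_rng(0)
probs = predicate_scores(rng.normal(size=M), rng.normal(size=M),
                         rng.normal(size=(r, M)), rng.normal(size=(r, M)),
                         rng.normal(size=(R, r)))
print(probs.argmax(), round(float(probs.sum()), 3))
```

This is what sidesteps the N^2·R explosion: only R translation vectors and two projection matrices are learned, rather than one classifier per triplet.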

 

 

Visual Translation Embedding

 Loss function
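A hedged reconstruction of the relation loss (my notation; the exact form in the paper may differ): each predicate is scored by how well its translation vector t_p explains W_o x_o − W_s x_s, and a softmax cross-entropy is taken over all predicates:

```latex
\mathcal{L}_{\text{rel}}
  = \sum_{(s,p,o)\in\mathcal{R}}
    -\log\,
    \frac{\exp\!\big(\mathbf{t}_p^{\top}(W_o\mathbf{x}_o - W_s\mathbf{x}_s)\big)}
         {\sum_{p'}\exp\!\big(\mathbf{t}_{p'}^{\top}(W_o\mathbf{x}_o - W_s\mathbf{x}_s)\big)}
```

The total training objective also includes the object-detection losses, since the whole network is trained end to end (see Knowledge Transfer above).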

 

Feature Extraction Layer

classname + location + visual feature. Different features contribute differently to different kinds of predicates (verbs, prepositions, spatial relations, comparatives).
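A minimal sketch of how such an object feature could be assembled (the names, normalization, and dimensions are my assumptions, not the paper's code):

```python
import numpy as np

def object_feature(class_probs, box, img_wh, visual_feat):
    """Assemble a classname + location + visual feature for one detected object.

    Assumptions: the classeme is the detector's class-probability vector, the
    location is the box normalized by the image size, and the visual feature is
    an RoI feature (e.g. an fc activation).
      class_probs : (N,)  classeme
      box         : (x, y, w, h) in pixels
      img_wh      : (W, H) image size
      visual_feat : (D,)  visual feature
    """
    x, y, w, h = box
    W, H = img_wh
    location = np.array([x / W, y / H, w / W, h / H])  # normalized box coordinates
    return np.concatenate([class_probs, location, visual_feat])

# Hypothetical usage: 100-class detector, 4096-d visual feature -> 4200-d object feature
feat = object_feature(np.random.rand(100), (48, 60, 120, 200), (640, 480), np.random.rand(4096))
print(feat.shape)  # (4200,)
```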

 

Bilinear Interpolation

In order to achieve object-relation knowledge transfer, the relation error should be back-propagated to the object detection network so that it refines the object detections. The paper therefore replaces the RoI pooling layer with bilinear interpolation [18], which is a smooth function of two inputs:
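A reconstruction of the standard bilinear-sampling formula that matches this description (my notation; the two inputs are the full feature map F and a sampling grid G computed from the box):

```latex
V(i,j) \;=\; \sum_{i'=1}^{W}\sum_{j'=1}^{H}
  F(i',j')\; k\!\big(i' - G_x(i,j)\big)\; k\!\big(j' - G_y(i,j)\big),
\qquad k(d) = \max\big(0,\; 1 - |d|\big)
```

Unlike quantized RoI pooling, every term here is differentiable with respect to the box coordinates through G, which is what allows the relation error to refine the object detections.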

 

Results

Translation embedding: +18%

Object detection (from multi-task training): +0.3% ~ +0.6%

State-of-the-art comparison:

  • Phrase Det.: +3% ~ 6%
  • Relation Det.: +1%
  • Retrieval: -1% ~ 2%
  • Zero-shot:
    • Phrase Det.: -0.7% (without language priors)
    • Relation Det.: -1.4%
    • Retrieval: +0.2%

Problems

  • Two objects can be in several relations at once, e.g., person-ride-elephant coexists with person-(shorter than)-elephant, but the method in the paper cannot represent such multiple relations for the same pair
  • No language priors are used; bringing in multi-modal information might give a further boost