[CVPR2017] Visual Translation Embedding Network for Visual Relation Detection (Paper Notes)

 http://www.ee.columbia.edu/ln/dvmm/publications/17/zhang2017visual.pdf

Visual Translation Embedding Network for Visual Relation Detection. Hanwang Zhang†, Zawlin Kyaw‡, Shih-Fu Chang†, Tat-Seng Chua‡. †Columbia University, ‡National University of Singapore

Highlights

  • Analysis and simplification of the visual relation prediction problem: a visual relation is interpreted as a translation from the subject to the object in feature space, which is effective and intuitive
  • Well-designed experiments, with analysis and comparisons from multiple angles: partitioning of the language space, the gain multi-task training brings to object detection, zero-shot learning, etc.

Existing Work

  • Mature visual detection [16, 35] 
  • Burgeoning visual captioning and question answering [2, 4]
    • directly bridge the visual model (e.g., CNN) and the language model (e.g., RNN), but fall short in modeling and understanding the relationships between objects. 
    • poor generalization ability
  • Visual Relation Detection: a visual relation is represented as a subject-predicate-object triplet
    • Joint models: a relation triplet is treated as a unique class [3, 9, 33, 37]
      • The long-tailed distribution is an inherent defect for scalability.
    • Separate models: the predicate is classified separately from the objects
      • Modeling the large visual variance of predicates is challenging.
    • Language priors to boost relation detection

Main Ideas

Translation Embedding. The main difficulty of visual relation prediction is combinatorial explosion: with N object classes and R predicates there are on the order of N^2·R possible relations (for example, 100 object classes and 70 predicates already give 700,000 possible triplets). A common way to cope with this is:

  • Predict only the predicate rather than the whole relation. Drawback: the visual appearance of the same predicate varies enormously across different subjects and objects.

Inspired by Translation Embedding (TransE), the paper treats a visual relation as a translation from subject to object in feature space: in a low-dimensional relation space, a relation triplet behaves like a vector translation, e.g., person + ride ≈ bike.
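In symbols (my notation, following the paper's description): project the subject and object features x_s, x_o into an r-dimensional relation space with matrices W_s, W_o, and learn one translation vector t_p per predicate; the relation is then modeled roughly as

```latex
W_s\,\mathbf{x}_s + \mathbf{t}_p \;\approx\; W_o\,\mathbf{x}_o,
\qquad \mathbf{x}_s,\mathbf{x}_o \in \mathbb{R}^{M},\;
W_s, W_o \in \mathbb{R}^{r \times M},\;
\mathbf{t}_p \in \mathbb{R}^{r}
```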

 

Knowledge Transfer in Relation. Recognizing objects and recognizing predicates are mutually beneficial. By feeding three kinds of features (class name, location, visual feature) and training the network end to end, the implicit relationships between objects and predicates can be learned by the network.

 

Algorithm
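As a rough sketch of the relation-prediction step (assuming the softmax scoring t_p^T (W_o x_o − W_s x_s) over predicates; all names and sizes below are illustrative, not the authors' code):

```python
import numpy as np

def predicate_scores(x_s, x_o, W_s, W_o, T):
    """Softmax scores over predicates for one (subject, object) pair.

    A sketch of VTransE-style scoring; parameter names and shapes are assumptions.
      x_s, x_o : (M,)   subject / object features
      W_s, W_o : (r, M) projections into the relation space
      T        : (R, r) one translation vector t_p per predicate
    """
    diff = W_o @ x_o - W_s @ x_s       # the translation that maps subject to object
    logits = T @ diff                  # how well each t_p explains that translation
    e = np.exp(logits - logits.max())  # numerically stable softmax over predicates
    return e / e.sum()

# Hypothetical sizes: M-dim object features, r-dim relation space, R predicates
M, r, R = 4200, 500, 70
rng = np.random.default_rng(0)
probs = predicate_scores(rng.normal(size=M), rng.normal(size=M),
                         rng.normal(size=(r, M)), rng.normal(size=(r, M)),
                         rng.normal(size=(R, r)))
print(probs.argmax(), round(float(probs.sum()), 3))
```

This is what sidesteps the N^2·R explosion: only R translation vectors and two projection matrices are learned, rather than one classifier per triplet.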

 

 

Visual Translation Embedding

 Loss function
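A hedged reconstruction of the relation loss (my notation; the exact form in the paper may differ): each predicate is scored by how well its translation vector t_p explains W_o x_o − W_s x_s, and a softmax cross-entropy is taken over all predicates:

```latex
\mathcal{L}_{\text{rel}}
  = \sum_{(s,p,o)\in\mathcal{R}}
    -\log\,
    \frac{\exp\!\big(\mathbf{t}_p^{\top}(W_o\mathbf{x}_o - W_s\mathbf{x}_s)\big)}
         {\sum_{p'}\exp\!\big(\mathbf{t}_{p'}^{\top}(W_o\mathbf{x}_o - W_s\mathbf{x}_s)\big)}
```

The total training objective also includes the object-detection losses, since the whole network is trained end to end (see Knowledge Transfer above).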

 

Feature Extraction Layer

classname + location + visual feature. Different features contribute differently to different kinds of predicates (verbs, prepositions, spatial relations, comparatives).
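A minimal sketch of how such an object feature could be assembled (the names, normalization, and dimensions are my assumptions, not the paper's code):

```python
import numpy as np

def object_feature(class_probs, box, img_wh, visual_feat):
    """Assemble a classname + location + visual feature for one detected object.

    Assumptions: the classeme is the detector's class-probability vector, the
    location is the box normalized by the image size, and the visual feature is
    an RoI feature (e.g. an fc activation).
      class_probs : (N,)  classeme
      box         : (x, y, w, h) in pixels
      img_wh      : (W, H) image size
      visual_feat : (D,)  visual feature
    """
    x, y, w, h = box
    W, H = img_wh
    location = np.array([x / W, y / H, w / W, h / H])  # normalized box coordinates
    return np.concatenate([class_probs, location, visual_feat])

# Hypothetical usage: 100-class detector, 4096-d visual feature -> 4200-d object feature
feat = object_feature(np.random.rand(100), (48, 60, 120, 200), (640, 480), np.random.rand(4096))
print(feat.shape)  # (4200,)
```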

 

Bilinear Interpolation

In order to achieve object-relation knowledge transfer, the relation error should be back-propagated to the object detection network so that it refines the object detections. The paper therefore replaces the RoI pooling layer with bilinear interpolation [18], which is a smooth function of two inputs:
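A reconstruction of the standard bilinear-sampling formula that matches this description (my notation; the two inputs are the full feature map F and a sampling grid G computed from the box):

```latex
V(i,j) \;=\; \sum_{i'=1}^{W}\sum_{j'=1}^{H}
  F(i',j')\; k\!\big(i' - G_x(i,j)\big)\; k\!\big(j' - G_y(i,j)\big),
\qquad k(d) = \max\big(0,\; 1 - |d|\big)
```

Unlike quantized RoI pooling, every term here is differentiable with respect to the box coordinates through G, which is what allows the relation error to refine the object detections.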

 

Results

Translation embedding: +18%

Object detection (from multi-task training): +0.3% ~ +0.6%

State-of-the-art comparison:

  • Phrase Det.: +3% ~ 6%
  • Relation Det.: +1%
  • Retrieval: -1% ~ 2%
  • Zero-shot:
    • Phrase Det.: -0.7% (without language priors)
    • Relation Det.: -1.4%
    • Retrieval: +0.2%

Problems

  • Two objects can be in several relations at once, e.g., person-ride-elephant coexists with person-(shorter than)-elephant, but the method in the paper cannot represent such multiple relations for the same pair
  • No language priors are used; bringing in multi-modal information might give a further boost