Deep Learning Course Notes (12): Matrix Capsule

Date: 2019-11-12


Deep Learning Course Notes (12): Matrix Capsule with EM Routing

2018-02-02 21:21:09

Paper: https://openreview.net/pdf/99b7cb0c78706ad8e91c13a2242bb15b7de325ad.pdf

Blog: https://jhui.github.io/2017/11/14/Matrix-Capsules-with-EM-routing-Capsule-Network/

【Abstract】

　　A capsule is a group of neurons whose outputs represent different properties of the same entity. Each layer of a capsule network contains many capsules. We describe a version of capsules in which each capsule has a logistic unit to represent whether an entity is present and a 4×4 matrix that can learn to represent the relationship between that entity and the viewer (the pose). A capsule in one layer votes for the pose matrices of many different capsules in the layer above by multiplying its own pose matrix by trainable viewpoint-invariant transformation matrices, which can learn to represent part-whole relationships. Each of these votes is weighted by an assignment coefficient. These coefficients are updated iteratively with the Expectation-Maximization algorithm, such that the output of each capsule is routed to a capsule in the layer above that receives a cluster of similar votes. The transformation matrices are trained by backpropagation through the unrolled iterations of EM between each pair of adjacent capsule layers. On the smallNORB data, capsules reduce the test error and also show better resistance to adversarial attacks.

【Introduction】

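　　CNNs are built on one simple fact: a vision system needs to use the same knowledge at all locations in the image. This is achieved by tying the weights of feature detectors, so that a feature learned at one location is available at every other location. Convolutional capsules extend this idea of sharing across locations to include the relationship between an object or object-part and the viewer. The aim of capsules is to make good use of the underlying linearity (a change of viewpoint acts linearly on the pose matrix), both for dealing with viewpoint variation and for improving segmentation decisions.

　　Capsules use high-dimensional coincidence filtering: a familiar object can be detected by looking for agreement between votes for its pose matrix. These votes come from parts that have already been detected. A part produces a vote by multiplying its own pose matrix by a learned transformation matrix that represents the viewpoint-invariant relationship between the part and the whole. As the viewpoint changes, the pose matrices of the parts and the whole change in a coordinated way, so any agreement between votes from different parts persists.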

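　　Finding tight clusters of high-dimensional votes that agree in a mist of irrelevant votes is one way of solving the problem of assigning parts to wholes. This is non-trivial, because we cannot grid the high-dimensional pose space the way the low-dimensional translation space is gridded to allow convolution. To solve this challenge, we use a fast iterative process called "routing-by-agreement" that updates the probability with which a part is assigned to a whole based on the proximity of the vote coming from that part to the votes coming from other parts that are assigned to that whole. This is a powerful segmentation principle: it allows knowledge of familiar shapes to drive segmentation, rather than relying only on low-level cues such as proximity or agreement in color or velocity. An important difference between capsules and standard neural nets is that the activation of a capsule is based on a comparison between multiple incoming pose predictions, whereas in a standard neural net it is based on a comparison between a single activity vector and a learned weight vector.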

【How Capsules Work】

　　Neural nets typically apply a simple non-linearity to the scalar output of a linear filter. They may also use softmax non-linearities to convert a whole vector of logits into a vector of probabilities. Capsules use a much more complicated non-linearity that converts the whole set of activation probabilities and poses of the capsules in one layer into the activation probabilities and poses of capsules in the next layer.

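　　A capsule network consists of several layers of capsules. The set of capsules in layer L is denoted $\Omega_L$. Each capsule has a 4×4 pose matrix, M, and an activation probability, a. These are like the activities in a standard neural net: they depend on the current input and are not stored. Between each capsule i in layer L and each capsule j in layer L+1 there is a 4×4 trainable transformation matrix, Wij. These Wij, together with two learned biases, are the only stored parameters. The pose matrix of capsule i is transformed by Wij to cast a vote Vij = Mi Wij for the pose matrix of capsule j. The poses and activations of the capsules in layer L+1 are computed by a non-linear routing procedure which gets as input Vij and ai for all i, j.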
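
　　The vote computation Vij = Mi Wij can be sketched in a few lines of NumPy. This is only an illustration: the layer sizes, the random poses and weights, and the function name `compute_votes` are assumptions made for the example, not something taken from the paper or the blog.

```python
import numpy as np

def compute_votes(poses, weights):
    """Compute votes V_ij = M_i @ W_ij for every child i and parent j.

    poses:   (n_in, 4, 4)         pose matrices M_i of layer L
    weights: (n_in, n_out, 4, 4)  transformation matrices W_ij
    returns: (n_in, n_out, 4, 4)  votes V_ij
    """
    # Matrix product over the shared inner index b, batched over i and j.
    return np.einsum("iab,ijbc->ijac", poses, weights)

rng = np.random.default_rng(0)
n_in, n_out = 3, 2                          # hypothetical layer sizes
M = rng.normal(size=(n_in, 4, 4))           # stand-in pose matrices
W = rng.normal(size=(n_in, n_out, 4, 4))    # stand-in learned weights
V = compute_votes(M, W)
print(V.shape)  # (3, 2, 4, 4)
```

　　Each child produces one 4×4 vote per parent, so the routing procedure later works on `n_in * n_out` votes per layer pair.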

　　This non-linear procedure is a version of the Expectation-Maximization procedure. It iteratively adjusts the means, variances, and activation probabilities of the capsules in layer L+1, as well as the assignment probabilities between all i, j.

【Using EM for routing-by-agreement】

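　　Suppose we have already decided on the poses and activation probabilities of all the capsules in one layer, and we now want to decide which capsules to activate in the layer above and how to assign each active lower-level capsule to one active higher-level capsule. Each capsule in the higher layer corresponds to a Gaussian, and the pose of each active capsule in the lower layer corresponds to a data-point.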

　　Using the minimum description length principle, we can decide whether or not to activate a higher-level capsule.

　　Choice 0: if we do not activate it, we must pay a fixed cost per data-point for describing the poses of all the lower-level capsules that are assigned to the higher-level capsule.

　　Choice 1: if we do activate the higher-level capsule we must pay a fixed cost for coding its mean and variance and the fact that it is active and then pay additional costs, pro-rated by the assignment probabilities, for describing the discrepancies between the lower-level means and the values predicted for them when the mean of the higher-level capsule is used to predict them via the inverse of the transformation matrix.

　　A much simpler way to compute the cost of describing a datapoint is to use the negative log probability density of that datapoint's vote under the Gaussian distribution fitted by whatever higher-level capsule it gets assigned to. The difference in cost between Choice 0 and Choice 1 is then put through the logistic function on each iteration to determine the higher-level capsule's activation probability.

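　　Using our efficient approximation for Choice 1, the cost of explaining a whole data-point i with a capsule j that has an axis-aligned covariance matrix is simply the sum, over all dimensions, of the cost of explaining each dimension h of the vote Vij. This is simply $-\ln(P^h_{i|j})$, where $P^h_{i|j}$ is the probability density of the h-th component of the vectorized vote Vij under j's Gaussian model for dimension h, which has mean $\mu^h_j$ and variance $(\sigma^h_j)^2$:

$$P^h_{i|j} = \frac{1}{\sqrt{2\pi(\sigma^h_j)^2}} \exp\left(-\frac{(V^h_{ij} - \mu^h_j)^2}{2(\sigma^h_j)^2}\right)$$

　　Summing over all lower-level capsules for a single dimension h, we have:

$$cost^h_j = \sum_i -r_{ij}\ln(P^h_{i|j}) = \frac{\sum_i r_{ij}(V^h_{ij} - \mu^h_j)^2}{2(\sigma^h_j)^2} + \left(\ln(\sigma^h_j) + \frac{\ln(2\pi)}{2}\right)\sum_i r_{ij} = \left(\ln(\sigma^h_j) + k\right)\sum_i r_{ij}$$

where $\sum_i r_{ij}$ is the amount of data assigned to j, k is a constant (here $k = \frac{1}{2} + \frac{\ln(2\pi)}{2}$, because $(\sigma^h_j)^2$ is the $r_{ij}$-weighted variance of the votes), and $V^h_{ij}$ is the value on dimension h of Vij. Turning on j increases the description length for the means of the lower-level capsules assigned to j by the sum of the cost over all dimensions, so we define the activation function of capsule j to be:

$$a_j = \mathrm{sigmoid}\left(\lambda\left(b_j - \sum_h cost^h_j\right)\right)$$

where $-b_j$ represents the cost of describing the mean and variance of capsule j, and $\lambda$ is an inverse temperature parameter.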
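
　　A small numerical sketch may make the cost and the activation more concrete. It assumes the per-dimension cost has the form (ln σ + k) Σ r with k = 1/2 + ln(2π)/2, and that the activation is a logistic function of λ(b_j − Σ_h cost). The function name `capsule_activation`, the values of `b_j` and `lam`, and the random votes are illustrative assumptions only.

```python
import numpy as np

def capsule_activation(votes, r, b_j=0.0, lam=0.01, eps=1e-8):
    """Fit one parent capsule j to its votes and compute its activation.

    votes: (n_in, d) vectorized votes V_ij for a single parent j
    r:     (n_in,)   assignment weights r_ij
    returns (mu, var, cost_h, a_j)
    """
    r_sum = r.sum()
    mu = (r[:, None] * votes).sum(axis=0) / r_sum                      # weighted mean
    var = (r[:, None] * (votes - mu) ** 2).sum(axis=0) / r_sum + eps   # weighted variance
    k = 0.5 + 0.5 * np.log(2 * np.pi)
    cost_h = (0.5 * np.log(var) + k) * r_sum      # ln(sigma) = 0.5 * ln(sigma^2)
    a_j = 1.0 / (1.0 + np.exp(-lam * (b_j - cost_h.sum())))
    return mu, var, cost_h, a_j

rng = np.random.default_rng(0)
base = rng.normal(size=16)            # a pose the children might agree on
noise = rng.normal(size=(5, 16))
r = np.ones(5)                        # every child fully assigned to j
votes_tight = base + 0.01 * noise     # votes that agree closely
votes_loose = base + 1.0 * noise      # votes that disagree
_, _, _, a_tight = capsule_activation(votes_tight, r)
mu_l, var_l, cost_l, a_loose = capsule_activation(votes_loose, r)
print(a_tight > a_loose)
```

　　Tightly clustered votes give a small σ, hence a low cost and a high activation, which is the behaviour the text describes: a parent capsule switches on when its incoming votes agree.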

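　　To settle on the pose parameters and activations of the capsules in layer L+1, we run the EM algorithm for a few iterations (normally 3). This non-linear procedure, applied across a whole capsule layer, is a form of cluster finding with the EM algorithm, so we call it EM Routing.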
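
　　The loop below is a simplified sketch of EM routing between two fully connected capsule layers, not the paper's exact procedure. It assumes a single shared bias `b_j`, a fixed inverse temperature `lam`, uniform initial assignments, and no convolutional weight sharing; the function name `em_routing` and the toy data are hypothetical.

```python
import numpy as np

K = 0.5 + 0.5 * np.log(2 * np.pi)  # constant k in cost = (ln(sigma) + k) * sum_i r_ij

def em_routing(votes, a_in, b_j=0.0, lam=0.01, n_iters=3, eps=1e-8):
    """Route n_in child capsules to n_out parent capsules.

    votes: (n_in, n_out, d) vectorized votes V_ij (d = 16 for 4x4 poses)
    a_in:  (n_in,)          activation probabilities of the children
    returns parent means (n_out, d), parent activations (n_out,),
            assignment probabilities r (n_in, n_out)
    """
    n_in, n_out, _ = votes.shape
    r = np.full((n_in, n_out), 1.0 / n_out)        # start with uniform assignments
    for _ in range(n_iters):
        # M-step: refit each parent's Gaussian from its weighted votes.
        rw = r * a_in[:, None]
        r_sum = rw.sum(axis=0) + eps                                    # (n_out,)
        mu = (rw[:, :, None] * votes).sum(axis=0) / r_sum[:, None]      # (n_out, d)
        var = (rw[:, :, None] * (votes - mu) ** 2).sum(axis=0) / r_sum[:, None] + eps
        cost = (0.5 * np.log(var) + K) * r_sum[:, None]                 # (n_out, d)
        a_out = 1.0 / (1.0 + np.exp(-lam * (b_j - cost.sum(axis=1))))
        # E-step: reassign each child in proportion to a_j * p(V_ij | parent j).
        log_p = -0.5 * (np.log(2 * np.pi * var) + (votes - mu) ** 2 / var).sum(axis=2)
        log_post = np.log(a_out + eps) + log_p
        log_post -= log_post.max(axis=1, keepdims=True)   # numerical stability
        r = np.exp(log_post)
        r /= r.sum(axis=1, keepdims=True)
    return mu, a_out, r

rng = np.random.default_rng(0)
n_in, d = 5, 16
target = rng.normal(size=d)                     # pose the votes for parent 0 agree on
votes = rng.normal(size=(n_in, 2, d))           # votes for parent 1 stay scattered
votes[:, 0] = target + 0.01 * rng.normal(size=(n_in, d))
mu, a_out, r = em_routing(votes, a_in=np.ones(n_in))
print(r.round(3))
```

　　In this toy setup every child's vote for parent 0 lands near the same pose, so the assignments concentrate on parent 0 and its mean recovers that pose, while parent 1 receives almost no data.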

【The Capsules Architecture】

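Spread Loss:

　　To make training as insensitive as possible to the model's hyper-parameters, we use a "spread loss" that directly maximizes the gap between the activation of the target class, $a_t$, and the activation of the other classes. If the activation of a wrong class, $a_i$, is closer than the margin, m, to $a_t$, it is penalized by the squared distance to the margin:

$$L_i = \left(\max(0, m - (a_t - a_i))\right)^2, \quad L = \sum_{i \neq t} L_i$$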
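
　　As a sketch, the spread loss can be written as follows, assuming the form L = Σ_{i≠t} max(0, m − (a_t − a_i))². The function name `spread_loss`, the margin value 0.2 and the example activations are illustrative choices.

```python
import numpy as np

def spread_loss(activations, target, margin=0.2):
    """Sum of squared margin violations over all wrong classes.

    activations: (n_classes,) class capsule activations a_i
    target:      index t of the correct class
    """
    a_t = activations[target]
    losses = np.maximum(0.0, margin - (a_t - activations)) ** 2
    losses[target] = 0.0          # the target class itself is not penalized
    return losses.sum()

a = np.array([0.9, 0.1, 0.8])     # hypothetical activations; class 0 is the target
loss = spread_loss(a, target=0)
print(loss)
```

　　Here class 1 is further than the margin below the target and contributes nothing, while class 2 is only 0.1 below it and contributes roughly (0.2 − 0.1)² = 0.01.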

Blog: 「Understanding Matrix capsules with EM Routing (Based on Hinton's Capsule Networks)」

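We first cover matrix capsules, which use EM (Expectation Maximization) routing to classify images with different viewpoints.

CNN Challenges:

　　Standard CNNs do not handle spatial relationships well. Here we discuss how capsule networks address these shortcomings.

　　Conceptually, a CNN trains neurons to handle different feature orientations (0°, 20°, -20°), with a top-level face detection neuron.

　　To solve these problems, we add more convolution layers and feature maps. However, this approach tends to memorize the dataset rather than generalize a solution. It requires a large amount of training data to cover the different viewpoints and avoid overfitting. Yet even a child can recognize different digits from very few examples. Our existing deep learning models, including CNNs, are inefficient in utilizing datapoints.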

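Adversaries:

　　CNNs are also vulnerable to adversaries that simply move, rotate, or resize individual features. A CNN would still consider the image below to be a human face:

　　In addition, CNNs are helpless against adversarial examples. A typical case is shown below: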

Capsule:

　　A capsule captures the likeliness of a feature and its variants. So a capsule not only detects a feature, it is also trained to learn and detect the variants of that feature.

　　For example, the same network layer can also detect a face that has been rotated clockwise.

Equivariance:

　　Equivariance is the detection of objects that can transform to each other.

Matrix capsule:

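　　A matrix capsule captures the activation (likeliness) similar to that of a neuron, but also captures a 4×4 pose matrix. In computer graphics, a pose matrix defines the translation and rotation of an object, which is equivalent to a change in the viewpoint of the object.

　　For example, the objects in the second row are the same as those in the first row and differ only in viewpoint. In a matrix capsule, we train a model to capture the pose information (orientation, azimuths etc.). Of course, as with other deep learning methods, this is our intention and it is not guaranteed.

　　The objective of EM (Expectation Maximization) routing is to group capsules to form a part-whole relationship using a clustering technique (EM).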

　　In machine learning, we use EM clustering to cluster datapoints into Gaussian distributions.

　　For example, we cluster the datapoints into two clusters modeled by two Gaussian distributions. We then represent the datapoints by the corresponding Gaussian distribution:

　　In the face detection example, each of the mouth, eyes and nose detection capsules in the lower layer makes predictions (votes) on the pose matrices of its possible parent capsules. Each vote is a predicted value for a parent capsule’s pose matrix, and it is computed by multiplying its own pose matrix with a transformation matrix that we learn from the training data.

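　　We apply EM routing to group several capsules into a parent capsule:

　　That is, if the nose, mouth and eye capsules all vote for a similar pose matrix value, we cluster them together to form a parent capsule: the face capsule.

　　A higher-level feature (a face) is detected by looking for agreement between the votes from the capsules one layer below. We use EM routing to cluster capsules that have close proximity of the corresponding votes.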

Gaussian mixture model & Expectation Maximization (EM)

　　Let us first look at what EM is. A Gaussian mixture model clusters datapoints into a mixture of Gaussian distributions. Below, we cluster the datapoints into a yellow cluster and a red cluster:

　　For a detailed introduction to the EM (Expectation Maximization) algorithm, see: http://www.javashuo.com/article/p-npmrcusy-ch.html
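
　　For intuition, here is a minimal sketch of EM for a two-component Gaussian mixture in one dimension. The synthetic data, the initialisation at the minimum and maximum of the data, and the function name `gmm_em_1d` are assumptions made for this example.

```python
import numpy as np

def gmm_em_1d(x, n_iters=50):
    """Fit a two-component 1-D Gaussian mixture with EM."""
    mu = np.array([x.min(), x.max()], dtype=float)   # crude initial means
    var = np.full(2, x.var())
    pi = np.full(2, 0.5)                             # mixing weights
    for _ in range(n_iters):
        # E-step: responsibility of each component for each datapoint.
        dens = pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var) / np.sqrt(2 * np.pi * var)
        resp = dens / dens.sum(axis=1, keepdims=True)
        # M-step: refit weights, means and variances from the responsibilities.
        n_k = resp.sum(axis=0)
        mu = (resp * x[:, None]).sum(axis=0) / n_k
        var = (resp * (x[:, None] - mu) ** 2).sum(axis=0) / n_k + 1e-6
        pi = n_k / len(x)
    return mu, var, pi, resp

rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(-2.0, 0.5, 200), rng.normal(3.0, 0.5, 200)])
mu, var, pi, resp = gmm_em_1d(x)
print(mu.round(2), pi.round(2))
```

　　The E-step plays the role of the assignment probabilities in EM routing, and the M-step plays the role of recomputing each parent capsule's mean and variance.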

Using EM for Routing-by-Agreement

　　A higher-level feature (e.g. a face) is detected through the votes of the capsules in the layer below. A vote for a parent capsule is computed by multiplying the pose matrix of the capsule with a viewpoint-invariant transformation.

　　The model not only learns what a face is composed of, it also makes sure that the pose information of the parent capsule matches that of its sub-components after some transformation.

　　Here is the visualization of routing-by-agreement. We group capsules with similar votes after transforming the pose with a viewpoint-invariant transformation.

　　Even when the viewpoint changes, the pose matrices and the votes change in a coordinated way. In our example, the locations of the votes may move from the red dots to the pink dots when the face is rotated. Nevertheless, EM routing is based on proximity, so it can still cluster the same children capsules together. Hence, the transformation matrices are the same for every viewpoint of the object: they are viewpoint invariant. We need only one set of transformation matrices and one parent capsule for different viewpoints.
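
　　The viewpoint-invariance argument can be checked numerically. The sketch below assumes a viewpoint change acts on every part pose by left-multiplication with the same matrix T, so that each vote T Mi Wi = T Vi moves in the same way. The three "parts", their weights and the 20° rotation are hypothetical.

```python
import numpy as np

def rotation_z(theta):
    """4x4 homogeneous rotation about the z axis."""
    c, s = np.cos(theta), np.sin(theta)
    T = np.eye(4)
    T[:2, :2] = [[c, -s], [s, c]]
    return T

rng = np.random.default_rng(0)
parent_pose = rng.normal(size=(4, 4))                   # hypothetical face pose
W = np.eye(4) + 0.1 * rng.normal(size=(3, 4, 4))        # one fixed W_i per part
# Pick part poses so that every part votes for the parent pose: M_i W_i = parent_pose.
M = np.stack([parent_pose @ np.linalg.inv(W_i) for W_i in W])

votes = M @ W                        # votes before the viewpoint change
T = rotation_z(np.deg2rad(20))
votes_rot = (T @ M) @ W              # same W after rotating every part pose

spread_before = np.abs(votes - votes.mean(axis=0)).max()
spread_after = np.abs(votes_rot - votes_rot.mean(axis=0)).max()
print(spread_before, spread_after)
```

　　The votes move to a new location, T times the parent pose, but they still coincide, so the same transformation matrices keep producing agreement after the viewpoint change.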