A Summary of A-Softmax and a Comparison with L-Softmax: SphereFace

Date: 2019-11-06

Tags: softmax, summary, comparison, sphereface


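$\quad$[Introduction] SphereFace ranked first in recognition rate on the MegaFace dataset in 2017. The A-Softmax Loss it uses has a clear geometric definition and achieves good results on relatively small datasets. The paper that presents this work is SphereFace: Deep Hypersphere Embedding for Face Recognition. This post is a short summary of that paper.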

1. Derivation of A-Softmax

Recall the softmax posterior probabilities for binary classification:


$$ \begin{split} p_1 = \frac{\exp({W}_1^Tx+b_1)}{\exp({W}_1^Tx+b_1)+\exp({W}_2^Tx+b_2)} \cr p_2 = \frac{\exp({W}_2^Tx+b_2)}{\exp({W}_1^Tx+b_1)+\exp({W}_2^Tx+b_2)} \cr \end{split} \tag{1.1} $$

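$\quad$The decision boundary lies where $p_1 = p_2$, so the decision surface is $(W_1-W_2)x+b_1-b_2=0$. We can write $W_i^Tx+b_i$ as $\|W_i^T\|\cdot\|x\|\cos(\theta_i)+b_i$, where $\theta_i$ is the angle between $W_i$ and $x$. If we normalize $W_i$ and set the bias $b_i$ to zero ($\|W_i\|=1$, $b_i=0$), then $p_1 = p_2$ gives $\cos(\theta_1)-\cos(\theta_2)=0$. From this we can see that if an input feature $x_i$ belongs to class $y_i$, then $\theta_{y_i}$ should be smaller than the angle to every other class; in other words, $W_{y_i}$ should lie closer to $x_i$ in the vector space.
$\quad$We use the Softmax Loss. For an input $x_i$, the Softmax Loss $L_i$ is defined as: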

$$ \begin{split} L_i &= -\log\left(\frac{\exp(W_{y_i}^Tx_i+b_{y_i})}{\sum_j\exp(W_{j}^Tx_i+b_{j})}\right) \cr &= -\log\left(\frac{\exp(\|W_{y_i}^T\|\cdot\|x_i\|\cos(\theta_{y_i,i})+b_{y_i})}{\sum_j\exp(\|W_{j}^T\|\cdot\|x_i\|\cos(\theta_{j,i})+b_{j})}\right) \cr \end{split} \tag{1.2} $$

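In $(1.2)$, $j\in[1,K]$, where $K$ is the total number of classes. With the constraints introduced above, $\|W_i\|=1$ and $b_i=0$, we obtain the modified loss function (what the paper calls the modified softmax loss):
$$ L_{modified} = \frac{1}{N}\sum_i-\log\left(\frac{\exp(\|x_i\|\cos(\theta_{y_i,i}))}{\sum_j\exp(\|x_i\|\cos(\theta_{j,i}))}\right) \tag{1.3} $$
$\quad$In the binary case, a sample is assigned to class 1 when $\cos(\theta_1)>\cos(\theta_2)$. But class 1 and class 2 share the same decision surface, so the margin between them is very small and, intuitively, the two classes are not clearly separated. To force a clear margin between them, we can use two decision surfaces: for class 1, $\cos(m\theta_1)=\cos(\theta_2)$; for class 2, $\cos(\theta_1)=\cos(m\theta_2)$, where $m\geq2, m\in N$. Requiring $m$ to be an integer keeps the computation convenient, because the multiple-angle formulas apply. $m\geq2$ means that a sample's angle to its own class must be at least $m$ times smaller than its angle to any other class ($m\theta_1<\theta_2$ for class 1). If $m=1$, classes 1 and 2 share a single decision surface; if $m\geq2$, they have two separate decision surfaces, and how far apart these lie is discussed under the properties below. Combining this with $L_{modified}$ gives the A-Softmax Loss directly: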

$$ L_{ang} = \frac{1}{N}\sum_i-\log\left(\frac{\exp(\|x_i\|\cos(m\theta_{y_i,i}))}{\exp(\|x_i\|\cos(m\theta_{y_i,i}))+\sum_{j\neq y_i}\exp(\|x_i\|\cos(\theta_{j,i}))}\right) \tag{1.4} $$

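Here $\theta_{y_i,i}\in[0, \frac{\pi}{m}]$. Outside this range it can happen that $m\theta_{y_i,i}>\theta_{j,i}$ for some $j\neq y_i$ (so the sample should not be assigned to class $y_i$) while $\cos(m\theta_1)>\cos(\theta_2)$ still holds, because $\cos(m\theta)$ is no longer monotonic beyond $\frac{\pi}{m}$ and the loss is written in terms of the cosine. To avoid this problem, we replace $\cos(m\theta_{y_i,i})$ with a new function $\psi(\theta_{y_i,i})=(-1)^k\cos(m\theta_{y_i,i})-2k$, where $\theta_{y_i,i}\in[\frac{k\pi}{m},\frac{(k+1)\pi}{m}]$ and $k\in[0,m-1]$. With this definition $\psi$ decreases monotonically in $\theta_{y_i,i}$: if $m\theta_{y_i,i}>\theta_{j,i}$ for $j\neq y_i$, then $\psi(\theta_{y_i,i})<\cos(\theta_{j,i})$, and vice versa. This removes the problem, so we have: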

$$ L_{ang} = \frac{1}{N}\sum_i-\log\left(\frac{\exp(\|x_i\|\psi(\theta_{y_i,i}))}{\exp(\|x_i\|\psi(\theta_{y_i,i}))+\sum_{j\neq y_i}\exp(\|x_i\|\cos(\theta_{j,i}))}\right) \tag{1.5} $$
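
The piecewise function $\psi$ is easy to get wrong at the interval boundaries, so a small sketch may help. The C++ below is only an illustration: the names psi and psi_is_monotone are made up for this post and do not come from the paper or the authors' code. It evaluates $\psi(\theta)=(-1)^k\cos(m\theta)-2k$ on $[0,\pi]$ and checks on a grid that the result never increases.

```cpp
#include <cassert>
#include <cmath>

// Monotone surrogate for cos(m * theta) on [0, pi]:
//   psi(theta) = (-1)^k * cos(m * theta) - 2k,  theta in [k*pi/m, (k+1)*pi/m].
// The function name and signature are illustrative, not taken from the paper.
double psi(double theta, int m) {
    assert(m >= 1);
    const double pi = std::acos(-1.0);
    // Index of the interval [k*pi/m, (k+1)*pi/m] that contains theta.
    int k = static_cast<int>(std::floor(theta * m / pi));
    if (k < 0) k = 0;
    if (k > m - 1) k = m - 1;  // theta == pi belongs to the last interval
    const double sign = (k % 2 == 0) ? 1.0 : -1.0;
    return sign * std::cos(m * theta) - 2.0 * k;
}

// Returns true if psi is non-increasing on a uniform grid over [0, pi].
bool psi_is_monotone(int m, int steps) {
    const double pi = std::acos(-1.0);
    double prev = psi(0.0, m);
    for (int s = 1; s <= steps; ++s) {
        const double cur = psi(pi * s / steps, m);
        if (cur > prev + 1e-12) return false;  // small slack for rounding
        prev = cur;
    }
    return true;
}
```

For $m=4$ the function runs from $\psi(0)=1$ down to $\psi(\pi)=-7$, and for $m=1$ it reduces to $\cos\theta$.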

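$\quad$For the binary case (the multi-class case is similar), the decision boundaries of the three losses above are summarized in the following table: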

$$ \begin{array}{|c|l|} \hline \text{Loss Function} & \text{Decision Boundary} \\ \hline \text{Softmax Loss} & (W_1-W_2)x+b_1-b_2=0\\ \hline \text{Modified Softmax Loss} & \|x\|(\cos\theta_1-\cos\theta_2)=0 \\ \hline \text{A-Softmax Loss} & \text{Class 1: } \|x\|(\cos m\theta_1-\cos\theta_2)=0 \\ & \text{Class 2: } \|x\|(\cos \theta_1-\cos m\theta_2)=0\\ \hline \end{array} $$

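$\quad$The paper also gives the geometric interpretation of these three losses. The ordinary softmax (Euclidean Margin Loss) separates classes in Euclidean space: the classes map to different regions, and the decision surface is a plane in Euclidean space that divides them. Modified Softmax Loss differs from A-Softmax Loss in that its two classes share a single decision surface, whereas A-Softmax Loss has two separate decision surfaces whose separation grows with $m$, as shown in the figure below.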

2. Properties of A-Softmax Loss

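Property 1: A-Softmax Loss defines a learning task with a large angular margin. The larger $m$ is, the larger the angular margin and the smaller the region of the manifold available to each class, which makes the training task harder.
This property is easy to understand. As Figure 1 shows, the angular margin is $(m-1)\theta_1$, so a larger $m$ gives a larger margin; at the same time $m\theta_1<\pi$, so a larger $m$ leaves a smaller region $\theta_1$ for the class.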

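Definition 1: $m_{min}$ is the minimal value such that, whenever $m>m_{min}$, the maximal intra-class angular feature distance is smaller than the minimal inter-class angular feature distance.
Property 2: For binary classification, $m_{min}\geq2+\sqrt{3}$; for multi-class classification, $m_{min}\geq 3$.
Proof: 1. For the binary case, let $W_1$ and $W_2$ be the weights of class 1 and class 2, with angle $\theta_{12}$ between them, and let $x$ be the input feature. The angles between the weights and the input feature determine which class the feature belongs to. Without loss of generality, assume the feature belongs to class 1, so $m\theta_1<\theta_2$. When $x$ lies inside $\theta_{12}$, as in Figure 2, solving $m\theta_1=\theta_2$ gives the maximum of $\theta_1$ as $\theta_{max1}^{in}=\frac{\theta_{12}}{m+1}$.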

<center>Figure 2: $x$ inside $\theta_{12}$</center>

When $x$ lies outside $\theta_{12}$, there are two cases. In the first, $\theta_{12} \leq \frac{m-1}{m}\pi$, as in Figure 3, and solving $m\theta_1=\theta_2$ gives the maximum of $\theta_1$ as $\theta_{max1}^{out}=\frac{\theta_{12}}{m-1}$. In the second, $\theta_1$ and $\theta_2$ are not on the same side and $\theta_{12} > \frac{m-1}{m}\pi$, as in Figure 4, which gives $\theta_{max1}^{out}=\frac{2\pi-\theta_{12}}{m+1}$.

<center>Figure 3: $x$ outside $\theta_{12}$ (first case)</center>

<center>Figure 4: $x$ outside $\theta_{12}$ (second case)</center>

In both cases, the minimal inter-class angular feature distance is the $\theta_{inter}$ shown in Figure 5, so $\theta_{inter}=(m-1)\theta_1=\frac{m-1}{m+1}\theta_{12}$.

<center>Figure 5: the minimal inter-class distance</center>

The analysis above can be summarized by the following inequalities:

$$ \begin{split} \frac{\theta_{12}}{m-1} + \frac{\theta_{12}}{m+1} \leq \frac{m-1}{m+1}\theta_{12}, \quad \theta_{12} \leq \frac{m-1}{m}\pi \cr \frac{2\pi - \theta_{12}}{m+1} + \frac{\theta_{12}}{m+1} \leq \frac{m-1}{m+1}\theta_{12}, \quad \theta_{12} > \frac{m-1}{m}\pi \cr \end{split} \tag{2.1} $$

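Solving these inequalities gives $m_{min} \geq 2+\sqrt{3}$.
2. For the $K$-class problem ($K\geq 3$), let $\theta_i^{i+1}$ be the angle between the weights $W_i$ and $W_{i+1}$. The best case is clearly when the $W_i$ are uniformly distributed, so $\theta_i^{i+1}=\frac{2\pi}{K}$. The maximal intra-class distance and the minimal inter-class distance then satisfy: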

$$ \frac{\theta_{i}^{i+1}}{m+1} + \frac{\theta_{i-1}^{i}}{m+1} < \min\left\{\frac{(m-1)\theta_{i}^{i+1}}{m+1}, \frac{(m-1)\theta_{i-1}^{i}}{m+1}\right\} \tag{2.2} $$

This gives $m_{min} \geq 3$. Based on the discussion of $m_{min}$ above, the paper uses $m=4$.
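
Both bounds follow from a few lines of algebra. As a worked sketch, take the first case of $(2.1)$ and divide by $\theta_{12}>0$, then take $(2.2)$ with uniformly distributed weights, $\theta_i^{i+1}=\theta_{i-1}^{i}=\frac{2\pi}{K}$:

$$ \begin{split} \frac{1}{m-1} + \frac{1}{m+1} \leq \frac{m-1}{m+1} &\iff (m+1) + (m-1) \leq (m-1)^2 \cr &\iff m^2 - 4m + 1 \geq 0 \cr &\iff m \geq 2+\sqrt{3} \quad (\text{since } m \geq 2) \cr \frac{2}{m+1}\cdot\frac{2\pi}{K} < \frac{m-1}{m+1}\cdot\frac{2\pi}{K} &\iff 2 < m-1 \iff m > 3 \end{split} $$

Since $2+\sqrt{3}\approx 3.73$, the smallest integer satisfying both bounds is $4$, which is consistent with the choice $m=4$.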

3. Geometric Interpretation of A-Softmax

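In my view, A-Softmax rests on one assumption: different classes occupy different regions on the surface of a unit hypersphere. It also follows from the above that, geometrically, each weight represents a point on the surface of the unit hypersphere. During training, inputs of the same class, once mapped onto the surface, gradually gather around a center point (which in most cases plays the same role as the weight), while the weights (or center points) of different classes gradually spread apart. $m$ controls how tightly the points of one class gather, and thereby the distance between classes. Figure 6 shows how different values of $m$ affect the mapped distribution (the authors' figures look really nice; I have no idea how they drew them).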

<center>Figure 6: effect of different $m$ on the mapped distribution</center>

4. Reading the Source Code

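$\quad$The authors implemented A-Softmax in Caffe; see wy1iu/SphereFace for the details discussed here. In practice there is no need to implement $L_{ang}$ from $(1.4)$ directly: it is enough to add a MarginInnerProduct layer in front of the softmax output layer. The end of the file sphereface_model.prototxt is quoted below, where the extra layer is visible. The C++ code is in margin_inner_product_layer.cpp.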

############### A-Softmax Loss ##############
layer {
  name: "fc6"
  type: "MarginInnerProduct"
  bottom: "fc5"
  bottom: "label"
  top: "fc6"
  top: "lambda"
  param {
    lr_mult: 1
    decay_mult: 1
  }
  margin_inner_product_param {
    num_output: 10572
    type: QUADRUPLE
    weight_filler {
      type: "xavier"
    }
    base: 1000
    gamma: 0.12
    power: 1
    lambda_min: 5
    iteration: 0
  }
}
layer {
  name: "softmax_loss"
  type: "SoftmaxWithLoss"
  bottom: "fc6"
  bottom: "label"
  top: "softmax_loss"
}
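
The base, gamma, power and lambda_min entries above feed the lambda schedule computed in the first lines of Forward_cpu, quoted later in this post. The sketch below reproduces only that formula; the function name annealed_lambda is made up for illustration. Note that Forward_cpu increments its iteration counter before evaluating the formula, so the first forward pass already uses iter = 1.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// lambda(iter) = max(lambda_min, base * (1 + gamma * iter)^(-power)),
// mirroring the schedule at the top of Forward_cpu.
// The function name is illustrative and not part of the Caffe layer.
double annealed_lambda(double iter, double base, double gamma,
                       double power, double lambda_min) {
    assert(1.0 + gamma * iter > 0.0);  // keep the base of pow positive
    const double lambda = base * std::pow(1.0 + gamma * iter, -power);
    return std::max(lambda, lambda_min);
}
```

With the values from the prototxt (base 1000, gamma 0.12, power 1, lambda_min 5), lambda starts near 1000 and decays until it is clamped at 5. In Backward_cpu the margin term is scaled by 1/(1 + lambda) and the plain inner-product term by lambda/(1 + lambda), so a large lambda early in training keeps the layer close to an ordinary inner product.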

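$\quad$With the idea of the implementation clear, the key parts are the forward and backward passes. Most deep learning frameworks now support automatic differentiation (for example TensorFlow and MXNet's Gluon), but I still recommend writing the backward pass by hand, because automatic differentiation consumes GPU or host memory (depending on the device) and is generally less efficient than a hand-written backward pass. The forward pass involves the following details: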

$$ \begin{split} \cos \theta_{i,j} &= \frac{\vec{x_i}\cdot\vec{W_j}}{\|\vec{x_i}\|\cdot\|\vec{W_j}\|} = \frac{\vec{x_i}\cdot\vec{W_{norm_j}}}{\|\vec{x_i}\|} \cr \cos 2\theta &= 2\cos^2 \theta -1 \cr \cos 3\theta &= 4\cos^3 \theta -3 \cos \theta \cr \cos 4\theta &= 8\cos^4 \theta -8\cos^2 \theta + 1 \cr \end{split} \tag{4.1} $$

$$ M_{i,j} = \begin{cases} \|\vec{x_i}\|\cos \theta_{i,j} = \vec{x_i}\cdot\vec{W_{norm_j}}, & \text {if $j \neq y_i$ } \\ \|\vec{x_i}\|\psi(\theta_{i,j}), & \text{if $j = y_i$ } \end{cases} \tag{4.2} $$

$M$ is the output. In the code, $sign\_3\_=(-1)^k$ and $sign\_4\_=-2k$. The Caffe code is as follows:

template <typename Dtype>
void MarginInnerProductLayer<Dtype>::Forward_cpu(const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top) 
{
  iter_ += (Dtype)1.;
  Dtype base_ = this->layer_param_.margin_inner_product_param().base();
  Dtype gamma_ = this->layer_param_.margin_inner_product_param().gamma();
  Dtype power_ = this->layer_param_.margin_inner_product_param().power();
  Dtype lambda_min_ = this->layer_param_.margin_inner_product_param().lambda_min();
  lambda_ = base_ * pow(((Dtype)1. + gamma_ * iter_), -power_);
  lambda_ = std::max(lambda_, lambda_min_);
  top[1]->mutable_cpu_data()[0] = lambda_;
  
  /************************* normalize weight *************************/
  Dtype* norm_weight = this->blobs_[0]->mutable_cpu_data();
  Dtype temp_norm = (Dtype)0.;
  for (int i = 0; i < N_; i++) {
      temp_norm = caffe_cpu_dot(K_, norm_weight + i * K_, norm_weight + i * K_);
      temp_norm = (Dtype)1./sqrt(temp_norm);
      caffe_scal(K_, temp_norm, norm_weight + i * K_);
  }

  /************************* common variables *************************/
  // x_norm_ = |x|
  const Dtype* bottom_data = bottom[0]->cpu_data();
  const Dtype* weight = this->blobs_[0]->cpu_data();
  Dtype* mutable_x_norm_data = x_norm_.mutable_cpu_data();
  for (int i = 0; i < M_; i++) {
    mutable_x_norm_data[i] = sqrt(caffe_cpu_dot(K_, bottom_data + i * K_, bottom_data + i * K_));
  }
  Dtype* mutable_cos_theta_data = cos_theta_.mutable_cpu_data();
  caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasTrans, M_, N_, K_, (Dtype)1.,
      bottom_data, weight, (Dtype)0., mutable_cos_theta_data);
  for (int i = 0; i < M_; i++) {
    caffe_scal(N_, (Dtype)1./mutable_x_norm_data[i], mutable_cos_theta_data + i * N_);
  }
  // sign_0 = sign(cos_theta)
  caffe_cpu_sign(M_ * N_, cos_theta_.cpu_data(), sign_0_.mutable_cpu_data());

  /************************* optional variables *************************/
  switch (type_) {
  case MarginInnerProductParameter_MarginType_SINGLE:
    break;
  case MarginInnerProductParameter_MarginType_DOUBLE:
    // cos_theta_quadratic
    caffe_powx(M_ * N_, cos_theta_.cpu_data(), (Dtype)2., cos_theta_quadratic_.mutable_cpu_data());
    break;
  case MarginInnerProductParameter_MarginType_TRIPLE:
    // cos_theta_quadratic && cos_theta_cubic
    caffe_powx(M_ * N_, cos_theta_.cpu_data(), (Dtype)2., cos_theta_quadratic_.mutable_cpu_data());
    caffe_powx(M_ * N_, cos_theta_.cpu_data(), (Dtype)3., cos_theta_cubic_.mutable_cpu_data());
    // sign_1 = sign(abs(cos_theta) - 0.5)
    caffe_abs(M_ * N_, cos_theta_.cpu_data(), sign_1_.mutable_cpu_data());
    caffe_add_scalar(M_ * N_, -(Dtype)0.5, sign_1_.mutable_cpu_data());
    caffe_cpu_sign(M_ * N_, sign_1_.cpu_data(), sign_1_.mutable_cpu_data());
    // sign_2 = sign_0 * (1 + sign_1) - 2
    caffe_copy(M_ * N_, sign_1_.cpu_data(), sign_2_.mutable_cpu_data());
    caffe_add_scalar(M_ * N_, (Dtype)1., sign_2_.mutable_cpu_data());
    caffe_mul(M_ * N_, sign_0_.cpu_data(), sign_2_.cpu_data(), sign_2_.mutable_cpu_data());
    caffe_add_scalar(M_ * N_, - (Dtype)2., sign_2_.mutable_cpu_data());
    break;
  case MarginInnerProductParameter_MarginType_QUADRUPLE:
    // cos_theta_quadratic && cos_theta_cubic && cos_theta_quartic
    caffe_powx(M_ * N_, cos_theta_.cpu_data(), (Dtype)2., cos_theta_quadratic_.mutable_cpu_data());
    caffe_powx(M_ * N_, cos_theta_.cpu_data(), (Dtype)3., cos_theta_cubic_.mutable_cpu_data());
    caffe_powx(M_ * N_, cos_theta_.cpu_data(), (Dtype)4., cos_theta_quartic_.mutable_cpu_data());
    // sign_3 = sign_0 * sign(2 * cos_theta_quadratic_ - 1)
    caffe_copy(M_ * N_, cos_theta_quadratic_.cpu_data(), sign_3_.mutable_cpu_data());
    caffe_scal(M_ * N_, (Dtype)2., sign_3_.mutable_cpu_data());
    caffe_add_scalar(M_ * N_, (Dtype)-1., sign_3_.mutable_cpu_data());
    caffe_cpu_sign(M_ * N_, sign_3_.cpu_data(), sign_3_.mutable_cpu_data());
    caffe_mul(M_ * N_, sign_0_.cpu_data(), sign_3_.cpu_data(), sign_3_.mutable_cpu_data());
    // sign_4 = 2 * sign_0 + sign_3 - 3
    caffe_copy(M_ * N_, sign_0_.cpu_data(), sign_4_.mutable_cpu_data());
    caffe_scal(M_ * N_, (Dtype)2., sign_4_.mutable_cpu_data());
    caffe_add(M_ * N_, sign_4_.cpu_data(), sign_3_.cpu_data(), sign_4_.mutable_cpu_data());
    caffe_add_scalar(M_ * N_, - (Dtype)3., sign_4_.mutable_cpu_data());
    break;
  default:
    LOG(FATAL) << "Unknown margin type.";
  }
  // ... (the rest of Forward_cpu is not shown in this excerpt)
}
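
The sign bookkeeping in the QUADRUPLE branch above is compact, so here is a small standalone check of the claim that sign_3 equals $(-1)^k$ and sign_4 equals $-2k$ for $m=4$. This is a sketch with made-up helper names (sgn, psi4_from_cos, psi4_reference), not code from the repository. It computes $\psi$ from $\cos\theta$ alone, the way the layer does, and compares it with the interval-based definition.

```cpp
#include <cassert>
#include <cmath>

// Sign of v as -1, 0 or +1.
double sgn(double v) { return (v > 0.0) - (v < 0.0); }

// psi for m = 4 computed from c = cos(theta) only, following the
// sign_0 / sign_3 / sign_4 bookkeeping of the QUADRUPLE branch.
double psi4_from_cos(double c) {
    assert(c >= -1.0 && c <= 1.0);
    const double sign_0 = sgn(c);
    const double sign_3 = sign_0 * sgn(2.0 * c * c - 1.0);  // (-1)^k
    const double sign_4 = 2.0 * sign_0 + sign_3 - 3.0;      // -2k
    const double cos_4theta = 8.0 * c * c * c * c - 8.0 * c * c + 1.0;
    return sign_3 * cos_4theta + sign_4;
}

// Reference: psi(theta) = (-1)^k cos(4 theta) - 2k, k = floor(4 theta / pi).
double psi4_reference(double theta) {
    const double pi = std::acos(-1.0);
    int k = static_cast<int>(std::floor(4.0 * theta / pi));
    if (k > 3) k = 3;  // theta == pi belongs to the last interval
    const double sign = (k % 2 == 0) ? 1.0 : -1.0;
    return sign * std::cos(4.0 * theta) - 2.0 * k;
}
```

The two functions agree away from the interval boundaries. Exactly on a boundary, where $\cos\theta=0$ or $2\cos^2\theta-1=0$, the sign function returns 0 and the cosine-only version can differ from the reference.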

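The backward pass is more tedious to derive, and the authors' source code uses quite a few tricks during training, so it does not pass a gradient check. I write out the derivation here so that, when reading the code, you can tell which tricks the authors used. The authors explain that these tricks help the model converge stably, but give no theoretical justification.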

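When $y_i \neq j$ (note that the authors' code has two apparent errors in the derivative with respect to $W$: first, it differentiates only with respect to $W_{norm}$ rather than $W$; second, it does not distinguish between the $y_i \neq j$ and $y_i = j$ cases):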

$$ \begin{split} \frac{\partial M_{i,j}}{\partial x_{i,k}}&= \frac{\partial (\vec{x_i}\cdot\vec{W_{norm_j}})}{\partial x_{i,k}} = W_{norm_{k,j}} \cr \frac{\partial M_{i,j}}{\partial W_{k,j}}&= \frac{\partial (\vec{x_i}\cdot\vec{W_{j}}/\|\vec{W_j}\|)}{\partial W_{k,j}} = \frac{1}{\|\vec{W_j}\|}\frac{\partial (\vec{x_i}\cdot\vec{W_{j}})}{\partial W_{k,j}}+(\vec{x_i}\cdot\vec{W_{j}})\frac{\partial (1/\|\vec{W_j}\|)}{\partial W_{k,j}} \cr &= \frac{x_{i,k}}{\|\vec{W_j}\|} - \frac{W_{norm_{k,j}}\cos \theta_{i,j} \|\vec{x_i}\|}{\|\vec{W_j}\|} \end{split} \tag{4.3} $$
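
Since the authors' layer does not pass a gradient check, it can be reassuring to check the formula in $(4.3)$ on its own. The following is a minimal sketch with hypothetical function names that compares the analytic $\partial M_{i,j}/\partial W_{k,j}$ with a central finite difference for a single sample and a single non-target class.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// M(x, W) = x . W / |W|, the output for a non-target class (j != y_i).
double plain_output(const std::vector<double>& x, const std::vector<double>& w) {
    assert(x.size() == w.size());
    double dot = 0.0, w_sq = 0.0;
    for (std::size_t k = 0; k < x.size(); ++k) {
        dot += x[k] * w[k];
        w_sq += w[k] * w[k];
    }
    return dot / std::sqrt(w_sq);
}

// Analytic dM/dW_k from (4.3): x_k/|W| - W_norm_k * cos(theta) * |x| / |W|.
std::vector<double> grad_w_analytic(const std::vector<double>& x,
                                    const std::vector<double>& w) {
    assert(x.size() == w.size());
    double dot = 0.0, w_sq = 0.0, x_sq = 0.0;
    for (std::size_t k = 0; k < x.size(); ++k) {
        dot += x[k] * w[k];
        w_sq += w[k] * w[k];
        x_sq += x[k] * x[k];
    }
    const double w_norm = std::sqrt(w_sq), x_norm = std::sqrt(x_sq);
    const double cos_theta = dot / (w_norm * x_norm);
    std::vector<double> grad(x.size());
    for (std::size_t k = 0; k < x.size(); ++k) {
        grad[k] = x[k] / w_norm - (w[k] / w_norm) * cos_theta * x_norm / w_norm;
    }
    return grad;
}

// Central finite difference of M with respect to W_k (w is copied on purpose).
double grad_w_numeric(const std::vector<double>& x, std::vector<double> w,
                      std::size_t k, double eps = 1e-6) {
    w[k] += eps;
    const double plus = plain_output(x, w);
    w[k] -= 2.0 * eps;
    const double minus = plain_output(x, w);
    return (plus - minus) / (2.0 * eps);
}
```

Comparing grad_w_analytic(x, w)[k] with grad_w_numeric(x, w, k) for a few arbitrary vectors should give agreement to within the finite-difference error.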

Here I take only $m=4$ as an example. When $y_i=j$ and $m=4$:

$$ \begin{split} \text{Let} \quad M_{1,i,j}&=\|\vec{x_i}\|\cos(\theta_{i,j})\cr M_{i,j}&=\|\vec{x_i}\|\psi(\theta_{i,j}) = (-1)^k[8\|\vec{x_i}\|^{-3}M_{1,i,j}^4-8\|\vec{x_i}\|^{-1}M_{1,i,j}^2 + \|\vec{x_i}\|] - 2k\|\vec{x_i}\| \cr \frac{\partial M_{i,j}}{\partial x_{i,k}}&= ((-1)^k(-24\|\vec{x_i}\|^{-4}M_{1,i,j}^4 + 8 \|\vec{x_i}\|^{-2}M_{1,i,j}^2 + 1) -2k)\frac{\partial\|\vec{x_i}\|}{\partial x_{i,k}}\cr & + (-1)^k(32\|\vec{x_i}\|^{-3}M_{1,i,j}^3 - 16\|\vec{x_i}\|^{-1}M_{1,i,j})\frac{\partial M_{1,i,j}}{\partial x_{i,k}} \cr &= ((-1)^k(-24\cos^4 \theta_{i,j} + 8 \cos^2 \theta_{i,j} + 1) -2k)\frac{x_{i,k}}{\|\vec{x_i}\|}\cr & + (-1)^k(32\cos^3 \theta_{i,j} - 16\cos \theta_{i,j})W_{norm_{k,j}}\cr \frac{\partial M_{i,j}}{\partial W_{k,j}}&= (-1)^k(32\cos^3 \theta_{i,j} - 16\cos \theta_{i,j})(\frac{x_{i,k}}{\|\vec{W_j}\|} - \frac{W_{norm_{k,j}}\cos \theta_{i,j} \|\vec{x_i}\|}{\|\vec{W_j}\|})\cr \end{split} \tag{4.4} $$
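
The derivative with respect to $x$ in $(4.4)$ can be checked numerically in the same spirit. The sketch below assumes a unit-norm weight (so $W_{norm}=W$) and uses $\partial\|\vec{x}\|/\partial x_d = x_d/\|\vec{x}\|$ and $\partial M_1/\partial x_d = W_{norm_d}$. All names are hypothetical; to avoid the clash between the two meanings of $k$, the code writes q for the interval index of $\psi$ and d for the feature index.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Quantities shared by the output and its gradient for one sample.
struct AngleInfo {
    double x_norm;     // |x|
    double cos_theta;  // cosine of the angle between x and the unit weight
    double sign;       // (-1)^q
    int q;             // interval index: theta in [q*pi/4, (q+1)*pi/4]
};

AngleInfo angle_info(const std::vector<double>& x, const std::vector<double>& w_unit) {
    assert(x.size() == w_unit.size());
    double dot = 0.0, x_sq = 0.0;
    for (std::size_t d = 0; d < x.size(); ++d) {
        dot += x[d] * w_unit[d];
        x_sq += x[d] * x[d];
    }
    const double x_norm = std::sqrt(x_sq);
    double c = dot / x_norm;
    if (c > 1.0) c = 1.0;  // guard acos against rounding
    if (c < -1.0) c = -1.0;
    const double pi = std::acos(-1.0);
    int q = static_cast<int>(std::floor(4.0 * std::acos(c) / pi));
    if (q > 3) q = 3;
    return AngleInfo{x_norm, c, (q % 2 == 0) ? 1.0 : -1.0, q};
}

// M = |x| * psi(theta) for m = 4.
double target_output(const std::vector<double>& x, const std::vector<double>& w_unit) {
    const AngleInfo a = angle_info(x, w_unit);
    const double c2 = a.cos_theta * a.cos_theta;
    return a.x_norm * (a.sign * (8.0 * c2 * c2 - 8.0 * c2 + 1.0) - 2.0 * a.q);
}

// Analytic dM/dx_d = coeff_x * x_d / |x| + coeff_w * w_unit_d.
std::vector<double> grad_x_analytic(const std::vector<double>& x,
                                    const std::vector<double>& w_unit) {
    const AngleInfo a = angle_info(x, w_unit);
    const double c = a.cos_theta, c2 = c * c;
    const double coeff_x = a.sign * (-24.0 * c2 * c2 + 8.0 * c2 + 1.0) - 2.0 * a.q;
    const double coeff_w = a.sign * (32.0 * c2 * c - 16.0 * c);
    std::vector<double> grad(x.size());
    for (std::size_t d = 0; d < x.size(); ++d) {
        grad[d] = coeff_x * x[d] / a.x_norm + coeff_w * w_unit[d];
    }
    return grad;
}

// Central finite difference of M with respect to x_d (x is copied on purpose).
double grad_x_numeric(std::vector<double> x, const std::vector<double>& w_unit,
                      std::size_t d, double eps = 1e-6) {
    x[d] += eps;
    const double plus = target_output(x, w_unit);
    x[d] -= 2.0 * eps;
    const double minus = target_output(x, w_unit);
    return (plus - minus) / (2.0 * eps);
}
```

The check is only meaningful when $\theta$ is not on an interval boundary, because the finite difference must not cross from one interval to the next.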

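Note that $i$, $j$, $k$ above index the $i$-th sample, the $j$-th output feature, and the $k$-th input feature respectively (the $k$ in $(-1)^k$ and $-2k$ is the interval index from the definition of $\psi$, not the feature index). The above only derives the partial derivatives and does not yet cover how the gradient is propagated backward. If the gradient passed down from the layer above is $\Delta$, the gradient this layer passes to the layer below is $\delta$ (each feature of a sample accumulates over all outputs), and the weight update is $\zeta$ (each weight accumulates over all samples), then: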

$$ \begin{split} \delta_{i,k} = \sum_j \frac{\partial M_{i,j}}{\partial x_{i,k}}\Delta_{i,j} \cr \zeta_{k,j} = \sum_i \frac{\partial M_{i,j}}{\partial W_{k,j}}\Delta_{i,j} \end{split} \tag{4.5} $$

The Caffe code is as follows:

template <typename Dtype>
void MarginInnerProductLayer<Dtype>::Backward_cpu(const vector<Blob<Dtype>*>& top,
    const vector<bool>& propagate_down,
    const vector<Blob<Dtype>*>& bottom) {

  const Dtype* top_diff = top[0]->cpu_diff();
  const Dtype* bottom_data = bottom[0]->cpu_data();
  const Dtype* label = bottom[1]->cpu_data();
  const Dtype* weight = this->blobs_[0]->cpu_data();
 
  // Gradient with respect to weight
  if (this->param_propagate_down_[0]) {
    caffe_cpu_gemm<Dtype>(CblasTrans, CblasNoTrans, N_, K_, M_, (Dtype)1.,
        top_diff, bottom_data, (Dtype)1., this->blobs_[0]->mutable_cpu_diff());
  }
  
  // Gradient with respect to bottom data
  if (propagate_down[0]) {
    Dtype* bottom_diff = bottom[0]->mutable_cpu_diff();
    const Dtype* x_norm_data = x_norm_.cpu_data();
    caffe_set(M_ * K_, Dtype(0), bottom_diff);
    switch (type_) {
    case MarginInnerProductParameter_MarginType_SINGLE: {
      caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, M_, K_, N_, (Dtype)1.,
        top_diff, this->blobs_[0]->cpu_data(), (Dtype)0.,
        bottom[0]->mutable_cpu_diff());
      break;
    }
    case MarginInnerProductParameter_MarginType_DOUBLE: {
      const Dtype* sign_0_data = sign_0_.cpu_data();
      const Dtype* cos_theta_data = cos_theta_.cpu_data();
      const Dtype* cos_theta_quadratic_data = cos_theta_quadratic_.cpu_data();
      for (int i = 0; i < M_; i++) {
        const int label_value = static_cast<int>(label[i]);
        for (int j = 0; j < N_; j++) {
          if (label_value != j) {
            // 1 / (1 + lambda) * w
            caffe_cpu_axpby(K_, (Dtype)1. / ((Dtype)1. + lambda_) * top_diff[i * N_ + j], 
                            weight + j * K_, (Dtype)1., bottom_diff + i * K_);
          } else {
            // 4 * sign_0 * cos_theta * w
            Dtype coeff_w = (Dtype)4. * sign_0_data[i * N_ + j] * cos_theta_data[i * N_ + j];
            // 1 / (-|x|) * (2 * sign_0 * cos_theta_quadratic + 1) * x
            Dtype coeff_x = (Dtype)1. / (-x_norm_data[i]) * ((Dtype)2. * 
                            sign_0_data[i * N_ + j] * cos_theta_quadratic_data[i * N_ + j] + (Dtype)1.);
            Dtype coeff_norm = sqrt(coeff_w * coeff_w + coeff_x * coeff_x);
            coeff_w = coeff_w / coeff_norm;
            coeff_x = coeff_x / coeff_norm;
            caffe_cpu_axpby(K_, (Dtype)1. / ((Dtype)1. + lambda_) * top_diff[i * N_ + j] * coeff_w, 
                            weight + j * K_, (Dtype)1., bottom_diff + i * K_);
            caffe_cpu_axpby(K_, (Dtype)1. / ((Dtype)1. + lambda_) * top_diff[i * N_ + j] * coeff_x, 
                            bottom_data + i * K_, (Dtype)1., bottom_diff + i * K_);
          }
        }
      }
      // + lambda/(1 + lambda) * w
      caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, M_, K_, N_, lambda_/((Dtype)1. + lambda_),
        top_diff, this->blobs_[0]->cpu_data(), (Dtype)1.,
        bottom[0]->mutable_cpu_diff());
      break;
    }
    case MarginInnerProductParameter_MarginType_TRIPLE: {
      const Dtype* sign_1_data = sign_1_.cpu_data();
      const Dtype* sign_2_data = sign_2_.cpu_data();
      const Dtype* cos_theta_quadratic_data = cos_theta_quadratic_.cpu_data();
      const Dtype* cos_theta_cubic_data = cos_theta_cubic_.cpu_data();
      for (int i = 0; i < M_; i++) {
        const int label_value = static_cast<int>(label[i]);
        for (int j = 0; j < N_; j++) {
          if (label_value != j) {
            caffe_cpu_axpby(K_, (Dtype)1. / ((Dtype)1. + lambda_) * top_diff[i * N_ + j], 
                            weight + j * K_, (Dtype)1., bottom_diff + i * K_);
          } else {
            // sign_1 * (12 * cos_theta_quadratic - 3) * w
            Dtype coeff_w = sign_1_data[i * N_ + j] * ((Dtype)12. * 
                            cos_theta_quadratic_data[i * N_ + j] - (Dtype)3.);
            // 1 / (-|x|) * (8 * sign_1 * cos_theta_cubic - sign_2) * x
            Dtype coeff_x = (Dtype)1. / (-x_norm_data[i]) * ((Dtype)8. * sign_1_data[i * N_ + j] * 
                              cos_theta_cubic_data[i * N_ + j] - sign_2_data[i * N_ +j]);
            Dtype coeff_norm = sqrt(coeff_w * coeff_w + coeff_x * coeff_x);
            coeff_w = coeff_w / coeff_norm;
            coeff_x = coeff_x / coeff_norm;
            caffe_cpu_axpby(K_, (Dtype)1. / ((Dtype)1. + lambda_) * top_diff[i * N_ + j] * coeff_w, 
                            weight + j * K_, (Dtype)1., bottom_diff + i * K_);
            caffe_cpu_axpby(K_, (Dtype)1. / ((Dtype)1. + lambda_) * top_diff[i * N_ + j] * coeff_x, 
                            bottom_data + i * K_, (Dtype)1., bottom_diff + i * K_);
          }
        }
      }
      // + lambda/(1 + lambda) * w
      caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, M_, K_, N_, lambda_/((Dtype)1. + lambda_),
        top_diff, this->blobs_[0]->cpu_data(), (Dtype)1.,
        bottom[0]->mutable_cpu_diff());
      break;
    }
    case MarginInnerProductParameter_MarginType_QUADRUPLE: {
      const Dtype* sign_3_data = sign_3_.cpu_data();
      const Dtype* sign_4_data = sign_4_.cpu_data();
      const Dtype* cos_theta_data = cos_theta_.cpu_data();
      const Dtype* cos_theta_quadratic_data = cos_theta_quadratic_.cpu_data();
      const Dtype* cos_theta_cubic_data = cos_theta_cubic_.cpu_data();
      const Dtype* cos_theta_quartic_data = cos_theta_quartic_.cpu_data();
      for (int i = 0; i < M_; i++) {
        const int label_value = static_cast<int>(label[i]);
        for (int j = 0; j < N_; j++) {
          if (label_value != j) {
            caffe_cpu_axpby(K_, (Dtype)1. / ((Dtype)1. + lambda_) * top_diff[i * N_ + j], 
                            weight + j * K_, (Dtype)1., bottom_diff + i * K_);
          } else {
            // 1 / (1 + lambda) * sign_3 * (32 * cos_theta_cubic - 16 * cos_theta) * w
            Dtype coeff_w = sign_3_data[i * N_ + j] * ((Dtype)32. * cos_theta_cubic_data[i * N_ + j] -
                                (Dtype)16. * cos_theta_data[i * N_ + j]);
            // 1 / (-|x|) * (sign_3 * (24 * cos_theta_quartic - 8 * cos_theta_quadratic - 1) + 
            //                        sign_4) * x
            Dtype coeff_x = (Dtype)1. / (-x_norm_data[i]) * (sign_3_data[i * N_ + j] * 
                            ((Dtype)24. * cos_theta_quartic_data[i * N_ + j] - 
                            (Dtype)8. * cos_theta_quadratic_data[i * N_ + j] - (Dtype)1.) - 
                             sign_4_data[i * N_ + j]);
            Dtype coeff_norm = sqrt(coeff_w * coeff_w + coeff_x * coeff_x);
            coeff_w = coeff_w / coeff_norm;
            coeff_x = coeff_x / coeff_norm;
            caffe_cpu_axpby(K_, (Dtype)1. / ((Dtype)1. + lambda_) * top_diff[i * N_ + j] * coeff_w, 
                            weight + j * K_, (Dtype)1., bottom_diff + i * K_);
            caffe_cpu_axpby(K_, (Dtype)1. / ((Dtype)1. + lambda_) * top_diff[i * N_ + j] * coeff_x, 
                            bottom_data + i * K_, (Dtype)1., bottom_diff + i * K_);
          }
        }
      }
      // + lambda/(1 + lambda) * w
      caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, M_, K_, N_, lambda_/((Dtype)1. + lambda_),
        top_diff, this->blobs_[0]->cpu_data(), (Dtype)1.,
        bottom[0]->mutable_cpu_diff());
      break;
    }
    default: {
      LOG(FATAL) << "Unknown margin type.";
    }
    }
  }
}

5. Results of A-Softmax

Training uses the A-Softmax loss, but at test time (validation) the classification decision is based on cosine similarity, as shown in Figure 7:

The model used is shown in Figure 8:

The results are as follows (see the paper for the detailed comparison):

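A-Softmax works well on relatively small datasets, and its theory is reasonably interpretable. Its obvious drawback is a relatively high computational cost, which may be why the authors did not test on large datasets in the paper.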
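
The cosine-similarity decision used at test time can be sketched as follows. This is only an illustration under stated assumptions: the function names are made up, the zero-norm handling is a choice made for this sketch, and the threshold is a hypothetical tuning parameter rather than a value from the paper.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Cosine similarity between two feature vectors of equal length.
// Returns 0 when either vector has zero norm (a choice made for this sketch).
double cosine_similarity(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    double dot = 0.0, a_sq = 0.0, b_sq = 0.0;
    for (std::size_t d = 0; d < a.size(); ++d) {
        dot += a[d] * b[d];
        a_sq += a[d] * a[d];
        b_sq += b[d] * b[d];
    }
    if (a_sq == 0.0 || b_sq == 0.0) return 0.0;
    return dot / (std::sqrt(a_sq) * std::sqrt(b_sq));
}

// Treats two features as the same identity when their similarity exceeds
// a threshold. The threshold is a hypothetical parameter, not from the paper.
bool same_identity(const std::vector<double>& a, const std::vector<double>& b,
                   double threshold) {
    return cosine_similarity(a, b) > threshold;
}
```

Because the score depends only on the angle between the two features, rescaling either feature leaves the decision unchanged, which matches the angular view taken during training.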

6. Differences from L-Softmax

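The biggest difference between A-Softmax and L-Softmax is that A-Softmax normalizes the weights while L-Softmax does not. Weight normalization in A-Softmax maps the feature points onto the unit hypersphere, whereas L-Softmax has no such constraint, so the two have different geometric interpretations. Suppose that during training the input features of two classes fall in the same region, as in Figure 10. A-Softmax can separate the two classes only by angle, that is, only by direction; the result is shown in Figure 11. L-Softmax can separate the two classes not only by angle but also by the norm (length) of the weights; the result is shown in Figure 12. With a fixed dataset size, L-Softmax has two ways to separate the classes, and training may fail to separate them along both angle and length, so its accuracy may fall short of A-Softmax.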