Softmax Key Points in Three Minutes

Date: 2019-11-26


Personal website of 紅色石頭 (Red Stone): redstonewill.com

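1. What is Softmax

Softmax is used very widely in machine learning and deep learning. In particular, for multi-class problems (C > 2), the final output units of the classifier need the Softmax function to process their values. The Softmax function is defined as:

S_i=\frac{e^{V_i}}{\sum_{j=1}^{C}e^{V_j}}

Here V_i is the output of the classifier's preceding output unit, i is the class index, and C is the total number of classes. S_i is the ratio of the exponential of the current element to the sum of the exponentials of all elements. Softmax converts the multi-class output values into relative probabilities, which are easier to understand and compare. Let's look at the following example.

Consider a multi-class problem with C = 4. The final output layer of a linear classifier produces four output values: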

V=\left[
\begin{matrix}
-3 \\
2 \\
-1 \\
0
\end{matrix}
\right]

After Softmax, the values are converted into relative probabilities:

S=\left[
\begin{matrix}
0.0057 \\
0.8390 \\
0.0418 \\
0.1135
\end{matrix}
\right]

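Clearly, the Softmax output represents the relative probabilities of the different classes. We can see at a glance that S_1 = 0.8390 (counting class indices from 0) has the largest probability, so the prediction is most likely class 1. Softmax converts continuous values into relative probabilities, which is easier for us to interpret.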
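The numbers in this example can be reproduced with a few lines of NumPy. This is a minimal sketch; the helper name `softmax` is chosen here for illustration and does not come from the original code.

```python
import numpy as np

def softmax(v):
    """Return the softmax of a 1-D array of scores (illustrative helper)."""
    e = np.exp(v)
    return e / np.sum(e)

V = np.array([-3.0, 2.0, -1.0, 0.0])
S = softmax(V)
print(np.round(S, 4))       # relative probabilities of the four classes
print(int(np.argmax(S)))    # 0-based index of the most probable class
```

The rounded probabilities match the vector S above, and the largest entry sits at index 1.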

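In practice, Softmax requires care about numerical overflow. Because of the exponential, if the values in V are large, the exponentiated values can easily overflow. So V needs some preprocessing: subtract the maximum of V, D = max(V), from every element of V:

S_i=\frac{e^{V_i-D}}{\sum_{j=1}^{C}e^{V_j-D}}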
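Why the shift is safe: with D taken as the maximum element of V, the factor e^{-D} appears in both the numerator and the denominator and cancels. The result is unchanged, while every exponent is now at most 0:

```latex
\frac{e^{V_i-D}}{\sum_{j=1}^{C}e^{V_j-D}}
=\frac{e^{-D}\,e^{V_i}}{e^{-D}\sum_{j=1}^{C}e^{V_j}}
=\frac{e^{V_i}}{\sum_{j=1}^{C}e^{V_j}}
```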

The corresponding Python example code is:

import numpy as np

scores = np.array([123, 456, 789])    # example with 3 classes and each having large scores
scores -= np.max(scores)    # scores becomes [-666, -333, 0]
p = np.exp(scores) / np.sum(np.exp(scores))
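To see what goes wrong without the shift, the sketch below compares the unshifted computation with the shifted one on the same scores. The variable names `naive` and `stable` are illustrative only.

```python
import numpy as np

scores = np.array([123.0, 456.0, 789.0])

# Unshifted version: np.exp(789.0) exceeds the float64 range and becomes inf,
# so inf / inf yields nan for that entry. Warnings are silenced on purpose.
with np.errstate(over="ignore", invalid="ignore"):
    naive = np.exp(scores) / np.sum(np.exp(scores))

# Shifted version: the largest exponent is 0, so nothing overflows.
shifted = scores - np.max(scores)
stable = np.exp(shifted) / np.sum(np.exp(shifted))

print(naive)
print(stable)
```

The unshifted result contains nan, while the shifted result is a valid probability vector that sums to 1.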

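2. Softmax Loss Function

As we know, the output of a linear classifier is the matrix product of the input x and the weight coefficients: s = Wx. For multi-class problems, Softmax is applied to this linear output. This section looks at the Softmax loss function.

S_i=\frac{e^{s_{y_i}}}{\sum_{j=1}^{C}e^{s_j}}

Here s_{y_i} is the linear score of the correct class, and S_i is the Softmax output for the correct class.

Because the log function is monotonically increasing, taking the log of S_i does not change the ordering:

\log S_i=\log\frac{e^{s_{y_i}}}{\sum_{j=1}^{C}e^{s_j}}

We want S_i to be as large as possible, that is, the relative probability of the correct class should be as large as possible. Placing a negative sign in front of log S_i therefore gives the loss function:

L_i=-\log S_i=-\log\frac{e^{s_{y_i}}}{\sum_{j=1}^{C}e^{s_j}}

Simplifying further, the log cancels the exponential in the numerator:

L_i=-\log\frac{e^{s_{y_i}}}{\sum_{j=1}^{C}e^{s_j}}=-(s_{y_i}-\log\sum_{j=1}^{C}e^{s_j})=-s_{y_i}+\log\sum_{j=1}^{C}e^{s_j}

This turns the Softmax loss function into a simple form.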

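As a simple example, take the linear output obtained in the previous section:

V=\left[
\begin{matrix}
-3 \\
2 \\
-1 \\
0
\end{matrix}
\right]

Suppose class i = 1 (counting from 0) is the true class. Its loss, using the natural log, is:

L_i=-2+\log(e^{-3}+e^{2}+e^{-1}+e^{0})=0.1755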
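This value is easy to check numerically. The sketch below evaluates the simplified loss, minus the correct-class score plus the log of the sum of exponentials, for the scores [-3, 2, -1, 0]. The function name `softmax_loss` and its signature are illustrative assumptions.

```python
import numpy as np

def softmax_loss(scores, y):
    """Cross-entropy loss -s_y + log(sum_j exp(s_j)) for a single sample."""
    shifted = scores - np.max(scores)    # shift for numerical stability
    return -shifted[y] + np.log(np.sum(np.exp(shifted)))

V = np.array([-3.0, 2.0, -1.0, 0.0])
loss = softmax_loss(V, y=1)              # class index 1 is the true class
print(round(float(loss), 4))
```

The shift by the maximum does not change the loss, because it cancels between the two terms.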

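3. Softmax Backward Gradient

Having derived the Softmax loss function, we next differentiate it with respect to the weight parameters.

In the Softmax linear classifier, the linear output is:

s_i = W x_i

where the subscript i denotes the i-th sample.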
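For reference, here is a sketch of the gradient that the code in this section computes. It follows from differentiating the loss, minus the correct-class score plus the log of the sum of exponentials, with respect to each class score. The symbols are introduced here for illustration: s_k is the score of class k for sample i, p_k is the Softmax output for class k, the indicator equals 1 when k is the true class y_i and 0 otherwise, and w_k is the weight vector of class k (column k of W in the (D, C) layout used by the code).

```latex
\frac{\partial L_i}{\partial s_k}
= \frac{e^{s_k}}{\sum_{j=1}^{C} e^{s_j}} - \mathbb{1}[k = y_i]
= p_k - \mathbb{1}[k = y_i],
\qquad
\frac{\partial L_i}{\partial w_k}
= \left(p_k - \mathbb{1}[k = y_i]\right) x_i
```

So the correct class receives (p_k - 1) times x_i and every other class receives p_k times x_i.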

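The gradient computation can be programmed in two ways: with nested for loops, or directly with matrix operations.

Using nested for loops, the function that computes the gradient with respect to the weights W is defined as follows: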

import numpy as np

def softmax_loss_naive(W, X, y, reg):
    """
    Softmax loss function, naive implementation (with loops).

    Inputs have dimension D, there are C classes, and we operate on
    minibatches of N examples.

    Inputs:
    - W: A numpy array of shape (D, C) containing weights.
    - X: A numpy array of shape (N, D) containing a minibatch of data.
    - y: A numpy array of shape (N,) containing training labels; y[i] = c
      means that X[i] has label c, where 0 <= c < C.
    - reg: (float) regularization strength

    Returns a tuple of:
    - loss as single float
    - gradient with respect to weights W; an array of same shape as W
    """
    # Initialize the loss and gradient to zero.
    loss = 0.0
    dW = np.zeros_like(W)

    num_train = X.shape[0]
    num_classes = W.shape[1]
    for i in range(num_train):
        scores = X[i, :].dot(W)
        scores_shift = scores - np.max(scores)  # shift for numerical stability
        right_class = y[i]
        loss += -scores_shift[right_class] + np.log(np.sum(np.exp(scores_shift)))
        for j in range(num_classes):
            softmax_output = np.exp(scores_shift[j]) / np.sum(np.exp(scores_shift))
            if j == y[i]:
                dW[:, j] += (-1 + softmax_output) * X[i, :]
            else:
                dW[:, j] += softmax_output * X[i, :]

    # Average over the minibatch and add L2 regularization.
    loss /= num_train
    loss += 0.5 * reg * np.sum(W * W)
    dW /= num_train
    dW += reg * W

    return loss, dW

Using matrix operations, the function that computes the gradient with respect to the weights W is defined as follows:

import numpy as np

def softmax_loss_vectorized(W, X, y, reg):
    """
    Softmax loss function, vectorized version.

    Inputs and outputs are the same as softmax_loss_naive.
    """
    num_train = X.shape[0]

    # Class scores, shifted row-wise for numerical stability.
    scores = X.dot(W)
    scores_shift = scores - np.max(scores, axis=1).reshape(-1, 1)
    softmax_output = np.exp(scores_shift) / np.sum(np.exp(scores_shift), axis=1).reshape(-1, 1)

    # Mean cross-entropy loss plus L2 regularization.
    loss = -np.sum(np.log(softmax_output[np.arange(num_train), y]))
    loss /= num_train
    loss += 0.5 * reg * np.sum(W * W)

    # Gradient with respect to the scores, then with respect to W.
    dS = softmax_output.copy()
    dS[np.arange(num_train), y] += -1
    dW = (X.T).dot(dS)
    dW = dW / num_train + reg * W

    return loss, dW
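A common way to gain confidence in an analytic gradient like this one is to compare it with a finite-difference estimate. The sketch below does so on small random data. It is self-contained, so it carries its own compact re-implementation of the vectorized loss; the names `softmax_loss` and `numerical_gradient`, the random data, and the step size h are all assumptions made for illustration.

```python
import numpy as np

def softmax_loss(W, X, y, reg):
    """Compact vectorized softmax loss and gradient (illustrative re-implementation)."""
    N = X.shape[0]
    scores = X.dot(W)
    scores -= scores.max(axis=1, keepdims=True)      # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)
    loss = -np.log(probs[np.arange(N), y]).mean() + 0.5 * reg * np.sum(W * W)
    dS = probs.copy()
    dS[np.arange(N), y] -= 1.0
    dW = X.T.dot(dS) / N + reg * W
    return loss, dW

def numerical_gradient(f, W, h=1e-5):
    """Central-difference estimate of df/dW, one entry at a time."""
    grad = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        old = W[idx]
        W[idx] = old + h
        f_plus = f(W)
        W[idx] = old - h
        f_minus = f(W)
        W[idx] = old                                  # restore the entry
        grad[idx] = (f_plus - f_minus) / (2.0 * h)
    return grad

rng = np.random.default_rng(0)
N, D, C = 8, 5, 3                                     # tiny synthetic problem
X = rng.standard_normal((N, D))
y = rng.integers(0, C, size=N)
W = 0.01 * rng.standard_normal((D, C))
reg = 0.1

loss, dW = softmax_loss(W, X, y, reg)
dW_num = numerical_gradient(lambda w: softmax_loss(w, X, y, reg)[0], W)
max_abs_err = np.max(np.abs(dW - dW_num))
print("max abs difference:", max_abs_err)
```

If the analytic gradient is correct, the printed difference should be tiny compared with the gradient entries themselves.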

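In practice, matrix operations are much faster than nested loops, especially when there are many training samples. We compare the two gradient implementations on about 5000 samples from the CIFAR-10 dataset: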

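import time

# W, X_train and y_train are assumed to be defined already
# (the weights and the CIFAR-10 training split; not shown here).
tic = time.time()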
loss_naive, grad_naive = softmax_loss_naive(W, X_train, y_train, 0.000005)
toc = time.time()
print('naive loss: %e computed in %fs' % (loss_naive, toc - tic))

tic = time.time()
loss_vectorized, grad_vectorized = softmax_loss_vectorized(W, X_train, y_train, 0.000005)
toc = time.time()
print('vectorized loss: %e computed in %fs' % (loss_vectorized, toc - tic))

grad_difference = np.linalg.norm(grad_naive - grad_vectorized, ord='fro')
print('Loss difference: %f' % np.abs(loss_naive - loss_vectorized))
print('Gradient difference: %f' % grad_difference)

The output is:

naive loss: 2.362135e+00 computed in 14.680000s
vectorized loss: 2.362135e+00 computed in 0.242000s
Loss difference: 0.000000
Gradient difference: 0.000000

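Clearly, in this example the matrix version is about 60 times faster than the nested loops (14.68 s versus 0.242 s). So when writing machine learning algorithms, prefer matrix operations over nested loops wherever possible to speed up computation.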

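4. Softmax and SVM

The loss function of the Softmax linear classifier is computed from relative probabilities and is also called the cross-entropy loss. The main difference between the linear SVM classifier and the Softmax linear classifier is the loss function. SVM uses the hinge loss, which cares about the margin (Δ = 1) between the score of the correct class and the scores of the incorrect classes: once the margin reaches Δ, the loss no longer depends on how large the gap is, so the details are ignored. In Softmax, by contrast, the score of every class affects the loss. For example, take C = 3 classes and two samples with scores [10, -10, -10] and [10, 9, 9], where the true label is class 0. For SVM, both L_i are 0; for Softmax, the two L_i are 0.00 and 0.55, a large difference.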
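The comparison can be checked directly. The sketch below computes both losses for the two score vectors, assuming the usual multiclass hinge loss with Δ = 1 (the sum over incorrect classes of max(0, s_j - s_y + Δ)). The function names are illustrative.

```python
import numpy as np

def hinge_loss(scores, y, delta=1.0):
    """Multiclass SVM loss: sum over wrong classes of max(0, s_j - s_y + delta)."""
    margins = np.maximum(0.0, scores - scores[y] + delta)
    margins[y] = 0.0                     # the correct class does not contribute
    return float(np.sum(margins))

def softmax_loss(scores, y):
    """Cross-entropy loss -s_y + log(sum_j exp(s_j)) for a single sample."""
    shifted = scores - np.max(scores)
    return float(-shifted[y] + np.log(np.sum(np.exp(shifted))))

for s in ([10.0, -10.0, -10.0], [10.0, 9.0, 9.0]):
    s = np.array(s)
    print(s, "hinge: %.2f" % hinge_loss(s, 0), "softmax: %.2f" % softmax_loss(s, 0))
```

The hinge loss is zero for both samples, while the Softmax loss still distinguishes the confident prediction from the borderline one.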

I covered the linear SVM classifier in my previous article:

基於線性SVM的CIFAR-10圖像集分類 (CIFAR-10 Image Classification with a Linear SVM)

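Next, consider the effect of the regularization parameter λ on Softmax. Regularization limits the magnitude of the weights W to prevent overfitting; the larger λ is, the stronger the constraint on W. For example, suppose the linear output of a 3-class problem is [1, -2, 0], so that the Softmax output is [0.7, 0.04, 0.26]. If the correct class is class 0, then 0.7 is clearly much larger than 0.04 and 0.26.

With a regularization parameter λ, the magnitude of W is constrained, so the linear output shrinks proportionally, for example to [0.5, -1, 0], and the Softmax output becomes [0.55, 0.12, 0.33]. The gap in relative probability between the correct class and the incorrect classes is clearly smaller.

In other words, the larger λ is, the closer together the Softmax outputs of the different classes become. A large λ in effect "evens out" the relative probabilities of the correct and incorrect classes. Note, however, that the ordering of the probabilities does not change, so the optimization of the loss is not affected either.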
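The effect of shrinking the scores can be sketched numerically. The code below scales the scores [1, -2, 0] by a few factors and prints the resulting probabilities. The factor 0.1 is an extra value added here for illustration, and the helper name `softmax` is an assumption.

```python
import numpy as np

def softmax(v):
    """Numerically stable softmax of a 1-D score vector."""
    e = np.exp(v - np.max(v))
    return e / np.sum(e)

scores = np.array([1.0, -2.0, 0.0])
for scale in (1.0, 0.5, 0.1):
    p = softmax(scale * scores)
    print(scale, np.round(p, 2), "argmax:", int(np.argmax(p)))
```

As the scale decreases the probabilities move toward uniform, but class 0 remains the most probable class in every case.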

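5. Softmax in Practice

We use a Softmax linear classifier to classify the CIFAR-10 image set.

Using cross-validation (here, a held-out validation set), we choose the best learning rate and regularization parameter: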

# Use the validation set to tune hyperparameters (regularization strength and
# learning rate). You should experiment with different ranges for the learning
# rates and regularization strengths; if you are careful you should be able to
# get a classification accuracy of over 0.35 on the validation set.
# Note: the Softmax class and the X_train/y_train and X_val/y_val splits are
# assumed to be defined elsewhere; they are not shown in this article.
results = {}
best_val = -1
best_softmax = None
learning_rates = [1.4e-7, 1.5e-7, 1.6e-7]
regularization_strengths = [8000.0, 9000.0, 10000.0, 11000.0, 18000.0, 19000.0, 20000.0, 21000.0]

for lr in learning_rates:
   for reg in regularization_strengths:
       softmax = Softmax()
       loss = softmax.train(X_train, y_train, learning_rate=lr, reg=reg, num_iters=3000)
       y_train_pred = softmax.predict(X_train)
       training_accuracy = np.mean(y_train == y_train_pred)
       y_val_pred = softmax.predict(X_val)
       val_accuracy = np.mean(y_val == y_val_pred)
       if val_accuracy > best_val:
           best_val = val_accuracy
           best_softmax = softmax
       results[(lr, reg)] = training_accuracy, val_accuracy
   
# Print out results.
for lr, reg in sorted(results):
   train_accuracy, val_accuracy = results[(lr, reg)]
   print('lr %e reg %e train accuracy: %f val accuracy: %f' % (
               lr, reg, train_accuracy, val_accuracy))
   
print('best validation accuracy achieved during cross-validation: %f' % best_val)

After training, evaluate on the test image set:

# Evaluate the best softmax classifier on the test set
y_test_pred = best_softmax.predict(X_test)
test_accuracy = np.mean(y_test == y_test_pred)
print('softmax on raw pixels final test set accuracy: %f' % (test_accuracy, ))

softmax on raw pixels final test set accuracy: 0.386000

The code to visualize the weight parameters W is as follows:

import numpy as np
import matplotlib.pyplot as plt

# Visualize the learned weights for each class
w = best_softmax.W[:-1, :]  # strip out the bias
w = w.reshape(32, 32, 3, 10)

w_min, w_max = np.min(w), np.max(w)

classes = ['plane', 'car', 'bird', 'cat', 'deer', 'dog', 'frog', 'horse', 'ship', 'truck']
for i in range(10):
    plt.subplot(2, 5, i + 1)

    # Rescale the weights to be between 0 and 255
    wimg = 255.0 * (w[:, :, :, i].squeeze() - w_min) / (w_max - w_min)
    plt.imshow(wimg.astype('uint8'))
    plt.axis('off')
    plt.title(classes[i])

plt.show()