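This series adds personal study notes, or supplementary derivations, to selected points of the original course. If you find errors, corrections are welcome. After working through Andrew Ng's course, I organized it into text to make it easier to look up and review. Because I am also studying English, the series is primarily in English, and I suggest readers likewise treat the English as primary and the Chinese as an aid, which lays the groundwork for reading academic papers in related fields later on. - ZJ
When reposting, please credit the author and source: ZJ, WeChat official account "SelfImprovementLab"
Zhihu: https://zhuanlan.zhihu.com/c_147249273
CSDN: http://blog.csdn.net/junjun_zhao/article/details/79122927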
3.9 Training a Softmax classifier
(Subtitle source: NetEase Cloud Classroom)
In the last video, you learned about the Softmax layer and the Softmax activation function. In this video, you'll deepen your understanding of Softmax classification and also learn how to train a model that uses a Softmax layer. Recall our earlier example where the output layer computes $z^{[L]}$ as follows. We have four classes, $C = 4$, so $z^{[L]}$ can be a (4,1)-dimensional vector, $z^{[L]} = [5, 2, -1, 3]^T$, and we said we compute $t$, a temporary variable that performs element-wise exponentiation: $t = [e^{5}, e^{2}, e^{-1}, e^{3}]^T$. Finally, if the activation function for your output layer, $g^{[L]}$, is the Softmax activation function, then your outputs will be $a^{[L]} = g^{[L]}(z^{[L]}) = t / \sum_{j=1}^{4} t_j$. It's basically taking the temporary variable $t$ and normalizing it to sum to 1. So this then becomes $a^{[L]} = [0.842, 0.042, 0.002, 0.114]^T$. So you notice that in the $z$ vector the biggest element was 5, and the biggest probability ends up being this first probability.
上一個視頻中我們學習了 Softmax 層,和 Softmax 激活函數,在這個視頻中 你將更深入地瞭解 Softmax 分類,並學習如何訓練一個使用了 Softmax 層的模型,回憶一下我們之前舉的例子,輸出層計算出的 如下,我們有四個分類,C 等於 4。 可以是 4*1維向量,我們計算了臨時變量 ,對元素進行冪運算,最後 如果你的輸出層的激活函數 ,是 Softmax 激活函數,那麼輸出就會是這樣的,簡單來說就是用臨時變量 將它歸一化,使總和爲 1,於是這就變成了 ,你注意到在向量 中 最大的元素是 5,而最大的概率也就是第一種概率。
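The forward computation described above can be sketched in NumPy. This is a minimal illustration, not the course's reference code: the function name `softmax` is my own, and subtracting the maximum before exponentiating is a common numerical-stability habit that the lecture does not mention (it leaves the result unchanged).

```python
import numpy as np

def softmax(z):
    """Column-wise Softmax for z of shape (C,) or (C, m)."""
    # Assumption: shift by the max to avoid overflow in exp; the output is identical.
    t = np.exp(z - np.max(z, axis=0, keepdims=True))   # element-wise exponentiation
    return t / np.sum(t, axis=0, keepdims=True)        # normalize so entries sum to 1

z = np.array([5.0, 2.0, -1.0, 3.0])   # the z vector from the example
a = softmax(z)
print(np.round(a, 3))  # [0.842 0.042 0.002 0.114]
```

The largest entry of `z` (5) maps to the largest probability (about 0.842), as in the lecture.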
The name Softmax comes from contrasting it with what's called a hard max, which would have taken the vector $z$ and mapped it to the vector $[1, 0, 0, 0]^T$. So the hard max function looks at the elements of $z$ and just puts a 1 in the position of the biggest element of $z$ and 0s everywhere else. This is a very hard max, where the biggest element gets an output of 1 and everything else gets an output of 0. In contrast, a Softmax is a more gentle mapping from $z$ to these probabilities. I'm not sure if this is a great name, but at least that was the intuition behind why we call it a Softmax, all in contrast to the hard max. One thing I didn't really show but had alluded to is that Softmax regression, or the Softmax activation function, generalizes the logistic activation function to $C$ classes rather than just two classes. It turns out that if $C = 2$, then Softmax essentially reduces to logistic regression. I'm not going to prove this in this video, but the rough outline of the proof is that if $C = 2$ and you apply Softmax, then the output layer, $a^{[L]}$, will output two numbers, so maybe it outputs 0.842 and 0.158, right? These two numbers always have to sum to 1, and because of that they're actually redundant. Maybe you don't need to bother to compute two of them; maybe you just need to compute one of them. And it turns out that the way you end up computing that number reduces to the way that logistic regression computes its single output. So that wasn't much of a proof, but the takeaway is that Softmax regression is a generalization of logistic regression to more than two classes.
Softmax 這個名稱的來源是與所謂 hard max 對比,hard max 會把向量 z 變成這個向量,hard max 函數會觀察 z 的元素,然後在 z 中最大元素的位置放上 1,其他位置放上 0,所以這是一個很硬 (hard) 的 max,也就是最大的元素的輸出爲 1,其他的輸出都爲 0,與之相反, Softmax 所做的從 z 到這些概率的映射更爲溫和,我不知道這是不是一個好名字,但至少這就是 Softmax 這一名稱背後所包含的想法,與 hard max 正好相反,有一點我沒有細講 但之前已經提到過的,就是 Softmax 迴歸或 Softmax 激活函數,將 logistic 激活函數推廣到 C 類 而不僅僅是兩類,結果就是如果 C 等於 2 那麼 C 等於 2 的 Softmax 實際上變回到了 logistic 迴歸,我不會在這個視頻中給出證明,但是大致的證明思路是這樣的,如果 C 等於 2 並且你應用了 Softmax ,那麼輸出層 將會輸出兩個數字,如果 C 等於 2 的話,也許它會輸出 0.842 和 0.158 對吧,這兩個數字加起來要等於 1,因爲它們的和必須爲 1 其實它們是冗餘的,也許你不需要計算兩個,而只需要計算其中一個,結果就是你最終計算那個數字的方式又回到了,logistic 迴歸計算單個輸出的方式,這算不上是一個證明 但我們可以從中得出結論, Softmax 迴歸將 logistic 迴歸推廣到了兩種分類以上。
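To make the two claims above concrete, here is a small sketch under my own naming (`hardmax`, `sigmoid`, and the two-element vector `z_two` are illustrative, not from the course). It contrasts hard max with Softmax on the same vector, then checks numerically that with $C = 2$ the first Softmax output equals the logistic function applied to $z_1 - z_2$, which follows from $e^{z_1} / (e^{z_1} + e^{z_2}) = 1 / (1 + e^{-(z_1 - z_2)})$.

```python
import numpy as np

def softmax(z):
    t = np.exp(z - np.max(z))
    return t / np.sum(t)

def hardmax(z):
    """1 at the position of the largest element of z, 0 everywhere else."""
    out = np.zeros_like(z)
    out[np.argmax(z)] = 1.0
    return out

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

z = np.array([5.0, 2.0, -1.0, 3.0])
hard = hardmax(z)   # all of the mass on the largest element
soft = softmax(z)   # a gentler mapping; the largest element still gets the most mass

# C = 2 case with arbitrary illustrative values.
z_two = np.array([1.3, -0.4])
p_softmax = softmax(z_two)[0]
p_logistic = sigmoid(z_two[0] - z_two[1])
print(np.isclose(p_softmax, p_logistic))  # True
```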
Now let's look at how you would actually train a neural network with a Softmax output layer. In particular, let's define the loss function you use to train your neural network. Let's take an example. Say you have an example in your training set where the target output, the ground truth label, is $y = [0, 1, 0, 0]^T$. Following the example from the previous video, this means that this is an image of a cat, because it falls into Class 1. And now let's say that your neural network is currently outputting $\hat{y}$, a vector of probabilities that sum to 1, equal to $[0.3, 0.2, 0.1, 0.4]^T$. You can check that this sums to 1, and this is going to be $a^{[L]}$. So the neural network is not doing very well on this example, because this is actually a cat and it assigned only a 20% chance that this is a cat. So what's the loss function you would want to use to train this neural network? In Softmax classification, the loss we typically use is $L(\hat{y}, y) = -\sum_{j=1}^{4} y_j \log \hat{y}_j$. It's really a sum from 1 to $C$ in the general case; we're just going to use 4 here. Let's look at our single example above to better understand what happens. Notice that in this example $y_1 = y_3 = y_4 = 0$, because those are 0s, and only $y_2 = 1$. So if you look at this summation, all of the terms with $y_j = 0$ are equal to 0.
接下來我們來看,怎樣訓練帶有 Softmax 輸出層的神經網絡,具體而言,我們先定義訓練神經網絡時會用到的損失函數,舉個例子,我們來看看訓練集中某個樣本的目標輸出,真實標籤是0 1 0 0,用上一個視頻中講到過的例子,這表示這是一張貓的圖片 因爲它屬於類 1,現在我們假設你的神經網絡輸出的是 等於, 是一個包括總和爲 1 的概率的向量,0.1 0.4 你可以看到總和爲 1 這就是 ,對於這個樣本 神經網絡的表現不佳,這實際上是一隻貓 但卻只分配到 20% 是貓的概率,所以在本例中表現不佳,那麼你想用什麼損失函數來訓練這個神經網絡?在 Softmax 分類中,我們一般用到的損失函數是負的 j 從 1 到 4 的和,實際上一般來說是從 1 到 C 的和,我們這裏就用 4 ,我們來看上面的單個樣本,來更好地理解整個過程,注意在這個樣本中, ,因爲這些都是 0 只有 ,如果你看這個求和,所有含有值爲 0 的 的項都等於 0。
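The loss on this single example can be evaluated directly. The sketch below uses the $y$ and $\hat{y}$ values from the lecture; the function name `softmax_loss` is my own choice.

```python
import numpy as np

def softmax_loss(y_hat, y):
    """Loss for one example: -sum_j y_j * log(y_hat_j)."""
    return -np.sum(y * np.log(y_hat))

y = np.array([0.0, 1.0, 0.0, 0.0])       # ground truth: Class 1 (cat)
y_hat = np.array([0.3, 0.2, 0.1, 0.4])   # current network output
loss = softmax_loss(y_hat, y)
print(round(float(loss), 3))  # 1.609
```

Only the term for the true class survives, so the result equals $-\log 0.2$. A prediction that put 0.9 on the cat class would instead give $-\log 0.9 \approx 0.105$.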
And the only term you're left with is $-y_2 \log \hat{y}_2$, because when you sum over the indices $j$, all the terms end up 0 except when $j$ is equal to 2. And because $y_2 = 1$, this is just $-\log \hat{y}_2$. So what this means is that if your learning algorithm is trying to make this loss small, because you use gradient descent to try to reduce the loss on your training set, then the only way to make the loss small is to make $-\log \hat{y}_2$ small. And the only way to do that is to make $\hat{y}_2$ as big as possible. These are probabilities, so they can never be bigger than 1. But this kind of makes sense: because $x$ for this example is the picture of a cat, you want that output probability to be as big as possible. So more generally, what this loss function does is look at whatever is the ground truth class in your training set and try to make the corresponding probability of that class as high as possible. If you're familiar with maximum likelihood estimation in statistics, this turns out to be a form of maximum likelihood estimation. But if you don't know what that means, don't worry about it. The intuition we just talked about will suffice. Now, this is the loss on a single training example. How about the cost $J$ on the entire training set? The cost of a setting of the parameters, of all the weights and biases, is defined as pretty much what you'd guess: the sum over your entire training set of the loss of your learning algorithm's predictions, scaled by $1/m$, that is, $J(W^{[1]}, b^{[1]}, \ldots) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)})$. And so what you do is use gradient descent to try to minimize this cost.
最後只剩下 ,因爲當你按照下標 j 全部加起來,所有的項都爲 0 除了 j 等於 2 時,又因爲 所以它就等於 ,這就意味着,如果你的學習算法試圖將它變小,因爲梯度下降法是用來減少訓練集的損失的,要使它變小的唯一方式就是使它變小,要想做到這一點 就需要使 儘可能大,因爲這些是概率 所以不可能比 1 大,但這的確也講得通,因爲在這個例子中 x 是貓的圖片,你就需要這項輸出的概率儘可能地大,概括來講 損失函數所做的就是,它找到你的訓練集中的真實類別,然後試圖使該類別相應的概率儘可能地高,如果你熟悉統計學中的最大似然估計,這其實就是最大似然估計的一種形式,但如果你不知道那是什麼意思 也不用擔心,用我們剛剛講過的算法思維也足夠了,這是單個訓練樣本的損失,整個訓練集的損失 J 又如何呢,也就是設定參數的代價之類的,還有各種形式的偏差的代價,它的定義你大致也能猜到,就是整個訓練集損失的總和,把你的訓練算法對所有訓練樣本的預測都加起來,因此你要做的就是用梯度下降法,使這裏的損失最小化。
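For readers who want the maximum likelihood connection spelled out, here is a short sketch. It assumes the standard statistical reading, which the lecture does not derive: $\hat{y}$ is treated as the parameter vector of a categorical distribution over the $C$ classes, $y$ is one-hot, and the $m$ training examples are independent.

```latex
% Likelihood of a one-hot label y under predicted probabilities \hat{y}
p(y \mid x) = \prod_{j=1}^{C} \hat{y}_j^{\,y_j}

% Taking the negative log recovers the Softmax loss
-\log p(y \mid x) = -\sum_{j=1}^{C} y_j \log \hat{y}_j = L(\hat{y}, y)

% With independent examples, minimizing the cost J maximizes the joint likelihood
J = \frac{1}{m} \sum_{i=1}^{m} L\big(\hat{y}^{(i)}, y^{(i)}\big)
  = -\frac{1}{m} \log \prod_{i=1}^{m} p\big(y^{(i)} \mid x^{(i)}\big)
```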
Finally, one more implementation detail. Notice that because $C$ is equal to 4, $y$ is a 4 by 1 vector, and $\hat{y}$ is also a 4 by 1 vector. So if you're using a vectorized implementation, the matrix capital $Y$ is going to be $y^{(1)}, y^{(2)}$, through $y^{(m)}$, stacked horizontally: $Y = [y^{(1)}, y^{(2)}, \ldots, y^{(m)}]$. For example, if the example up here is your first training example, then the first column of this matrix $Y$ will be $[0, 1, 0, 0]^T$, and then maybe the second example is a dog, maybe the third example is a none of the above, and so on. This matrix $Y$ will then end up being a 4 by $m$ dimensional matrix. Similarly, $\hat{Y}$ will be $\hat{y}^{(1)}$ through $\hat{y}^{(m)}$ stacked horizontally, $\hat{Y} = [\hat{y}^{(1)}, \hat{y}^{(2)}, \ldots, \hat{y}^{(m)}]$. Here $\hat{y}^{(1)}$ is the output on the first training example, so the first column of $\hat{Y}$ holds these values 0.3, 0.2, 0.1, and 0.4, and so on. And $\hat{Y}$ itself will also be a 4 by $m$ dimensional matrix.
最後還有一個實現細節,注意因爲 是一個 向量, 也是一個 向量,如果你使用向量化實現,矩陣大寫 就是 到 的橫向排列,例如如果上面這個樣本是你的第一個訓練樣本,那麼矩陣 的第一列就是0 1 0 0,也許第二個樣本是一隻狗,也許第三個樣本是以上均不符合 等等,那麼這個矩陣 最終就是一個 維矩陣,類似的 就是 …橫向排列 一直到 ,這個其實就是 或是第一個訓練樣本的輸出,那麼 就是0.3 0.2 0.1 0.4 等等, 本身也是一個 維矩陣。
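The stacking described above, together with the cost over the whole set, might look like this in NumPy. It is a sketch under stated assumptions: the helper names `one_hot` and `cost` are mine, class labels are 0-based integers (0 = none of the above, 1 = cat, 2 = dog), and every column of `Y_hat` after the first is made up for illustration.

```python
import numpy as np

def one_hot(labels, C):
    """Stack one-hot label columns horizontally: returns Y of shape (C, m)."""
    m = labels.shape[0]
    Y = np.zeros((C, m))
    Y[labels, np.arange(m)] = 1.0   # row = class index, column = example index
    return Y

def cost(Y_hat, Y):
    """J = (1/m) * sum over examples of -sum_j y_j * log(y_hat_j)."""
    m = Y.shape[1]
    return -np.sum(Y * np.log(Y_hat)) / m

labels = np.array([1, 2, 0])        # cat, dog, none of the above
Y = one_hot(labels, C=4)            # shape (4, 3)

# First column is the lecture's y_hat; the other two are hypothetical.
Y_hat = np.array([[0.3, 0.1, 0.7],
                  [0.2, 0.1, 0.1],
                  [0.1, 0.6, 0.1],
                  [0.4, 0.2, 0.1]])
J = cost(Y_hat, Y)
```

Because `Y` is one-hot, the element-wise product picks out one log-probability per column, so no explicit loop over examples is needed.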
Finally, let's take a look at how you'd implement gradient descent when you have a Softmax output layer. This output layer will compute $z^{[L]}$, which is $C$ by 1, in our example 4 by 1, and then you apply the Softmax activation function to get $a^{[L]}$, or $\hat{y}$. That in turn allows you to compute the loss. So we've talked about how to implement the forward propagation step of a neural network to get these outputs and to compute that loss. How about the backpropagation step, or gradient descent? It turns out that the key equation you need to initialize backprop is this expression for the derivative with respect to $z$ at the last layer: $dz^{[L]} = \hat{y} - y$, that is, $\hat{y}$, the 4 by 1 vector, minus $y$, the 4 by 1 vector. So you notice that all of these are going to be 4 by 1 vectors when you have 4 classes, and $C$ by 1 in the more general case. Going by our usual definition of $dz$, this is the partial derivative of the cost function with respect to $z^{[L]}$, $dz^{[L]} = \partial J / \partial z^{[L]}$. If you're an expert in calculus, you can try to derive this yourself, but using this formula will also work just fine if you need to implement this from scratch. With this, you can compute $dz^{[L]}$ and then start off the backprop process to compute all the derivatives you need throughout your neural network. But it turns out that in this week's programming exercise, we'll start to use one of the deep learning programming frameworks, and with those frameworks you usually just need to focus on getting the forward prop right. So long as you specify the forward prop pass to the programming framework, the framework will figure out how to do back prop, how to do the backward pass, for you.
最後我們來看一下,在有 Softmax 輸出層時如何實現梯度下降法,這個輸出層會計算 它是 的,在這個例子中是 ,然後你用 Softmax 激活函數來得到 或者說 ,然後又能由此算出損失,我們已經講了如何實現神經網絡前向傳播的步驟,來得到這些輸出 並計算損失,那麼反向傳播步驟或者梯度下降法又如何呢?其實初始化反向傳播,所需的關鍵步驟或者說關鍵方程是這個表達式,對於最後一層的 的導數 其實,你可以用 這個 向量減去 這個 向量,你可以看到這些都會是 向量,當你有 4 個分類時,在一般情況下就是 ,這符合我們對 的一般定義,這是對於 的損失函數的偏導數,如果你精通微積分 就可以自己推導,或者說如果你精通微積分,可以試着自己推導,但是如果你需要從零開始使用這個公式,它也一樣有用,有了這個 你就可以計算 ,然後開始反向傳播的過程,計算整個神經網絡中所需的所有導數,但是在這周的初級練習中,我們將開始使用一種深度學習編程框架,對於這些編程框架,通常你只需專注於把前向傳播做對,只要你將它指明爲編程框架 前向傳播,它自己會弄明白怎樣反向傳播,會幫你實現反向傳播。
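One way to build confidence in the expression $dz^{[L]} = \hat{y} - y$ without doing the calculus is a finite-difference check on a single example. The sketch below is my own illustration, not part of the course material; the step size `eps` and all function names are assumptions.

```python
import numpy as np

def softmax(z):
    t = np.exp(z - np.max(z))
    return t / np.sum(t)

def loss_from_z(z, y):
    """Loss as a function of the pre-activation z of the output layer."""
    return -np.sum(y * np.log(softmax(z)))

z = np.array([5.0, 2.0, -1.0, 3.0])
y = np.array([0.0, 1.0, 0.0, 0.0])

dz_formula = softmax(z) - y   # the expression from the lecture: y_hat - y

# Central finite differences, perturbing one coordinate of z at a time.
eps = 1e-6
dz_numeric = np.zeros_like(z)
for j in range(z.size):
    step = np.zeros_like(z)
    step[j] = eps
    dz_numeric[j] = (loss_from_z(z + step, y) - loss_from_z(z - step, y)) / (2 * eps)

max_diff = np.max(np.abs(dz_formula - dz_numeric))
print(max_diff < 1e-6)  # True
```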
So this expression is worth keeping in mind if you ever need to implement Softmax regression, or Softmax classification, from scratch, although you won't actually need it in this week's programming exercise, because the programming framework you use will take care of this derivative computation for you. So that's it for Softmax classification. With it, you can now implement learning algorithms to categorize inputs into not just one of two classes, but one of $C$ different classes. Next, I want to show you some of the deep learning programming frameworks, which can make you much more efficient in terms of implementing deep learning algorithms. Let's go on to the next video to discuss that.
這個表達式值得牢記,如果你需要從頭開始,實現 Softmax 迴歸或者 Softmax 分類,但其實在這周的初級練習中你不會用到它,因爲編程框架會幫你搞定導數計算, Softmax 分類就講到這裏,有了它 你就可以運用學習算法,將輸入分成不止兩類,而是 C 個不同類別,接下來我想向你展示一些深度學習編程框架,可以讓你在實現深度學習算法時更加高效,讓我們在下個視頻中一起討論。
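As an illustration of what "from scratch" could look like, here is a minimal sketch of Softmax regression (a single Softmax layer with no hidden layers) trained by batch gradient descent. Everything about the setup is an assumption made for the example: the synthetic data, the class centers, the learning rate, and the iteration count. Only the forward pass, the cost, and the $dZ = \hat{Y} - Y$ step come from the lecture.

```python
import numpy as np

def softmax(Z):
    T = np.exp(Z - np.max(Z, axis=0, keepdims=True))
    return T / np.sum(T, axis=0, keepdims=True)

rng = np.random.default_rng(0)
C, n, m = 4, 2, 200   # classes, input features, training examples

# Synthetic data: each class is a noisy cloud around a made-up center.
centers = np.array([[2.0, 2.0], [-2.0, 2.0], [-2.0, -2.0], [2.0, -2.0]])
labels = rng.integers(0, C, size=m)
X = (centers[labels] + 0.5 * rng.standard_normal((m, n))).T   # shape (n, m)
Y = np.zeros((C, m))
Y[labels, np.arange(m)] = 1.0                                 # one-hot, shape (C, m)

W = np.zeros((C, n))
b = np.zeros((C, 1))
learning_rate = 0.1
costs = []
for _ in range(200):
    Y_hat = softmax(W @ X + b)                     # forward prop
    costs.append(-np.sum(Y * np.log(Y_hat)) / m)   # cost J
    dZ = Y_hat - Y                                 # key backprop equation
    W -= learning_rate * (dZ @ X.T) / m
    b -= learning_rate * np.sum(dZ, axis=1, keepdims=True) / m

accuracy = np.mean(np.argmax(softmax(W @ X + b), axis=0) == labels)
```

With a deep learning framework, only the forward pass and the cost would need to be written; the three gradient lines would be generated automatically.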
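Understanding Softmax
Why is it called Softmax? Taking the earlier example, the computation from $z^{[L]}$ to $a^{[L]}$ is:
$$z^{[L]} = \begin{bmatrix} 5 \\ 2 \\ -1 \\ 3 \end{bmatrix} \;\rightarrow\; t = \begin{bmatrix} e^{5} \\ e^{2} \\ e^{-1} \\ e^{3} \end{bmatrix} \;\rightarrow\; a^{[L]} = \frac{t}{\sum_{j=1}^{4} t_j} = \begin{bmatrix} 0.842 \\ 0.042 \\ 0.002 \\ 0.114 \end{bmatrix}$$
When deciding a model's output class, we usually take the class that corresponds to the largest output value. In other words, the position of the largest value is set to 1 and every other position is set to 0, which is the so-called "hardmax". Softmax instead turns the largest number, 5, into the largest probability, 0.842. Compared with "hardmax", the output is more "soft" and less "hard".
Softmax regression generalizes logistic regression from binary classification to multi-class classification.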
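The Softmax loss function
When a Softmax layer is used, the target value $y$ and the probabilities $\hat{y}$ output at some point before training has finished are:
$$y = \begin{bmatrix} 0 \\ 1 \\ 0 \\ 0 \end{bmatrix}, \qquad \hat{y} = a^{[L]} = \begin{bmatrix} 0.3 \\ 0.2 \\ 0.1 \\ 0.4 \end{bmatrix}$$
The loss function used with Softmax is:
$$L(\hat{y}, y) = -\sum_{j=1}^{4} y_j \log \hat{y}_j$$
During training, the goal is to minimize the loss function. From the target value we know that $y_1 = y_3 = y_4 = 0$ and $y_2 = 1$, so substituting into $L(\hat{y}, y)$ gives:
$$L(\hat{y}, y) = -\sum_{j=1}^{4} y_j \log \hat{y}_j = -y_2 \log \hat{y}_2 = -\log \hat{y}_2$$
So, to minimize the loss function, the goal becomes making the probability $\hat{y}_2$ as large as possible.
In other words, the loss function finds the true class of each training example and pushes the probability of that class as high as possible. This is a form of maximum likelihood estimation.
The corresponding cost function is:
$$J(W^{[1]}, b^{[1]}, \ldots) = \frac{1}{m} \sum_{i=1}^{m} L(\hat{y}^{(i)}, y^{(i)})$$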
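Gradient descent with Softmax
The gradient at the Softmax layer is computed as:
$$dz^{[L]} = \frac{\partial J}{\partial z^{[L]}} = \hat{y} - y$$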
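The lecture leaves the derivation of this gradient as an exercise. A short sketch for a single example follows, writing $a = \hat{y} = a^{[L]}$ and $z = z^{[L]}$, using the Kronecker delta $\delta_{kj}$ (my notation, not the course's), and assuming $y$ is one-hot so that $\sum_k y_k = 1$.

```latex
% Softmax and its Jacobian
a_k = \frac{e^{z_k}}{\sum_{i=1}^{C} e^{z_i}}, \qquad
\frac{\partial a_k}{\partial z_j} = a_k \left( \delta_{kj} - a_j \right)

% Chain rule through the loss L = -\sum_k y_k \log a_k
\frac{\partial L}{\partial z_j}
  = -\sum_{k=1}^{C} \frac{y_k}{a_k} \, a_k \left( \delta_{kj} - a_j \right)
  = -y_j + a_j \sum_{k=1}^{C} y_k
  = a_j - y_j

% In vector form
dz^{[L]} = \hat{y} - y
```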