Andrew Ng Deep Learning Course Notes

Date: 2019-11-30


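Course 1: Neural Networks and Deep Learning

1. Introduction to Deep Learning

1.2 What Is a Neural Network

Starting from housing price prediction: regression can be viewed as a simple single-layer neural network with a single neuron.

1.3 Supervised Learning with Neural Networks

1.4 Why Deep Learning Is Taking Off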

Data
Computation
Algorithms: for example, switching from sigmoid to ReLU makes gradient descent run faster

2. Neural Network Basics

2.1 Binary Classification

some notations ...

2.2 Logistic Regression

Logistic regression is a shallow neural network (in fact it has no hidden layer at all, only an output layer).

Given\ x,\ want\ \hat y = P(y=1|x);\ (0 \le \hat y \le 1)

Parameters:\ w\in \mathbb{R}^{n_x},\ b\in \mathbb{R};\ (n_x:\ number\ of\ input\ features)

Output:\ \hat y=\sigma(w^Tx+b);\ find\ w,b

\sigma(z)=\frac{1}{1+e^{-z}}
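
As a minimal sketch of the definitions above, the following computes \hat y = \sigma(w^T x + b) for one example. The helper name `predict_proba` and the toy values of `w`, `b`, and `x` are assumptions made for illustration, not part of the course material.

```python
import numpy as np

def sigmoid(z):
    """Logistic function: maps any real z into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

def predict_proba(w, b, x):
    """Return y_hat = sigma(w^T x + b) for one example x of shape (n_x, 1)."""
    z = np.dot(w.T, x) + b
    return sigmoid(z).item()

# Hypothetical toy values, chosen only for illustration.
w = np.array([[0.5], [-0.25]])
b = 0.1
x = np.array([[2.0], [4.0]])
y_hat = predict_proba(w, b, x)   # here z = 0.1, so y_hat is slightly above 0.5
```

Keeping `w` and `x` as explicit column vectors makes the shape of `w.T @ x` a predictable (1, 1).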

2.3 Logistic Regression Loss Function

This loss function is used because it makes gradient descent easier: the resulting cost function is convex.

Loss\ (Error)\ Function:\ L(\hat y,y) = - (y\log\hat y + (1-y)\log(1-\hat y))

Cost\ Function:\ J(w,b) = \frac{1}{m}\sum_{i=1}^m L(\hat y^{(i)},y^{(i)}) = -\frac{1}{m}\sum_{i=1}^m\left(y^{(i)}\log\hat y^{(i)} + (1-y^{(i)})\log(1-\hat y^{(i)})\right)
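
The cost function can be sketched in NumPy as follows. The function name `cross_entropy_cost` and the toy predictions are assumptions for illustration; the (1, m) layout simply puts one example per column.

```python
import numpy as np

def cross_entropy_cost(y_hat, y):
    """Average cross-entropy loss over m examples; y_hat and y have shape (1, m)."""
    m = y.shape[1]
    losses = -(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
    return float(np.sum(losses) / m)

# Hypothetical labels and two sets of predictions.
y = np.array([[1, 0, 1]])
y_hat_good = np.array([[0.9, 0.1, 0.8]])   # confident and correct
y_hat_bad = np.array([[0.1, 0.9, 0.2]])    # confident and wrong
cost_good = cross_entropy_cost(y_hat_good, y)
cost_bad = cross_entropy_cost(y_hat_bad, y)
```

Predictions that are confident and wrong are penalized much more heavily than predictions that are confident and right, which is the behavior the loss is designed to have.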

2.4 Gradient Descent

w := w - \alpha \frac{\partial J(w,b)}{\partial w};\ (\alpha:\ learning\ rate)

b := b - \alpha \frac{\partial J(w,b)}{\partial b}

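2.7 Computation Graph

Backpropagation is somewhat like dynamic programming: gradients are computed from back to front, so intermediate results can be reused, which greatly improves computational efficiency.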
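
To illustrate the reuse of intermediate results, here is a small sketch on an assumed example function J = 3(a + bc). The function and the name `forward_backward` are chosen for illustration only.

```python
def forward_backward(a, b, c):
    """Forward and backward pass through the graph u = b*c, v = a + u, J = 3*v."""
    # Forward pass: compute and keep the intermediate values.
    u = b * c
    v = a + u
    J = 3 * v
    # Backward pass: start from the output and walk back, reusing earlier results.
    dJ_dv = 3.0
    dJ_da = dJ_dv * 1.0      # v = a + u, so dv/da = 1
    dJ_du = dJ_dv * 1.0      # v = a + u, so dv/du = 1
    dJ_db = dJ_du * c        # reuses dJ_du; u = b*c, so du/db = c
    dJ_dc = dJ_du * b        # reuses dJ_du; u = b*c, so du/dc = b
    return J, {"a": dJ_da, "b": dJ_db, "c": dJ_dc}

J, grads = forward_backward(5.0, 3.0, 2.0)
```

`dJ_du` is computed once and then used for both `dJ_db` and `dJ_dc`, which is the dynamic-programming flavor mentioned above.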

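2.9 Gradient Descent for Logistic Regression

\text{In the figure, } a \text{ is the } \hat y \text{ defined earlier}

Quotient rule: the numerator of the result is the derivative of the original numerator times the original denominator, minus the derivative of the original denominator times the original numerator; the denominator of the result is the square of the original denominator.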
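
As a worked example, applying the quotient rule to the sigmoid function (with numerator f(z) = 1 and denominator g(z) = 1 + e^{-z}, names introduced here only for the derivation) gives the derivative used in the logistic regression gradients:

```latex
\sigma(z) = \frac{1}{1+e^{-z}}, \qquad f(z) = 1,\quad g(z) = 1+e^{-z}

\sigma'(z) = \frac{f'(z)\,g(z) - g'(z)\,f(z)}{g(z)^2}
           = \frac{0\cdot(1+e^{-z}) - (-e^{-z})\cdot 1}{(1+e^{-z})^2}
           = \frac{e^{-z}}{(1+e^{-z})^2}
           = \sigma(z)\,(1-\sigma(z))
```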

2.10 Logistic Regression on m Examples

2.11 Vectorization

Vectorized computation is more efficient.

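import time

import numpy as np

# Two random vectors with one million entries each
a = np.random.rand(1000000)
b = np.random.rand(1000000)

# Time the vectorized dot product
tic = time.perf_counter()
c = np.dot(a, b)
toc = time.perf_counter()
print(f"cost {(toc - tic) * 1000:.3f} ms")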

2.13 Vectorizing Logistic Regression
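
A minimal sketch of one vectorized gradient descent step for logistic regression, assuming the usual layout with one example per column (X of shape (n_x, m), Y of shape (1, m)). The name `gradient_step`, the toy data, and the learning rate are illustrative assumptions.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradient_step(w, b, X, Y, alpha):
    """One vectorized gradient descent step; returns the updated w, b and the cost before the step."""
    m = X.shape[1]
    A = sigmoid(np.dot(w.T, X) + b)          # predictions for all m examples, shape (1, m)
    dZ = A - Y                               # dJ/dz for every example
    dw = np.dot(X, dZ.T) / m                 # shape (n_x, 1), no loop over examples
    db = float(np.sum(dZ)) / m
    cost = float(-np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A)) / m)
    return w - alpha * dw, b - alpha * db, cost

# Hypothetical toy data: the label is 1 when the single feature is positive.
X = np.array([[-2.0, -1.0, 1.0, 2.0]])
Y = np.array([[0, 0, 1, 1]])
w, b = np.zeros((1, 1)), 0.0
costs = []
for _ in range(100):
    w, b, cost = gradient_step(w, b, X, Y, alpha=0.5)
    costs.append(cost)
```

The only explicit loop left is over iterations; the sums over examples and features are handled by `np.dot` and `np.sum`.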

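2.15 Broadcasting in Python

2.16 A Note on Python/NumPy Vectors

Do not use rank-1 arrays. Use explicit 1*n or n*1 vectors, and use reshape and assert to make sure the dimensions are what you expect.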

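import numpy as np

a = np.random.randn(5)       # rank-1 array, shape (5,): do not use
print("a:", a.shape, "\n", a)
b = np.random.randn(5, 1)    # column vector, shape (5, 1)
print("b:", b.shape, "\n", b)
c = np.random.randn(1, 5)    # row vector, shape (1, 5)
print("c:", c.shape, "\n", c)

# Make the shape explicit, then check it
a = a.reshape(5, 1)
assert a.shape == (5, 1)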
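
A short sketch of why rank-1 arrays are easy to misuse: transposing one does nothing, so an expression that looks like an outer product silently becomes an inner product. The variable names are illustrative.

```python
import numpy as np

a = np.arange(5.0)             # rank-1 array, shape (5,)
col = a.reshape(5, 1)          # explicit column vector, shape (5, 1)

# Transposing a rank-1 array leaves its shape unchanged.
same_shape_after_T = a.T.shape == a.shape

inner = np.dot(a, a.T)         # a scalar (inner product), not a 5x5 matrix
outer = np.dot(col, col.T)     # a 5x5 outer product, as intended
```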

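3. Shallow Neural Networks

3.1 Neural Network Overview

3.2 Neural Network Representation

3.5 Explanation of the Vectorized Implementation

3.6 Activation Functions

3.7 Why Use Nonlinear Activation Functions

If the activations are linear, the network is still a linear function after any number of layers, so stacking layers is pointless.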
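
This can be checked numerically. The sketch below builds a two-layer network with identity (linear) activations and shows that it equals a single linear layer. The shapes and random values are arbitrary assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))
W2, b2 = rng.standard_normal((2, 4)), rng.standard_normal((2, 1))
X = rng.standard_normal((3, 10))           # 10 examples with 3 features each

# Two layers with linear activations.
two_layer = W2 @ (W1 @ X + b1) + b2

# The same function written as one linear layer.
W_eq, b_eq = W2 @ W1, W2 @ b1 + b2
one_layer = W_eq @ X + b_eq
```

`two_layer` and `one_layer` agree up to floating-point error, so the extra layer adds no expressive power.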

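3.8 Derivatives of Activation Functions

3.9 Derivatives of Activation Functions

3.11 Random Initialization

With multiple neurons, why can W not be initialized to the zero matrix? Because every hidden unit would then compute the same function and receive the same gradient, so the units would stay identical throughout training.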
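
The symmetry problem can be seen in a small experiment. This is a sketch under assumed settings (a tiny network with a tanh hidden layer and a sigmoid output, toy data, and a hypothetical helper `train_hidden_weights`), not the course's reference code.

```python
import numpy as np

def train_hidden_weights(W1, W2, X, Y, steps=50, alpha=0.1):
    """Run gradient descent on a tiny 2-layer net and return the trained W1."""
    b1 = np.zeros((W1.shape[0], 1))
    b2 = np.zeros((1, 1))
    m = X.shape[1]
    for _ in range(steps):
        A1 = np.tanh(W1 @ X + b1)                        # hidden activations
        A2 = 1.0 / (1.0 + np.exp(-(W2 @ A1 + b2)))       # sigmoid output
        dZ2 = A2 - Y
        dW2 = dZ2 @ A1.T / m
        db2 = np.sum(dZ2, axis=1, keepdims=True) / m
        dZ1 = (W2.T @ dZ2) * (1 - A1 ** 2)               # tanh'(z) = 1 - tanh(z)^2
        dW1 = dZ1 @ X.T / m
        db1 = np.sum(dZ1, axis=1, keepdims=True) / m
        W1, b1 = W1 - alpha * dW1, b1 - alpha * db1
        W2, b2 = W2 - alpha * dW2, b2 - alpha * db2
    return W1

rng = np.random.default_rng(0)
X = rng.standard_normal((2, 8))                          # hypothetical toy inputs
Y = (X[0:1] + X[1:2] > 0).astype(float)                  # hypothetical toy labels

W1_zero = train_hidden_weights(np.zeros((3, 2)), np.zeros((1, 3)), X, Y)
W1_rand = train_hidden_weights(rng.standard_normal((3, 2)) * 0.01,
                               rng.standard_normal((1, 3)) * 0.01, X, Y)

# Each row of W1 holds the weights of one hidden unit.
rows_identical_zero = np.allclose(W1_zero, W1_zero[0])   # units never differentiate
rows_identical_rand = np.allclose(W1_rand, W1_rand[0])   # small random init breaks the symmetry
```

With all-zero weights the hidden units remain identical after training (in this setup they stay at zero), while small random values give each unit its own weights.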

4. Deep Neural Networks

4.1 Deep Neural Networks

4.3 Checking Matrix Dimensions

4.7 Parameters vs. Hyperparameters

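Course 2: Improving Deep Neural Networks: Hyperparameter Tuning, Regularization and Optimization

1. Practical Aspects of Deep Learning

1.1 Training, Development, and Test Sets

1.2 Bias and Variance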

1.4 Regularization

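What happens when lambda is very large: the weights are pushed toward zero, so the network behaves like a much simpler model and tends to underfit (high bias).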
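
One way to see the effect of lambda is through the L2-regularized update, which shrinks the weights on every step ("weight decay"). The sketch below assumes the common L2 form with penalty (lambda / 2m) * ||W||^2; the function names and toy numbers are illustrative.

```python
import numpy as np

def l2_penalty(weights, lam, m):
    """L2 term added to the cost: (lam / (2m)) * sum of squared weights."""
    return lam / (2 * m) * sum(np.sum(W ** 2) for W in weights)

def l2_update(W, dW, alpha, lam, m):
    """Gradient step with the L2 term; equivalent to scaling W by (1 - alpha*lam/m) first."""
    return W - alpha * (dW + (lam / m) * W)

W = np.array([[1.0, -2.0], [0.5, 3.0]])
dW = np.zeros_like(W)          # zero data gradient, to isolate the effect of the penalty
W_small_lam = l2_update(W, dW, alpha=0.1, lam=0.1, m=10)
W_large_lam = l2_update(W, dW, alpha=0.1, lam=50.0, m=10)
```

With the larger lambda a single step halves every weight here, while the smaller lambda barely changes them.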

1.6 Dropout Regularization

1.8 Other Regularization Methods

early stopping

1.9 Normalizing inputs

1.10 vanishing/exploding gradients

1.11 Weight Initialization

1.13 Gradient Check

1.14 Gradient Check Implementation Notes

2. Optimization Algorithms

2.1 Mini-batch gradient descent

The batch size should fit in CPU/GPU memory.
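
A minimal sketch of splitting a training set into shuffled mini-batches, assuming one example per column. The helper name `random_mini_batches` and the toy arrays are illustrative assumptions.

```python
import numpy as np

def random_mini_batches(X, Y, batch_size, seed=0):
    """Shuffle the columns (examples) of X and Y together, then split them into mini-batches."""
    m = X.shape[1]
    perm = np.random.default_rng(seed).permutation(m)
    X_shuf, Y_shuf = X[:, perm], Y[:, perm]
    return [(X_shuf[:, k:k + batch_size], Y_shuf[:, k:k + batch_size])
            for k in range(0, m, batch_size)]

# Toy data: the first row of X equals Y, so the pairing is easy to check after shuffling.
X = np.arange(20.0).reshape(2, 10)
Y = np.arange(10.0).reshape(1, 10)
batches = random_mini_batches(X, Y, batch_size=4)
```

The last mini-batch is smaller when the number of examples is not a multiple of the batch size.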

2.3 Exponentially weighted averages

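A moving average smooths out short-term fluctuations and brings out long-term trends or cycles. Mathematically, a moving average can be viewed as a kind of convolution.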

Bias correction
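
A small sketch of an exponentially weighted average with and without bias correction. The function name `ewa` and the constant toy series are assumptions for illustration.

```python
def ewa(values, beta=0.9, bias_correction=True):
    """Exponentially weighted average: v_t = beta * v_{t-1} + (1 - beta) * theta_t."""
    v, out = 0.0, []
    for t, theta in enumerate(values, start=1):
        v = beta * v + (1 - beta) * theta
        # Dividing by (1 - beta^t) compensates for starting from v_0 = 0.
        out.append(v / (1 - beta ** t) if bias_correction else v)
    return out

constant = [10.0] * 5
raw = ewa(constant, bias_correction=False)   # starts far below 10 and climbs slowly
corrected = ewa(constant)                    # matches the true value from the first step
```

The correction matters mostly in the first few steps; as t grows, beta^t goes to zero and both versions agree.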

2.6 Gradient Descent with Momentum

2.7 RMSprop

2.8 Adam Optimization Algorithm

Momentum + RMSprop
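
A sketch of a single Adam update, combining a momentum-style first moment with an RMSprop-style second moment and bias correction on both. The function name `adam_step`, the toy objective f(w) = w^2, and the learning rate used in the loop are assumptions for illustration.

```python
import numpy as np

def adam_step(w, grad, m, v, t, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; m and v are the running first and second moments, t starts at 1."""
    m = beta1 * m + (1 - beta1) * grad            # momentum part
    v = beta2 * v + (1 - beta2) * grad ** 2       # RMSprop part
    m_hat = m / (1 - beta1 ** t)                  # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Toy problem: minimize f(w) = w^2, whose gradient is 2w.
w = np.array([5.0])
m, v = np.zeros_like(w), np.zeros_like(w)
for t in range(1, 2001):
    grad = 2 * w
    w, m, v = adam_step(w, grad, m, v, t, alpha=0.05)
```

On the very first step the bias-corrected ratio is close to the sign of the gradient, so the parameter moves by roughly `alpha` regardless of the gradient's scale.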

2.9 Learning rate decay

Ways to gradually reduce the learning rate
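
Two common schedules, sketched under assumed names and hyperparameter values (`alpha0`, `decay_rate`, and `base` are illustrative, not values from the notes):

```python
def inverse_time_decay(alpha0, decay_rate, epoch):
    """alpha = alpha0 / (1 + decay_rate * epoch)."""
    return alpha0 / (1 + decay_rate * epoch)

def exponential_decay(alpha0, base, epoch):
    """alpha = alpha0 * base^epoch, with 0 < base < 1."""
    return alpha0 * base ** epoch

# Learning rate over the first four epochs with alpha0 = 0.2 and decay_rate = 1.
schedule = [inverse_time_decay(0.2, 1.0, epoch) for epoch in range(4)]
```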

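2.10 The Problem of Local Optima

In high-dimensional spaces, saddle points are common, whereas true local optima are actually rare.

Plateaus are a problem because learning becomes very slow, but methods like Adam can mitigate this.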

3. Hyperparameter Tuning, Batch Normalization and Programming Frameworks

3.1 Searching for Hyperparameters

Try random values: don't use a grid
Coarse to fine

3.4 Batch Normalization

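A question: inputs can be normalized in regression; can something similar be done inside a neural network?

Through gamma and beta, the mean and variance of the normalized values can be controlled.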
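
A minimal sketch of the batch normalization forward pass at training time, showing how gamma and beta set the standard deviation and mean of each unit's output. The function name `batch_norm_forward`, the shapes, and the toy values are assumptions; running averages for test time are not covered.

```python
import numpy as np

def batch_norm_forward(Z, gamma, beta, eps=1e-8):
    """Normalize each row (unit) of Z over the mini-batch, then scale by gamma and shift by beta."""
    mu = np.mean(Z, axis=1, keepdims=True)
    var = np.var(Z, axis=1, keepdims=True)
    Z_norm = (Z - mu) / np.sqrt(var + eps)     # zero mean, unit variance per unit
    return gamma * Z_norm + beta               # learnable scale and shift

rng = np.random.default_rng(0)
Z = rng.standard_normal((3, 64)) * 5.0 + 2.0   # 3 units, mini-batch of 64 examples
gamma = np.array([[1.0], [2.0], [0.5]])
beta = np.array([[0.0], [1.0], [-3.0]])
Z_tilde = batch_norm_forward(Z, gamma, beta)
```

After the transform, each row of `Z_tilde` has mean close to the corresponding `beta` and standard deviation close to the corresponding `gamma`, whatever the scale of the incoming `Z`.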

3.6 Why Batch Normalization Works

By normalizing values to a similar range, it speeds up learning
Batch normalization reduces the problem of the input values (to each layer) changing
Has a slight regularization effect (like dropout, it adds some noise to each hidden layer's activations)