VisualPytorch is published at the following domain names + two servers:

http://nag.visualpytorch.top/static/ (maps to 114.115.148.27)

http://visualpytorch.top/static/ (maps to 39.97.209.22)
\(E(XY)=E(X)E(Y)\)
\(D(X)=E(X^2)-E(X)^2\)
\(D(X+Y)=D(X)+D(Y)\)
\(\Longrightarrow D(XY) = D(X)D(Y)+D(X)E(Y)^2+D(Y)E(X)^2=D(X)D(Y)\) (the last equality holds when \(E(X)=E(Y)=0\))

\(H_{11}=\sum_{i=0}^n X_i*W_{1i}\)

\(\Longrightarrow D(H_{11})=\sum_{i=0}^n D(X_i)*D(W_{1i})=n*1*1=n\)

\(std(H_{11})=\sqrt n\)

To keep \(D(H_1)=nD(X)D(W)=1\), we need

\(\Longrightarrow D(W)=\frac{1}{n}\)
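Before the full multi-layer experiment below, a quick numerical check of this result (a minimal sketch, not part of the original code), simulating one linear layer with standard-normal inputs and weights:

```python
import torch

n = 256                                    # number of input neurons
x = torch.randn(10000, n)                  # X ~ N(0, 1), so D(X) = 1
w = torch.randn(n, n)                      # W ~ N(0, 1), so D(W) = 1

h = x @ w                                  # H_1j = sum_i X_i * W_ji

print(h.std().item())                      # roughly sqrt(256) = 16
print((x @ (w / n ** 0.5)).std().item())   # scaling W's std by sqrt(1/n) brings the output std back to ~1
```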
```python
import os
import torch
import random
import numpy as np
import torch.nn as nn
from common_tools import set_seed

set_seed(3)  # set random seed


class MLP(nn.Module):
    def __init__(self, neural_num, layers):
        super(MLP, self).__init__()
        self.linears = nn.ModuleList([nn.Linear(neural_num, neural_num, bias=False) for i in range(layers)])
        self.neural_num = neural_num

    def forward(self, x):
        for (i, linear) in enumerate(self.linears):
            x = linear(x)
            # x = torch.relu(x)

            print("layer:{}, std:{}".format(i, x.std()))
            if torch.isnan(x.std()):
                print("output is nan in {} layers".format(i))
                break

        return x

    def initialize(self):
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.normal_(m.weight.data, std=1)


layer_nums = 100
neural_nums = 256
batch_size = 16

net = MLP(neural_nums, layer_nums)
net.initialize()

inputs = torch.randn((batch_size, neural_nums))  # normal: mean=0, std=1

output = net(inputs)
print(output)
```
Initializing W from a standard normal distribution (mean 0, std 1) produces the gradient explosion shown below. As expected, the std grows by a factor of roughly \(\sqrt{256}=16\) per layer. Setting the std of W to np.sqrt(1/self.neural_num) keeps the scale normal.
```
layer:0, std:16.0981502532959
layer:1, std:253.29345703125
layer:2, std:3982.99951171875
...
layer:30, std:2.2885405881461517e+37
layer:31, std:nan
output is nan in 31 layers
tensor([[ 4.9907e+37,        -inf,         inf,  ...,         inf,        -inf,         inf],
        [       -inf,         inf,  2.1733e+38,  ...,  9.1766e+37, -4.5777e+37,  3.3680e+37],
        [ 1.4215e+38,        -inf,         inf,  ...,        -inf,         inf,         inf],
        ...,
        [-9.2355e+37, -9.9121e+37, -3.7809e+37,  ...,  4.6074e+37,  2.2305e+36,  1.2982e+38],
        [       -inf,         inf,        -inf,  ...,        -inf, -2.2394e+38,  2.0295e+36],
        [       -inf,         inf,  2.1518e+38,  ...,        -inf,  1.6132e+38,        -inf]],
       grad_fn=<MmBackward>)
```
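A minimal sketch of the fix mentioned above: only initialize() of the MLP defined earlier is changed so that the weight std is np.sqrt(1/self.neural_num).

```python
def initialize(self):
    for m in self.modules():
        if isinstance(m, nn.Linear):
            # D(W) = 1/n  =>  std(W) = sqrt(1/n), so D(H) = n * D(X) * D(W) stays at 1
            nn.init.normal_(m.weight.data, std=np.sqrt(1 / self.neural_num))
```

With this change (and no activation in forward) the printed stds stay on the order of 1 through all 100 layers.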
For different activation functions, the initialization std of W also differs; the goal is to keep the data scale in an appropriate range, usually with a variance of 1.
Sigmoid, tanh ------ Xavier: \(D(W)=\frac{2}{n_i+n_{i+1}}\)

ReLU ------ Kaiming:

\(D(W)=\frac{2}{n_i}\)

\(D(W)=\frac{2}{(1+a^2) n_i}\) for LeakyReLU (a is the negative slope)
```python
a = np.sqrt(6 / (self.neural_num + self.neural_num))
tanh_gain = nn.init.calculate_gain('tanh')
a *= tanh_gain
nn.init.uniform_(m.weight.data, -a, a)
# equivalent to: nn.init.xavier_uniform_(m.weight.data, gain=tanh_gain)

nn.init.normal_(m.weight.data, std=np.sqrt(2 / self.neural_num))
# equivalent to: nn.init.kaiming_normal_(m.weight.data)
```
Xavier uniform distribution

Xavier normal distribution

Kaiming uniform distribution

Kaiming normal distribution

Uniform distribution

Normal distribution

Constant initialization

Orthogonal matrix initialization

Identity matrix initialization

Sparse matrix initialization
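A short sketch (not from the original notes) showing how the ten methods above are called; the tensor w is just a placeholder weight.

```python
import torch
import torch.nn as nn

w = torch.empty(3, 5)

nn.init.xavier_uniform_(w, gain=nn.init.calculate_gain('tanh'))
nn.init.xavier_normal_(w, gain=1.0)
nn.init.kaiming_uniform_(w, mode='fan_in', nonlinearity='relu')
nn.init.kaiming_normal_(w, mode='fan_in', nonlinearity='relu')
nn.init.uniform_(w, a=-0.1, b=0.1)
nn.init.normal_(w, mean=0.0, std=0.01)
nn.init.constant_(w, 0.3)
nn.init.orthogonal_(w)
nn.init.eye_(w)                     # identity-style fill, 2D tensors only
nn.init.sparse_(w, sparsity=0.9)    # fraction of elements per column set to zero
```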
```python
nn.init.calculate_gain(nonlinearity, param=None)

x = torch.randn(10000)
out = torch.tanh(x)

gain = x.std() / out.std()                   # 1.5909514427185059
tanh_gain = nn.init.calculate_gain('tanh')   # 1.6666666666666667
# i.e. each pass through tanh shrinks the std of x by a factor of about 1.6
```
Main function: computes the gain of an activation function, i.e. the scale by which it changes the variance.
Main parameters: nonlinearity (name of the activation function); param (optional parameter of the activation, e.g. the negative slope of leaky_relu).
Loss Function: measures the discrepancy between the model output and the ground-truth label.

\(Loss=f(\hat y , y)\)

Cost Function:

\(Cost = \frac{1}{N}\sum_i^N f(\hat y_i, y_i)\)

Objective Function:

\(Obj=Cost+Regularization\)
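A minimal sketch (assuming a generic model and criterion, not from the original notes) of how the objective decomposes into a data cost plus an L2 regularization term; in practice the same effect is usually obtained via the weight_decay argument of the optimizer.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)             # placeholder model
criterion = nn.CrossEntropyLoss()    # cost: mean loss over the batch

inputs = torch.randn(4, 10)
labels = torch.tensor([0, 1, 1, 0])

lambda_l2 = 1e-4                     # hypothetical regularization strength

cost = criterion(model(inputs), labels)
regularization = sum((p ** 2).sum() for p in model.parameters())

objective = cost + lambda_l2 * regularization   # Obj = Cost + Regularization
objective.backward()
```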
The call chain of nn.CrossEntropyLoss:

Use step-into debugging to enter the function from the statement loss_functoin = nn.CrossEntropyLoss(), observe which classes are traversed from nn.CrossEntropyLoss() to class Module(object), and record all the classes and functions entered along the way.
CrossEntropyLoss.__init__
: super(CrossEntropyLoss, self).__init__
_WeightedLoss.__init__
: super(_WeightedLoss,self).__init__
_Loss.__init__
: super(_Loss, self).__init__()
```python
def __init__(self, size_average=None, reduce=None, reduction='mean'):  # the first two arguments are deprecated
    super(_Loss, self).__init__()
    if size_average is not None or reduce is not None:
        self.reduction = _Reduction.legacy_get_string(size_average, reduce)
    else:
        self.reduction = reduction
```
Module.__init__
: _construct
Function: combines nn.LogSoftmax() and nn.NLLLoss() to compute the cross-entropy loss.
Main parameters: weight (weight assigned to the loss of each class), ignore_index (a class index to be ignored), reduction (computation mode: none / sum / mean).
```python
inputs = torch.tensor([[1, 2], [1, 3], [1, 3]], dtype=torch.float)
target = torch.tensor([0, 1, 1], dtype=torch.long)

# --------- CrossEntropy loss: reduction ------------
# def loss function
loss_f_none = nn.CrossEntropyLoss(weight=None, reduction='none')
loss_f_sum = nn.CrossEntropyLoss(weight=None, reduction='sum')
loss_f_mean = nn.CrossEntropyLoss(weight=None, reduction='mean')

# forward
loss_none = loss_f_none(inputs, target)   # tensor([1.3133, 0.1269, 0.1269])
loss_sum = loss_f_sum(inputs, target)     # tensor(1.5671)
loss_mean = loss_f_mean(inputs, target)   # tensor(0.5224)

'''
With:
    weights = torch.tensor([1, 2], dtype=torch.float)
the results become:
    tensor([1.3133, 0.2539, 0.2539])
    tensor(1.8210)
    tensor(0.3642)
The last two samples belong to class 1 and therefore get weight 2
(the mean is a weighted mean: 1.8210 / (1 + 2 + 2) = 0.3642)
'''
```
Function: implements the negative sign in the negative log-likelihood.
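A small sketch (reusing the inputs/target from the CrossEntropyLoss example above) showing that nn.NLLLoss only applies the weighted negative sign to already log-transformed inputs, so LogSoftmax + NLLLoss reproduces CrossEntropyLoss:

```python
import torch
import torch.nn as nn

inputs = torch.tensor([[1, 2], [1, 3], [1, 3]], dtype=torch.float)
target = torch.tensor([0, 1, 1], dtype=torch.long)

log_probs = nn.LogSoftmax(dim=1)(inputs)

loss_nll = nn.NLLLoss(reduction='none')(log_probs, target)
loss_ce = nn.CrossEntropyLoss(reduction='none')(inputs, target)

print(loss_nll)   # tensor([1.3133, 0.1269, 0.1269])
print(loss_ce)    # identical to loss_nll

# applied directly to the raw inputs, NLLLoss just picks -x[class]:
print(nn.NLLLoss(reduction='none')(inputs, target))   # tensor([-1., -3., -3.])
```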
Function: binary cross-entropy.

Note: the input values must lie in [0, 1].
```python
inputs = torch.tensor([[1, 2], [2, 2], [3, 4], [4, 5]], dtype=torch.float)
target = torch.tensor([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=torch.float)
# note: each of the two output neurons gets its own loss term

target_bce = target

# itarget
inputs = torch.sigmoid(inputs)

weights = torch.tensor([1, 1], dtype=torch.float)

loss_f_none_w = nn.BCELoss(weight=weights, reduction='none')
loss_f_sum = nn.BCELoss(weight=weights, reduction='sum')
loss_f_mean = nn.BCELoss(weight=weights, reduction='mean')

# forward
loss_none_w = loss_f_none_w(inputs, target_bce)
loss_sum = loss_f_sum(inputs, target_bce)
loss_mean = loss_f_mean(inputs, target_bce)

'''
BCE Loss
tensor([[0.3133, 2.1269],
        [0.1269, 2.1269],
        [3.0486, 0.0181],
        [4.0181, 0.0067]])
tensor(11.7856)
tensor(1.4732)
'''
```
Function: combines Sigmoid with binary cross-entropy.

Note: do not add a sigmoid at the end of the network.

Additional parameter: pos_weight, the weight assigned to positive samples.
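A small sketch (reusing the inputs/target from the BCELoss example, here on raw logits without the explicit sigmoid) showing that nn.BCEWithLogitsLoss matches sigmoid + nn.BCELoss, and how pos_weight scales the positive-label terms:

```python
import torch
import torch.nn as nn

inputs = torch.tensor([[1, 2], [2, 2], [3, 4], [4, 5]], dtype=torch.float)   # raw logits, no sigmoid
target = torch.tensor([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=torch.float)

loss_logits = nn.BCEWithLogitsLoss(reduction='none')(inputs, target)
loss_bce = nn.BCELoss(reduction='none')(torch.sigmoid(inputs), target)
print(torch.allclose(loss_logits, loss_bce))   # True

# pos_weight multiplies the loss of the positive (y=1) targets
pos_w = torch.tensor([3.])                     # hypothetical weight shared by both output neurons
loss_pw = nn.BCEWithLogitsLoss(reduction='none', pos_weight=pos_w)(inputs, target)
print(loss_pw)                                 # entries with target 1 are 3x larger than in loss_logits
```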
For the parameters and how to call them, see:

Below, only the purpose and expression of each loss function are listed.

Note: when a subscript n appears in a formula below, the loss is computed element-wise over all neurons.
Loss function | Purpose | Expression |
---|---|---|
CrossEntropyLoss | combines LogSoftmax and NLLLoss to compute the cross entropy | \(weight[class]\,(-x[class]+\log\sum_j e^{x[j]})\) |
NLLLoss | the negative sign of the negative log-likelihood | \(-\omega_{y_n} x_{n,y_n}\) |
BCELoss | binary cross-entropy | \(-\omega_n [y_n\log x_n+(1-y_n)\log(1-x_n)]\) |
BCEWithLogitsLoss | Sigmoid combined with binary cross-entropy | \(-\omega_n [y_n\log\sigma(x_n)+(1-y_n)\log(1-\sigma(x_n))]\) |
L1Loss | absolute difference | \(\vert x_n-y_n\vert\) |
MSELoss | squared difference | \((x_n-y_n)^2\) |
SmoothL1Loss | smoothed version of L1 | \(\begin{cases} 0.5(x_n-y_n)^2 & \vert x_n-y_n\vert < 1 \\ \vert x_n-y_n\vert-0.5 & \text{otherwise} \end{cases}\) |
PoissonNLLLoss | negative log-likelihood loss of a Poisson distribution | \(\begin{cases} e^{x_n}-y_n x_n & \text{log\_input}=\text{True} \\ x_n-y_n\log(x_n+eps) & \text{otherwise} \end{cases}\) |
KLDivLoss | KL divergence (relative entropy); x must already be log-probabilities | \(y_n(\log y_n-x_n)\) |
MarginRankingLoss | similarity of two vectors, used for ranking tasks | \(max(0, -y(x^{(1)}-x^{(2)})+margin)\) |
MultiLabelMarginLoss | multi-label margin loss | \(\sum_j^{len(y)}\sum_{i\neq y_j}^{len(x)}\frac{max(0, 1-(x_{y_j}-x_i))}{len(x)}\) |
SoftMarginLoss | two-class logistic loss | \(\frac{1}{len(x)}\sum_i \log(1+e^{-y_i x_i})\) |
MultiLabelSoftMarginLoss | multi-label version of the above | \(-\frac{1}{C}\sum_i \left[y_i\log\frac{1}{1+e^{-x_i}}+(1-y_i)\log\frac{e^{-x_i}}{1+e^{-x_i}}\right]\) |
MultiMarginLoss | multi-class hinge loss | \(\frac{1}{len(x)}\sum_i max(0, margin-x_y+x_i)^p\) |
TripletMarginLoss | triplet loss, commonly used in face verification | \(max(d(a_i, p_i)-d(a_i,n_i)+margin, 0)\) |
HingeEmbeddingLoss | similarity of two inputs, commonly used for nonlinear embeddings and semi-supervised learning | \(\begin{cases} x_n & y_n=1 \\ max(0, margin-x_n) & y_n=-1 \end{cases}\) |
CosineEmbeddingLoss | cosine similarity | \(\begin{cases} 1-\cos(x_1,x_2) & y=1 \\ max(0,\cos(x_1,x_2)-margin) & y=-1 \end{cases}\) |
CTCLoss | CTC (Connectionist Temporal Classification) loss for sequence data | see the article "CTC loss 理解" |
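As a quick check of a few entries in the table (a sketch, not part of the original notes), the elementwise losses can be reproduced by hand:

```python
import torch
import torch.nn as nn

x = torch.tensor([0.3, 1.5, -2.0])
y = torch.tensor([0.0, 1.0, 1.0])

# L1Loss: |x_n - y_n|
print(nn.L1Loss(reduction='none')(x, y), (x - y).abs())

# MSELoss: (x_n - y_n)^2
print(nn.MSELoss(reduction='none')(x, y), (x - y) ** 2)

# SmoothL1Loss: 0.5*(x_n-y_n)^2 if |x_n-y_n| < 1, else |x_n-y_n| - 0.5
diff = (x - y).abs()
manual = torch.where(diff < 1, 0.5 * diff ** 2, diff - 0.5)
print(nn.SmoothL1Loss(reduction='none')(x, y), manual)
```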
PyTorch optimizers: manage and update the learnable parameters of a model so that the model output moves closer to the ground-truth labels.

Derivative: the rate of change of a function along a specified coordinate axis.

Directional derivative: the rate of change along a specified direction.

Gradient: a vector whose direction is the direction in which the directional derivative attains its maximum.
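A tiny sketch (using the same function y = (2x)^2 that appears in the learning-rate experiments below) of how autograd produces the gradient that the optimizer then uses:

```python
import torch

x = torch.tensor([2.], requires_grad=True)
y = torch.pow(2 * x, 2)    # y = (2x)^2 = 4*x^2, dy/dx = 8x

y.backward()
print(x.grad)              # tensor([16.]) = 8 * 2
```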
Basic attributes of the Optimizer class:

defaults: the optimizer's hyperparameters

state: cached state of the parameters, e.g. the momentum buffers

param_groups: the managed parameter groups, a list of dicts

_step_count: the number of updates performed, used by learning-rate schedulers
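A brief sketch (not from the original notes) that inspects these attributes on a freshly built SGD optimizer:

```python
import torch
import torch.optim as optim

w = torch.randn(2, 2, requires_grad=True)
optimizer = optim.SGD([w], lr=0.1, momentum=0.9)

print(optimizer.defaults)        # hyperparameters such as lr, momentum, weight_decay, ...
print(optimizer.param_groups)    # list with one dict; its 'params' entry holds w itself
print(optimizer.state)           # empty until step() is called

w.grad = torch.ones(2, 2)
optimizer.step()
print(optimizer.state[w])        # now contains the 'momentum_buffer' for w
```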
```python
# ========================= step 4/5 optimizer ==========================
optimizer = optim.SGD(net.parameters(), lr=LR, momentum=0.9)                      # choose the optimizer
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)   # set the learning-rate decay policy

# ============================ step 5/5 training ========================
for epoch in range(MAX_EPOCH):
    ...
    for i, data in enumerate(train_loader):
        ...
        # backward
        optimizer.zero_grad()
        loss = criterion(outputs, labels)
        loss.backward()

        # update weights
        optimizer.step()
```
Basic methods of the Optimizer class:

zero_grad(): clears the gradients of all managed parameters (PyTorch does not clear tensor gradients automatically; they accumulate)

step(): performs a single update step

add_param_group(): adds a parameter group

state_dict(): returns a dict with the optimizer's current state

load_state_dict(): loads a state dict
```python
weight = torch.randn((2, 2), requires_grad=True)
weight.grad = torch.ones((2, 2))

optimizer = optim.SGD([weight], lr=0.1)

# --------------------------- step -----------------------------------
optimizer.step()        # change lr to 1 or 0.1 and observe the results
'''
weight before step:tensor([[0.6614, 0.2669],
        [0.0617, 0.6213]])
weight after step:tensor([[ 0.5614,  0.1669],
        [-0.0383,  0.5213]])
'''

# ------------------------- zero_grad --------------------------------
print("weight in optimizer:{}\nweight in weight:{}\n".format(id(optimizer.param_groups[0]['params'][0]), id(weight)))
print("weight.grad is {}\n".format(weight.grad))
optimizer.zero_grad()
'''
weight in optimizer:1598931466208
weight in weight:1598931466208      # same id as above, so params stores references (pointers) to the data
weight.grad is tensor([[1., 1.],
        [1., 1.]])
after optimizer.zero_grad(), weight.grad is tensor([[0., 0.],
        [0., 0.]])
'''

# --------------------- add_param_group --------------------------
print("optimizer.param_groups is\n{}".format(optimizer.param_groups))
w2 = torch.randn((3, 3), requires_grad=True)
optimizer.add_param_group({"params": w2, 'lr': 0.0001})
print("optimizer.param_groups is\n{}".format(optimizer.param_groups))

# ------------------- state_dict ----------------------
optimizer = optim.SGD([weight], lr=0.1, momentum=0.9)
opt_state_dict = optimizer.state_dict()
for i in range(10):
    optimizer.step()
torch.save(optimizer.state_dict(), os.path.join(BASE_DIR, "optimizer_state_dict.pkl"))

# ----------------------- load state_dict ---------------------------
optimizer = optim.SGD([weight], lr=0.1, momentum=0.9)
state_dict = torch.load(os.path.join(BASE_DIR, "optimizer_state_dict.pkl"))
optimizer.load_state_dict(state_dict)
```
Gradient descent:

\(w_{i+1} = w_i - g(w_i)\)

\(w_{i+1} = w_i - LR * g(w_i)\)

The learning rate (LR) controls the size of the update step.

Train with different learning rates; note that gradient explosion appears when lr > 0.3.
```python
import torch
import numpy as np
import matplotlib.pyplot as plt

def func(x):
    return torch.pow(2 * x, 2)    # y = (2x)^2 = 4*x^2, dy/dx = 8x

iteration = 100
num_lr = 10
lr_min, lr_max = 0.01, 0.2    # try .5 .3 .2

lr_list = np.linspace(lr_min, lr_max, num=num_lr).tolist()
loss_rec = [[] for l in range(len(lr_list))]
iter_rec = list()

for i, lr in enumerate(lr_list):
    x = torch.tensor([2.], requires_grad=True)
    for iter in range(iteration):
        y = func(x)
        y.backward()
        x.data.sub_(lr * x.grad)    # x.data -= lr * x.grad
        x.grad.zero_()

        loss_rec[i].append(y.item())

for i, loss_r in enumerate(loss_rec):
    plt.plot(range(len(loss_r)), loss_r, label="LR: {}".format(lr_list[i]))
plt.legend()
plt.xlabel('Iterations')
plt.ylabel('Loss value')
plt.show()
```
Momentum: combines the current gradient with information from previous updates and uses the result for the current update.

Exponentially weighted average: $$v_N=\beta v_{N-1}+(1-\beta)\theta_N = \sum_{i}^{N}(1-\beta)\beta^{i}\theta_{N-i}$$

Update rule used in PyTorch:
\(v_i=mv_{i-1}+g(w_i)\)
\(w_{i+1}=w_i-lr*v_i\)
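A small sketch (not from the original notes) that checks this update rule against optim.SGD by running both on the same scalar parameter:

```python
import torch
import torch.optim as optim

lr, m = 0.1, 0.9

# optim.SGD with momentum
x = torch.tensor([2.], requires_grad=True)
optimizer = optim.SGD([x], lr=lr, momentum=m)

# manual implementation of  v_i = m * v_{i-1} + g(w_i);  w_{i+1} = w_i - lr * v_i
w, v = 2.0, 0.0

for _ in range(5):
    y = torch.pow(2 * x, 2)    # y = (2x)^2, dy/dx = 8x
    y.backward()
    optimizer.step()
    optimizer.zero_grad()

    g = 8 * w                  # analytic gradient
    v = m * v + g
    w = w - lr * v

    print(x.item(), w)         # the two trajectories coincide (up to float precision)
```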
1. optim.SGD

Main parameters:

• params: the parameters (parameter groups) to be managed

• lr: initial learning rate

• momentum: momentum coefficient (the β above)

• weight_decay: L2 regularization coefficient

• nesterov: whether to use NAG (Nesterov accelerated gradient)
```python
def func(x):
    return torch.pow(2 * x, 2)    # y = (2x)^2 = 4*x^2, dy/dx = 8x

iteration = 100
m = 0.9     # try .9 .63

lr_list = [0.01, 0.03]
momentum_list = list()
loss_rec = [[] for l in range(len(lr_list))]
iter_rec = list()

for i, lr in enumerate(lr_list):
    x = torch.tensor([2.], requires_grad=True)

    momentum = 0. if lr == 0.03 else m
    momentum_list.append(momentum)

    optimizer = optim.SGD([x], lr=lr, momentum=momentum)

    for iter in range(iteration):
        y = func(x)
        y.backward()

        optimizer.step()
        optimizer.zero_grad()

        loss_rec[i].append(y.item())
```
The curves above show a "spring" (oscillation) effect: even where the loss is close to 0 there is still a large momentum, so the momentum should be reduced appropriately.
For the other nine optimizers, see PyTorch 學習筆記(七): PyTorch的十個優化器.