VisualPytorch is published at the following domain names + two servers:

http://nag.visualpytorch.top/static/ (maps to 114.115.148.27)

http://visualpytorch.top/static/ (maps to 39.97.209.22)
\(E(XY)=E(X)E(Y)\)
\(D(X)=E(X^2)-E(X)^2\)
\(D(X+Y)=D(X)+D(Y)\)
\(\Longrightarrow D(XY) = D(X)D(Y)+D(X)E(Y)^2+D(Y)E(X)^2=D(X)D(Y)\) (the last equality holds when \(E(X)=E(Y)=0\))

\(H_{11}=\sum_{i=0}^n X_i*W_{1i}\)

\(\Longrightarrow D(H_{11})=\sum_{i=0}^n D(X_i)*D(W_{1i})=n*1*1=n\)

\(std(H_{11})=\sqrt n\)

To keep \(D(H_1)=nD(X)D(W)=1\), we need

\(\Longrightarrow D(W)=\frac{1}{n}\)
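Before the full multi-layer experiment below, a quick numerical check of this result (a minimal sketch, not part of the original code), simulating one linear layer with standard-normal inputs and weights:

```python
import torch

n = 256                                    # number of input neurons
x = torch.randn(10000, n)                  # X ~ N(0, 1), so D(X) = 1
w = torch.randn(n, n)                      # W ~ N(0, 1), so D(W) = 1

h = x @ w                                  # H_1j = sum_i X_i * W_ji

print(h.std().item())                      # roughly sqrt(256) = 16
print((x @ (w / n ** 0.5)).std().item())   # scaling W's std by sqrt(1/n) brings the output std back to ~1
```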
```python
import os
import torch
import random
import numpy as np
import torch.nn as nn
from common_tools import set_seed

set_seed(3)  # set random seed


class MLP(nn.Module):
    def __init__(self, neural_num, layers):
        super(MLP, self).__init__()
        self.linears = nn.ModuleList([nn.Linear(neural_num, neural_num, bias=False) for i in range(layers)])
        self.neural_num = neural_num

    def forward(self, x):
        for (i, linear) in enumerate(self.linears):
            x = linear(x)
            # x = torch.relu(x)

            print("layer:{}, std:{}".format(i, x.std()))
            if torch.isnan(x.std()):
                print("output is nan in {} layers".format(i))
                break

        return x

    def initialize(self):
        for m in self.modules():
            if isinstance(m, nn.Linear):
                nn.init.normal_(m.weight.data, std=1)


layer_nums = 100
neural_nums = 256
batch_size = 16

net = MLP(neural_nums, layer_nums)
net.initialize()

inputs = torch.randn((batch_size, neural_nums))  # normal: mean=0, std=1

output = net(inputs)
print(output)
```
Initializing W from a standard normal distribution (mean 0, std 1) produces the gradient explosion shown below. As expected, the std grows by a factor of roughly \(\sqrt{256}=16\) per layer. Setting the std of W to np.sqrt(1/self.neural_num) keeps the scale normal.
```
layer:0, std:16.0981502532959
layer:1, std:253.29345703125
layer:2, std:3982.99951171875
...
layer:30, std:2.2885405881461517e+37
layer:31, std:nan
output is nan in 31 layers
tensor([[ 4.9907e+37,        -inf,         inf,  ...,         inf,        -inf,         inf],
        [       -inf,         inf,  2.1733e+38,  ...,  9.1766e+37, -4.5777e+37,  3.3680e+37],
        [ 1.4215e+38,        -inf,         inf,  ...,        -inf,         inf,         inf],
        ...,
        [-9.2355e+37, -9.9121e+37, -3.7809e+37,  ...,  4.6074e+37,  2.2305e+36,  1.2982e+38],
        [       -inf,         inf,        -inf,  ...,        -inf, -2.2394e+38,  2.0295e+36],
        [       -inf,         inf,  2.1518e+38,  ...,        -inf,  1.6132e+38,        -inf]],
       grad_fn=<MmBackward>)
```
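A minimal sketch of the fix mentioned above: only initialize() of the MLP defined earlier is changed so that the weight std is np.sqrt(1/self.neural_num).

```python
def initialize(self):
    for m in self.modules():
        if isinstance(m, nn.Linear):
            # D(W) = 1/n  =>  std(W) = sqrt(1/n), so D(H) = n * D(X) * D(W) stays at 1
            nn.init.normal_(m.weight.data, std=np.sqrt(1 / self.neural_num))
```

With this change (and no activation in forward) the printed stds stay on the order of 1 through all 100 layers.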
For different activation functions, the initialization std of W also differs; the goal is to keep the data scale in an appropriate range, usually with a variance of 1.
Sigmoid, tanh ------ Xavier: \(D(W)=\frac{2}{n_i+n_{i+1}}\)

ReLU ------ Kaiming:

\(D(W)=\frac{2}{n_i}\)

\(D(W)=\frac{2}{(1+a^2) n_i}\) for LeakyReLU (a is the negative slope)
```python
a = np.sqrt(6 / (self.neural_num + self.neural_num))
tanh_gain = nn.init.calculate_gain('tanh')
a *= tanh_gain
nn.init.uniform_(m.weight.data, -a, a)
# equivalent to: nn.init.xavier_uniform_(m.weight.data, gain=tanh_gain)

nn.init.normal_(m.weight.data, std=np.sqrt(2 / self.neural_num))
# equivalent to: nn.init.kaiming_normal_(m.weight.data)
```
Xavier uniform distribution

Xavier normal distribution

Kaiming uniform distribution

Kaiming normal distribution

Uniform distribution

Normal distribution

Constant initialization

Orthogonal matrix initialization

Identity matrix initialization

Sparse matrix initialization
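A short sketch (not from the original notes) showing how the ten methods above are called; the tensor w is just a placeholder weight.

```python
import torch
import torch.nn as nn

w = torch.empty(3, 5)

nn.init.xavier_uniform_(w, gain=nn.init.calculate_gain('tanh'))
nn.init.xavier_normal_(w, gain=1.0)
nn.init.kaiming_uniform_(w, mode='fan_in', nonlinearity='relu')
nn.init.kaiming_normal_(w, mode='fan_in', nonlinearity='relu')
nn.init.uniform_(w, a=-0.1, b=0.1)
nn.init.normal_(w, mean=0.0, std=0.01)
nn.init.constant_(w, 0.3)
nn.init.orthogonal_(w)
nn.init.eye_(w)                     # identity-style fill, 2D tensors only
nn.init.sparse_(w, sparsity=0.9)    # fraction of elements per column set to zero
```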
```python
nn.init.calculate_gain(nonlinearity, param=None)

x = torch.randn(10000)
out = torch.tanh(x)

gain = x.std() / out.std()                   # 1.5909514427185059
tanh_gain = nn.init.calculate_gain('tanh')   # 1.6666666666666667
# i.e. each pass through tanh shrinks the std of x by a factor of about 1.6
```
Main function: computes the gain of an activation function, i.e. the scale by which it changes the variance.
Main parameters: nonlinearity (name of the activation function); param (optional parameter of the activation, e.g. the negative slope of leaky_relu).
Loss Function: measures the discrepancy between the model output and the ground-truth label.

\(Loss=f(\hat y , y)\)

Cost Function:

\(Cost = \frac{1}{N}\sum_i^N f(\hat y_i, y_i)\)

Objective Function:

\(Obj=Cost+Regularization\)
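A minimal sketch (assuming a generic model and criterion, not from the original notes) of how the objective decomposes into a data cost plus an L2 regularization term; in practice the same effect is usually obtained via the weight_decay argument of the optimizer.

```python
import torch
import torch.nn as nn

model = nn.Linear(10, 2)             # placeholder model
criterion = nn.CrossEntropyLoss()    # cost: mean loss over the batch

inputs = torch.randn(4, 10)
labels = torch.tensor([0, 1, 1, 0])

lambda_l2 = 1e-4                     # hypothetical regularization strength

cost = criterion(model(inputs), labels)
regularization = sum((p ** 2).sum() for p in model.parameters())

objective = cost + lambda_l2 * regularization   # Obj = Cost + Regularization
objective.backward()
```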
The call chain of nn.CrossEntropyLoss:

Use step-into debugging to enter the function from the statement loss_functoin = nn.CrossEntropyLoss(), observe which classes are traversed from nn.CrossEntropyLoss() to class Module(object), and record all the classes and functions entered along the way.
CrossEntropyLoss.__init__
: super(CrossEntropyLoss, self).__init__
_WeightedLoss.__init__
: super(_WeightedLoss,self).__init__
_Loss.__init__
: super(_Loss, self).__init__()
```python
def __init__(self, size_average=None, reduce=None, reduction='mean'):  # the first two arguments are deprecated
    super(_Loss, self).__init__()
    if size_average is not None or reduce is not None:
        self.reduction = _Reduction.legacy_get_string(size_average, reduce)
    else:
        self.reduction = reduction
```
Module.__init__
: _construct
Function: combines nn.LogSoftmax() and nn.NLLLoss() to compute the cross-entropy loss.
Main parameters: weight (weight assigned to the loss of each class), ignore_index (a class index to be ignored), reduction (computation mode: none / sum / mean).
```python
inputs = torch.tensor([[1, 2], [1, 3], [1, 3]], dtype=torch.float)
target = torch.tensor([0, 1, 1], dtype=torch.long)

# --------- CrossEntropy loss: reduction ------------
# def loss function
loss_f_none = nn.CrossEntropyLoss(weight=None, reduction='none')
loss_f_sum = nn.CrossEntropyLoss(weight=None, reduction='sum')
loss_f_mean = nn.CrossEntropyLoss(weight=None, reduction='mean')

# forward
loss_none = loss_f_none(inputs, target)   # tensor([1.3133, 0.1269, 0.1269])
loss_sum = loss_f_sum(inputs, target)     # tensor(1.5671)
loss_mean = loss_f_mean(inputs, target)   # tensor(0.5224)

'''
With:
    weights = torch.tensor([1, 2], dtype=torch.float)
the results become:
    tensor([1.3133, 0.2539, 0.2539])
    tensor(1.8210)
    tensor(0.3642)
The last two samples belong to class 1 and therefore get weight 2
(the mean is a weighted mean: 1.8210 / (1 + 2 + 2) = 0.3642)
'''
```
Function: implements the negative sign in the negative log-likelihood.
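A small sketch (reusing the inputs/target from the CrossEntropyLoss example above) showing that nn.NLLLoss only applies the weighted negative sign to already log-transformed inputs, so LogSoftmax + NLLLoss reproduces CrossEntropyLoss:

```python
import torch
import torch.nn as nn

inputs = torch.tensor([[1, 2], [1, 3], [1, 3]], dtype=torch.float)
target = torch.tensor([0, 1, 1], dtype=torch.long)

log_probs = nn.LogSoftmax(dim=1)(inputs)

loss_nll = nn.NLLLoss(reduction='none')(log_probs, target)
loss_ce = nn.CrossEntropyLoss(reduction='none')(inputs, target)

print(loss_nll)   # tensor([1.3133, 0.1269, 0.1269])
print(loss_ce)    # identical to loss_nll

# applied directly to the raw inputs, NLLLoss just picks -x[class]:
print(nn.NLLLoss(reduction='none')(inputs, target))   # tensor([-1., -3., -3.])
```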
Function: binary cross-entropy.

Note: the input values must lie in [0, 1].
```python
inputs = torch.tensor([[1, 2], [2, 2], [3, 4], [4, 5]], dtype=torch.float)
target = torch.tensor([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=torch.float)
# note: each of the two output neurons gets its own loss term

target_bce = target

# itarget
inputs = torch.sigmoid(inputs)

weights = torch.tensor([1, 1], dtype=torch.float)

loss_f_none_w = nn.BCELoss(weight=weights, reduction='none')
loss_f_sum = nn.BCELoss(weight=weights, reduction='sum')
loss_f_mean = nn.BCELoss(weight=weights, reduction='mean')

# forward
loss_none_w = loss_f_none_w(inputs, target_bce)
loss_sum = loss_f_sum(inputs, target_bce)
loss_mean = loss_f_mean(inputs, target_bce)

'''
BCE Loss
tensor([[0.3133, 2.1269],
        [0.1269, 2.1269],
        [3.0486, 0.0181],
        [4.0181, 0.0067]])
tensor(11.7856)
tensor(1.4732)
'''
```
Function: combines Sigmoid with binary cross-entropy.

Note: do not add a sigmoid at the end of the network.

Additional parameter: pos_weight, the weight assigned to positive samples.
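A small sketch (reusing the inputs/target from the BCELoss example, here on raw logits without the explicit sigmoid) showing that nn.BCEWithLogitsLoss matches sigmoid + nn.BCELoss, and how pos_weight scales the positive-label terms:

```python
import torch
import torch.nn as nn

inputs = torch.tensor([[1, 2], [2, 2], [3, 4], [4, 5]], dtype=torch.float)   # raw logits, no sigmoid
target = torch.tensor([[1, 0], [1, 0], [0, 1], [0, 1]], dtype=torch.float)

loss_logits = nn.BCEWithLogitsLoss(reduction='none')(inputs, target)
loss_bce = nn.BCELoss(reduction='none')(torch.sigmoid(inputs), target)
print(torch.allclose(loss_logits, loss_bce))   # True

# pos_weight multiplies the loss of the positive (y=1) targets
pos_w = torch.tensor([3.])                     # hypothetical weight shared by both output neurons
loss_pw = nn.BCEWithLogitsLoss(reduction='none', pos_weight=pos_w)(inputs, target)
print(loss_pw)                                 # entries with target 1 are 3x larger than in loss_logits
```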
For the parameters and how to call them, see:

Below, only the purpose and expression of each loss function are listed.

Note: when a subscript n appears in a formula below, the loss is computed element-wise over all neurons.
Loss function | Purpose | Expression |
---|---|---|
CrossEntropyLoss | combines LogSoftmax and NLLLoss to compute the cross entropy | \(weight[class]\,(-x[class]+\log\sum_j e^{x[j]})\) |
NLLLoss | the negative sign of the negative log-likelihood | \(-\omega_{y_n} x_{n,y_n}\) |
BCELoss | binary cross-entropy | \(-\omega_n [y_n\log x_n+(1-y_n)\log(1-x_n)]\) |
BCEWithLogitsLoss | Sigmoid combined with binary cross-entropy | \(-\omega_n [y_n\log\sigma(x_n)+(1-y_n)\log(1-\sigma(x_n))]\) |
L1Loss | absolute difference | \(\vert x_n-y_n\vert\) |
MSELoss | squared difference | \((x_n-y_n)^2\) |
SmoothL1Loss | smoothed version of L1 | \(\begin{cases} 0.5(x_n-y_n)^2 & \vert x_n-y_n\vert < 1 \\ \vert x_n-y_n\vert-0.5 & \text{otherwise} \end{cases}\) |
PoissonNLLLoss | negative log-likelihood loss of a Poisson distribution | \(\begin{cases} e^{x_n}-y_n x_n & \text{log\_input}=\text{True} \\ x_n-y_n\log(x_n+eps) & \text{otherwise} \end{cases}\) |
KLDivLoss | KL divergence (relative entropy); x must already be log-probabilities | \(y_n(\log y_n-x_n)\) |
MarginRankingLoss | similarity of two vectors, used for ranking tasks | \(max(0, -y(x^{(1)}-x^{(2)})+margin)\) |
MultiLabelMarginLoss | multi-label margin loss | \(\sum_j^{len(y)}\sum_{i\neq y_j}^{len(x)}\frac{max(0, 1-(x_{y_j}-x_i))}{len(x)}\) |
SoftMarginLoss | two-class logistic loss | \(\frac{1}{len(x)}\sum_i \log(1+e^{-y_i x_i})\) |
MultiLabelSoftMarginLoss | multi-label version of the above | \(-\frac{1}{C}\sum_i \left[y_i\log\frac{1}{1+e^{-x_i}}+(1-y_i)\log\frac{e^{-x_i}}{1+e^{-x_i}}\right]\) |
MultiMarginLoss | multi-class hinge loss | \(\frac{1}{len(x)}\sum_i max(0, margin-x_y+x_i)^p\) |
TripletMarginLoss | triplet loss, commonly used in face verification | \(max(d(a_i, p_i)-d(a_i,n_i)+margin, 0)\) |
HingeEmbeddingLoss | similarity of two inputs, commonly used for nonlinear embeddings and semi-supervised learning | \(\begin{cases} x_n & y_n=1 \\ max(0, margin-x_n) & y_n=-1 \end{cases}\) |
CosineEmbeddingLoss | cosine similarity | \(\begin{cases} 1-\cos(x_1,x_2) & y=1 \\ max(0,\cos(x_1,x_2)-margin) & y=-1 \end{cases}\) |
CTCLoss | CTC (Connectionist Temporal Classification) loss for sequence data | see the article "CTC loss 理解" |
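As a quick check of a few entries in the table (a sketch, not part of the original notes), the elementwise losses can be reproduced by hand:

```python
import torch
import torch.nn as nn

x = torch.tensor([0.3, 1.5, -2.0])
y = torch.tensor([0.0, 1.0, 1.0])

# L1Loss: |x_n - y_n|
print(nn.L1Loss(reduction='none')(x, y), (x - y).abs())

# MSELoss: (x_n - y_n)^2
print(nn.MSELoss(reduction='none')(x, y), (x - y) ** 2)

# SmoothL1Loss: 0.5*(x_n-y_n)^2 if |x_n-y_n| < 1, else |x_n-y_n| - 0.5
diff = (x - y).abs()
manual = torch.where(diff < 1, 0.5 * diff ** 2, diff - 0.5)
print(nn.SmoothL1Loss(reduction='none')(x, y), manual)
```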
PyTorch optimizers: manage and update the learnable parameters of a model so that the model output moves closer to the ground-truth labels.

Derivative: the rate of change of a function along a specified coordinate axis.

Directional derivative: the rate of change along a specified direction.

Gradient: a vector whose direction is the direction in which the directional derivative attains its maximum.
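A tiny sketch (using the same function y = (2x)^2 that appears in the learning-rate experiments below) of how autograd produces the gradient that the optimizer then uses:

```python
import torch

x = torch.tensor([2.], requires_grad=True)
y = torch.pow(2 * x, 2)    # y = (2x)^2 = 4*x^2, dy/dx = 8x

y.backward()
print(x.grad)              # tensor([16.]) = 8 * 2
```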
Basic attributes of the Optimizer class:

defaults: the optimizer's hyperparameters

state: cached state of the parameters, e.g. the momentum buffers

param_groups: the managed parameter groups, a list of dicts

_step_count: the number of updates performed, used by learning-rate schedulers
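A brief sketch (not from the original notes) that inspects these attributes on a freshly built SGD optimizer:

```python
import torch
import torch.optim as optim

w = torch.randn(2, 2, requires_grad=True)
optimizer = optim.SGD([w], lr=0.1, momentum=0.9)

print(optimizer.defaults)        # hyperparameters such as lr, momentum, weight_decay, ...
print(optimizer.param_groups)    # list with one dict; its 'params' entry holds w itself
print(optimizer.state)           # empty until step() is called

w.grad = torch.ones(2, 2)
optimizer.step()
print(optimizer.state[w])        # now contains the 'momentum_buffer' for w
```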
```python
# ========================= step 4/5 optimizer ==========================
optimizer = optim.SGD(net.parameters(), lr=LR, momentum=0.9)                      # choose the optimizer
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)   # set the learning-rate decay policy

# ============================ step 5/5 training ========================
for epoch in range(MAX_EPOCH):
    ...
    for i, data in enumerate(train_loader):
        ...
        # backward
        optimizer.zero_grad()
        loss = criterion(outputs, labels)
        loss.backward()

        # update weights
        optimizer.step()
```
Basic methods of the Optimizer class:

zero_grad(): clears the gradients of all managed parameters (PyTorch does not clear tensor gradients automatically; they accumulate)

step(): performs a single update step

add_param_group(): adds a parameter group

state_dict(): returns a dict with the optimizer's current state

load_state_dict(): loads a state dict
```python
weight = torch.randn((2, 2), requires_grad=True)
weight.grad = torch.ones((2, 2))

optimizer = optim.SGD([weight], lr=0.1)

# --------------------------- step -----------------------------------
optimizer.step()        # change lr to 1 or 0.1 and observe the results
'''
weight before step:tensor([[0.6614, 0.2669],
        [0.0617, 0.6213]])
weight after step:tensor([[ 0.5614,  0.1669],
        [-0.0383,  0.5213]])
'''

# ------------------------- zero_grad --------------------------------
print("weight in optimizer:{}\nweight in weight:{}\n".format(id(optimizer.param_groups[0]['params'][0]), id(weight)))
print("weight.grad is {}\n".format(weight.grad))
optimizer.zero_grad()
'''
weight in optimizer:1598931466208
weight in weight:1598931466208      # same id as above, so params stores references (pointers) to the data
weight.grad is tensor([[1., 1.],
        [1., 1.]])
after optimizer.zero_grad(), weight.grad is tensor([[0., 0.],
        [0., 0.]])
'''

# --------------------- add_param_group --------------------------
print("optimizer.param_groups is\n{}".format(optimizer.param_groups))
w2 = torch.randn((3, 3), requires_grad=True)
optimizer.add_param_group({"params": w2, 'lr': 0.0001})
print("optimizer.param_groups is\n{}".format(optimizer.param_groups))

# ------------------- state_dict ----------------------
optimizer = optim.SGD([weight], lr=0.1, momentum=0.9)
opt_state_dict = optimizer.state_dict()
for i in range(10):
    optimizer.step()
torch.save(optimizer.state_dict(), os.path.join(BASE_DIR, "optimizer_state_dict.pkl"))

# ----------------------- load state_dict ---------------------------
optimizer = optim.SGD([weight], lr=0.1, momentum=0.9)
state_dict = torch.load(os.path.join(BASE_DIR, "optimizer_state_dict.pkl"))
optimizer.load_state_dict(state_dict)
```
Gradient descent:

\(w_{i+1} = w_i - g(w_i)\)

\(w_{i+1} = w_i - LR * g(w_i)\)

The learning rate (LR) controls the size of the update step.

Train with different learning rates; note that gradient explosion appears when lr > 0.3.
```python
import torch
import numpy as np
import matplotlib.pyplot as plt

def func(x):
    return torch.pow(2 * x, 2)    # y = (2x)^2 = 4*x^2, dy/dx = 8x

iteration = 100
num_lr = 10
lr_min, lr_max = 0.01, 0.2    # try .5 .3 .2

lr_list = np.linspace(lr_min, lr_max, num=num_lr).tolist()
loss_rec = [[] for l in range(len(lr_list))]
iter_rec = list()

for i, lr in enumerate(lr_list):
    x = torch.tensor([2.], requires_grad=True)
    for iter in range(iteration):
        y = func(x)
        y.backward()
        x.data.sub_(lr * x.grad)    # x.data -= lr * x.grad
        x.grad.zero_()

        loss_rec[i].append(y.item())

for i, loss_r in enumerate(loss_rec):
    plt.plot(range(len(loss_r)), loss_r, label="LR: {}".format(lr_list[i]))
plt.legend()
plt.xlabel('Iterations')
plt.ylabel('Loss value')
plt.show()
```
Momentum: combines the current gradient with information from previous updates and uses the result for the current update.

Exponentially weighted average: $$v_N=\beta v_{N-1}+(1-\beta)\theta_N = \sum_{i}^{N}(1-\beta)\beta^{i}\theta_{N-i}$$

Update rule used in PyTorch:
\(v_i=mv_{i-1}+g(w_i)\)
\(w_{i+1}=w_i-lr*v_i\)
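A small sketch (not from the original notes) that checks this update rule against optim.SGD by running both on the same scalar parameter:

```python
import torch
import torch.optim as optim

lr, m = 0.1, 0.9

# optim.SGD with momentum
x = torch.tensor([2.], requires_grad=True)
optimizer = optim.SGD([x], lr=lr, momentum=m)

# manual implementation of  v_i = m * v_{i-1} + g(w_i);  w_{i+1} = w_i - lr * v_i
w, v = 2.0, 0.0

for _ in range(5):
    y = torch.pow(2 * x, 2)    # y = (2x)^2, dy/dx = 8x
    y.backward()
    optimizer.step()
    optimizer.zero_grad()

    g = 8 * w                  # analytic gradient
    v = m * v + g
    w = w - lr * v

    print(x.item(), w)         # the two trajectories coincide (up to float precision)
```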
1. optim.SGD

Main parameters:

• params: the parameters (parameter groups) to be managed

• lr: initial learning rate

• momentum: momentum coefficient (the β above)

• weight_decay: L2 regularization coefficient

• nesterov: whether to use NAG (Nesterov accelerated gradient)
```python
def func(x):
    return torch.pow(2 * x, 2)    # y = (2x)^2 = 4*x^2, dy/dx = 8x

iteration = 100
m = 0.9     # try .9 .63

lr_list = [0.01, 0.03]
momentum_list = list()
loss_rec = [[] for l in range(len(lr_list))]
iter_rec = list()

for i, lr in enumerate(lr_list):
    x = torch.tensor([2.], requires_grad=True)

    momentum = 0. if lr == 0.03 else m
    momentum_list.append(momentum)

    optimizer = optim.SGD([x], lr=lr, momentum=momentum)

    for iter in range(iteration):
        y = func(x)
        y.backward()

        optimizer.step()
        optimizer.zero_grad()

        loss_rec[i].append(y.item())
```
The curves above show a "spring" (oscillation) effect: even where the loss is close to 0 there is still a large momentum, so the momentum should be reduced appropriately.
For the other nine optimizers, see PyTorch 學習筆記(七): PyTorch的十個優化器.