PyTorch Part 6: Regularization

VisualPytorch is now live with a domain name and two servers:
http://nag.visualpytorch.top/static/ (server 114.115.148.27)
http://visualpytorch.top/static/ (server 39.97.209.22)

1. Regularization: weight_decay

1. Regularization: a strategy for reducing variance

Generalization error decomposes into the sum of bias, variance, and noise: Error = Bias + Variance + Noise.
Bias measures how far the learner's expected prediction deviates from the ground truth, i.e. the fitting ability of the algorithm itself.
Variance measures how much performance changes when a same-sized training set is perturbed, i.e. the impact of data perturbation.
Noise is the lower bound on the expected generalization error that any algorithm can reach on the task at hand.


Strictly speaking, variance is the variance of the expected predictions over different datasets, not the gap between training-set and test-set loss.

2. Loss function: measuring the gap between model output and the ground-truth label

Loss function: \(Loss = f(\hat y , y)\)
Cost function: \(Cost = \frac{1}{N}\sum_i f(\hat y_i, y_i)\)
Objective function: \(Obj = Cost + Regularization\)


L1 Regularization: \(\sum_i |w_i|\). Because the optimum often lies on a coordinate axis (a vertex of the constraint region), it tends to produce sparse parameters.

L2 Regularization: \(\sum_i w_i^2\). Taking \(Obj = Loss + \frac{\lambda}{2}\sum_i w_i^2\), the gradient step becomes \(w_{i+1} = w_i - \frac{\partial Obj}{\partial w_i} = w_i - (\frac{\partial Loss}{\partial w_i} + \lambda w_i) = w_i(1-\lambda) - \frac{\partial Loss}{\partial w_i}\), which is why L2 regularization is commonly called weight decay.
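The equivalence above can be checked against PyTorch directly. A minimal sketch (not part of the original post; the names w1, w2, lam, lr are ours) verifying that SGD's weight_decay argument behaves like adding \(\frac{\lambda}{2}\sum_i w_i^2\) to the loss:

```python
import torch

torch.manual_seed(0)
w1 = torch.randn(5, requires_grad=True)
w2 = w1.detach().clone().requires_grad_(True)
x = torch.randn(5)
lam, lr = 1e-2, 0.1

# built-in weight decay
opt1 = torch.optim.SGD([w1], lr=lr, weight_decay=lam)
(w1 * x).sum().backward()
opt1.step()

# explicit L2 penalty in the objective
opt2 = torch.optim.SGD([w2], lr=lr)
((w2 * x).sum() + 0.5 * lam * (w2 ** 2).sum()).backward()
opt2.step()

print(torch.allclose(w1, w2))  # True: the two updates coincide
```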

3. Example: a simple three-layer perceptron

Build the model twice, once with weight_decay and once without: as the number of training iterations grows, the loss of the model without the regularization term tends to 0.

import torch
import torch.nn as nn
from torch.utils.tensorboard import SummaryWriter

n_hidden = 200   # hyper-parameters (illustrative values)
max_iter = 2000
lr_init = 0.01

# ============================ step 1/5 data ============================
def gen_data(num_data=10, x_range=(-1, 1)):

    w = 1.5
    train_x = torch.linspace(*x_range, num_data).unsqueeze_(1)
    train_y = w*train_x + torch.normal(0, 0.5, size=train_x.size())
    test_x = torch.linspace(*x_range, num_data).unsqueeze_(1)
    test_y = w*test_x + torch.normal(0, 0.3, size=test_x.size())

    return train_x, train_y, test_x, test_y


train_x, train_y, test_x, test_y = gen_data(x_range=(-1, 1))


# ============================ step 2/5 model ============================
class MLP(nn.Module):
    def __init__(self, neural_num):
        super(MLP, self).__init__()
        self.linears = nn.Sequential(
            nn.Linear(1, neural_num),
            nn.ReLU(inplace=True),
            nn.Linear(neural_num, neural_num),
            nn.ReLU(inplace=True),
            nn.Linear(neural_num, neural_num),
            nn.ReLU(inplace=True),
            nn.Linear(neural_num, 1),
        )

    def forward(self, x):
        return self.linears(x)


net_normal = MLP(neural_num=n_hidden)
net_weight_decay = MLP(neural_num=n_hidden)

# ============================ step 3/5 optimizer ============================
optim_normal = torch.optim.SGD(net_normal.parameters(), lr=lr_init, momentum=0.9)
optim_wdecay = torch.optim.SGD(net_weight_decay.parameters(), lr=lr_init, momentum=0.9, weight_decay=1e-2)
# weight_decay is enabled here

# ============================ step 4/5 loss function ============================
loss_func = torch.nn.MSELoss()

# ============================ step 5/5 training loop ============================

writer = SummaryWriter(comment='_test_tensorboard', filename_suffix="12345678")
for epoch in range(max_iter):

    # forward
    pred_normal, pred_wdecay = net_normal(train_x), net_weight_decay(train_x)
    loss_normal, loss_wdecay = loss_func(pred_normal, train_y), loss_func(pred_wdecay, train_y)

    optim_normal.zero_grad()
    optim_wdecay.zero_grad()

    loss_normal.backward()
    loss_wdecay.backward()

    optim_normal.step()
    optim_wdecay.step()

    ...

Viewing the parameters in TensorBoard, it is clear that the weights of the L2-regularized model are more concentrated:

2. Regularization: Dropout

1. Randomness: the dropout probability

Deactivation: weight = 0

Each neuron in the layer is dropped independently with probability prob; it is not the case that exactly a prob fraction of the neurons is dropped.
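A quick sketch of this behaviour with nn.Dropout (the variable names are ours): each element is zeroed independently with probability p, and the survivors are rescaled already at training time, as discussed below.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(10)

drop.train()               # training mode: dropout active
y_train = drop(x)
print(y_train)             # each entry is either 0.0 or 2.0 (= 1/(1-0.5))

drop.eval()                # eval mode: dropout is the identity
y_eval = drop(x)
print(y_eval)              # identical to x
```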

2. Three resulting effects:

  • Reduced reliance on specific features
  • Averaging of weight values
  • Reduced data scale

Suppose prob = 0.3 and dropout is disabled at test time. To compensate for the change of scale, the retained units are divided by \(1-p\) during training:
Test: \(100 = \sum_{100} W x\)
Train: \(70 = \sum_{70} W x \Longrightarrow 100 = \sum_{70} W x/(1-p)\)

As a result, passing through the layer in the two modes gives approximately the same output:

import torch
import torch.nn as nn

class Net(nn.Module):          # minimal net reconstructed for this demo
    def __init__(self, neural_num, d_prob=0.5):
        super(Net, self).__init__()
        self.linears = nn.Sequential(
            nn.Dropout(d_prob),
            nn.Linear(neural_num, 1, bias=False),
            nn.ReLU(inplace=True))

    def forward(self, x):
        return self.linears(x)

input_num = 10000
x = torch.ones((input_num,), dtype=torch.float32)
net = Net(input_num, d_prob=0.5)
net.linears[1].weight.detach().fill_(1.)

net.train()   # training mode: dropout active, survivors scaled by 1/(1-p)
y = net(x)
print("output in training mode", y)

net.eval()    # evaluation mode: dropout is the identity
y = net(x)
print("output in eval mode", y)

output in training mode tensor([9942.], grad_fn=<ReluBackward1>)
output in eval mode tensor([10000.], grad_fn=<ReluBackward1>)

3. Again taking linear regression as the example:

Build two models, one with dropout and one without: as training proceeds, the model with dropout produces a visibly smoother fit.

class MLP(nn.Module):
    def __init__(self, neural_num, d_prob=0.5):
        super(MLP, self).__init__()
        self.linears = nn.Sequential(

            nn.Linear(1, neural_num),
            nn.ReLU(inplace=True),

            nn.Dropout(d_prob),
            nn.Linear(neural_num, neural_num),
            nn.ReLU(inplace=True),

            nn.Dropout(d_prob),
            nn.Linear(neural_num, neural_num),
            nn.ReLU(inplace=True),

            nn.Dropout(d_prob),
            nn.Linear(neural_num, 1),
        )

    def forward(self, x):
        return self.linears(x)


net_prob_0 = MLP(neural_num=n_hidden, d_prob=0.)
net_prob_05 = MLP(neural_num=n_hidden, d_prob=0.5)

Observing the weight distributions of the linear layers, the model with dropout clearly has more concentrated parameters, with a higher peak.

3. Batch Normalization

1. Batch Normalization: normalization over a batch

Batch: a batch of data, usually a mini-batch
Normalization: zero mean, unit variance

Advantages:

  1. Allows a larger learning rate, speeding up convergence
  2. Removes the need for carefully designed weight initialization
  3. Allows dropping dropout, or using a smaller dropout rate
  4. Allows dropping L2, or using a smaller weight decay
  5. Removes the need for LRN (local response normalization)

Note that \(\gamma, \beta\) are learnable parameters: if a layer turns out not to need BN, it can learn \(\gamma = \sigma_{\mathcal{B}},\ \beta = \mu_{\mathcal{B}}\), recovering the identity transform.

For details, see the reading notes and implementation of "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift".

2. Internal Covariate Shift (ICS)

BN guards against vanishing or exploding gradients caused by uneven data scales/distributions, which make training difficult.

The other normalization methods introduced in Section 4 likewise aim to avoid ICS.

3. _BatchNorm

In PyTorch, nn.BatchNorm1d, nn.BatchNorm2d and nn.BatchNorm3d all inherit from _BatchNorm and take the following parameters:

__init__(self, num_features,   # number of features per sample (most important)
                eps=1e-5,              # small constant added to the denominator
                momentum=0.1,          # exponential moving average for running mean/var
                affine=True,           # whether to apply the affine transform
                track_running_stats=True)  # training mode vs. evaluation mode

Main attributes of a BatchNorm layer:

  • running_mean: the mean
  • running_var: the variance
  • weight: gamma in the affine transform
  • bias: beta in the affine transform

Training: mean and variance are tracked with an exponential moving average

running_mean = (1 - momentum) * running_mean + momentum * mean_t

running_var = (1 - momentum) * running_var + momentum * var_t

Testing: the current running statistics are used
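A minimal sketch of this update rule, assuming the default momentum=0.1 (running_var is tracked analogously, with the unbiased variance):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
bn = nn.BatchNorm1d(num_features=3, momentum=0.1)
x = torch.randn(16, 3)

bn.train()                  # training mode: running stats are updated
bn(x)
# running_mean starts at 0, so after one step it is 0.9 * 0 + 0.1 * batch_mean
expected = 0.9 * torch.zeros(3) + 0.1 * x.mean(dim=0)
print(torch.allclose(bn.running_mean, expected, atol=1e-6))  # True

bn.eval()                   # eval mode: stats are used but no longer updated
bn(x)
print(torch.allclose(bn.running_mean, expected, atol=1e-6))  # still True
```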

As the figure above shows, num_features in the 1D, 2D and 3D cases counts features, feature maps and feature volumes respectively. BN computes the mean and variance of each feature over all samples, so the three illustrated cases have 5, 3 and 3 features, and the corresponding \(\gamma, \beta\) therefore have dimensions 5, 3 and 3.
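The per-feature shape of \(\gamma\) (weight) and \(\beta\) (bias) can be confirmed directly (a small sketch, not from the original):

```python
import torch.nn as nn

# one gamma/beta entry per feature / feature map / feature volume
print(nn.BatchNorm1d(num_features=5).weight.shape)  # torch.Size([5])
print(nn.BatchNorm2d(num_features=3).weight.shape)  # torch.Size([3])
print(nn.BatchNorm3d(num_features=3).bias.shape)    # torch.Size([3])
```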

4. Again with the RMB binary-classification example:

We insert BN layers after the convolutional and linear layers of the original network; note that a 2D BN follows a convolutional layer, while a 1D BN follows a linear layer.

class LeNet_bn(nn.Module):
    def __init__(self, classes):
        super(LeNet_bn, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.bn1 = nn.BatchNorm2d(num_features=6)

        self.conv2 = nn.Conv2d(6, 16, 5)
        self.bn2 = nn.BatchNorm2d(num_features=16)

        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.bn3 = nn.BatchNorm1d(num_features=120)

        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, classes)

    def forward(self, x):
        out = self.conv1(x)
        out = self.bn1(out)
        out = F.relu(out)

        out = F.max_pool2d(out, 2)

        out = self.conv2(out)
        out = self.bn2(out)
        out = F.relu(out)

        out = F.max_pool2d(out, 2)

        out = out.view(out.size(0), -1)

        out = self.fc1(out)
        out = self.bn3(out)
        out = F.relu(out)

        out = F.relu(self.fc2(out))
        out = self.fc3(out)
        return out

    def initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.xavier_normal_(m.weight.data)
                if m.bias is not None:
                    m.bias.data.zero_()
            elif isinstance(m, nn.BatchNorm2d):
                m.weight.data.fill_(1)
                m.bias.data.zero_()
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight.data, 0, 1)
                m.bias.data.zero_()

  1. Using net = LeNet(classes=2) with no initialization:

  2. Using the carefully designed initialization net.initialize_weights():

  3. Using net = LeNet_bn(classes=2): even though the loss still has unstable stretches, its maximum never exceeds 1.5 the way the first two do.

4. Normalization layers

1. Layer Normalization

Motivation: BN is unsuitable for variable-length networks such as RNNs
Idea: compute the mean and variance per layer
Notes:

  1. There are no longer running_mean and running_var
  2. gamma and beta are element-wise

nn.LayerNorm(normalized_shape,  # shape of this layer's features
            eps=1e-05,
            elementwise_affine=True)  # whether to apply the affine transform

Note that normalized_shape can be any trailing suffix of the input shape. For an input batch of shape [8, 6, 3, 4], it can be [6, 3, 4], [3, 4] or [4], but not [6, 3].
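A quick check of this rule, using the same shapes as the example above:

```python
import torch
import torch.nn as nn

x = torch.randn(8, 6, 3, 4)                # a batch of 8

print(nn.LayerNorm([6, 3, 4])(x).shape)    # torch.Size([8, 6, 3, 4])
print(nn.LayerNorm([3, 4])(x).shape)       # works: last two dims
print(nn.LayerNorm([4])(x).shape)          # works: last dim

try:
    nn.LayerNorm([6, 3])(x)                # [6, 3] is not a trailing suffix
except RuntimeError:
    print("RuntimeError for normalized_shape=[6, 3]")
```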

2. Instance Normalization

Motivation: BN is unsuitable for image generation, where the images within a batch differ from one another and cannot be normalized across the batch
Idea: compute the mean and variance per instance (per channel)

nn.InstanceNorm2d(num_features, 
                eps=1e-05, 
                momentum=0.1, 
                affine=False, 
                track_running_stats=False)
# 1d and 3d variants also exist

Image style transfer is one application where BN cannot be used: the input images all differ, so mean and variance can only be computed per channel.

3. Group Normalization

Motivation: with small batches, BN's estimated statistics are inaccurate
Idea: when samples are scarce, borrow statistical strength from the channels

nn.GroupNorm(num_groups, 	# number of groups; must divide num_channels
            num_channels, 
            eps=1e-05, 
            affine=True)

Notes:

  1. There are no longer running_mean and running_var
  2. gamma and beta are per-channel

Typical scenario: large-model (small batch size) tasks

When num_groups = 1, GN is equivalent to LN
When num_groups = num_channels, GN is equivalent to IN
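Both equivalences can be verified numerically; a small sketch with the affine transforms disabled and arbitrarily chosen shapes:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
x = torch.randn(2, 6, 3, 3)   # (N, C, H, W)

# num_groups=1: one group spanning all channels, like LayerNorm over (C, H, W)
gn_ln = nn.GroupNorm(num_groups=1, num_channels=6, affine=False)
ln = nn.LayerNorm([6, 3, 3], elementwise_affine=False)
print(torch.allclose(gn_ln(x), ln(x), atol=1e-4))    # True

# num_groups=num_channels: one group per channel, like InstanceNorm
gn_in = nn.GroupNorm(num_groups=6, num_channels=6, affine=False)
inorm = nn.InstanceNorm2d(num_features=6)
print(torch.allclose(gn_in(x), inorm(x), atol=1e-4)) # True
```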

4. Summary
