VisualPytorch is published under two domains, each served by its own server:
http://nag.visualpytorch.top/static/ (served from 114.115.148.27)
http://visualpytorch.top/static/ (served from 39.97.209.22)
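The generalization error decomposes into the sum of bias, variance and noise: error = bias + variance + noise (for squared loss, the bias term enters squared).
Bias measures how far the learning algorithm's expected prediction deviates from the true result; it characterizes the fitting ability of the algorithm itself.
Variance measures how much the learning performance changes when the training set is replaced by another one of the same size; it characterizes the effect of data perturbation.
Noise is the lower bound on the expected generalization error that any learning algorithm can reach on the current task.
Strictly speaking, variance here means the variance of the predictions across different training sets, not the gap between the training-set loss and the test-set loss.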
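For squared loss the decomposition can be written out explicitly. The following is a sketch in notation that does not appear in the original: \(D\) is a training set, \(f(x; D)\) the model trained on it, \(\bar f(x) = \mathbb{E}_D[f(x; D)]\) the expected prediction, \(y\) the true label and \(y_D\) the label observed in \(D\). It assumes the label noise has zero mean.

\[
\mathbb{E}_D\big[(f(x;D) - y_D)^2\big]
= \underbrace{(\bar f(x) - y)^2}_{\text{bias}^2}
+ \underbrace{\mathbb{E}_D\big[(f(x;D) - \bar f(x))^2\big]}_{\text{variance}}
+ \underbrace{\mathbb{E}_D\big[(y_D - y)^2\big]}_{\text{noise}}
\]

The middle term is the variance described above: it is taken over training sets \(D\), with the test point \(x\) held fixed.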
Loss function: \(Loss = f(\hat y , y)\)
Cost function: \(Cost = \frac{1}{N}\sum_i f(\hat y_i, y_i)\)
Objective function: \(Obj = Cost + Regularization\)
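L1 Regularization: \(\sum_i |w_i|\). The optimum often falls on a coordinate axis (a vertex of the constraint region), so L1 tends to produce sparse parameters.
L2 Regularization: \(\sum_i w_i^2\). Writing the penalty as \(\frac{\lambda}{2}\sum_i w_i^2\), taking a learning rate of 1, and letting the subscript on \(w\) denote the update step, \(w_{i+1} = w_i - \frac{\partial Obj}{\partial w_i} = w_i - (\frac{\partial Loss}{\partial w_i}+\lambda w_i) = w_i(1-\lambda) - \frac{\partial Loss}{\partial w_i}\). Each step shrinks the weight by a factor of \(1-\lambda\), which is why L2 regularization is often called weight decay.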
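The equivalence between an L2 penalty and weight decay can be checked numerically. The sketch below is plain Python for a single scalar weight; the function names and numbers are illustrative only and are not taken from the original code. It includes a learning rate `lr`, and with `lr = 1` it reduces to the formula above.

```python
def sgd_step_with_l2(w, grad_loss, lr, lam):
    """One SGD step on Obj = Loss + (lam / 2) * w**2."""
    grad_obj = grad_loss + lam * w          # gradient of the penalised objective
    return w - lr * grad_obj


def sgd_step_with_decay(w, grad_loss, lr, lam):
    """The same step written as: shrink the weight, then follow the loss gradient."""
    return w * (1 - lr * lam) - lr * grad_loss


# Illustrative values (assumptions, not from the original experiment).
w, grad_loss, lr, lam = 2.0, 0.5, 0.1, 0.01
a = sgd_step_with_l2(w, grad_loss, lr, lam)
b = sgd_step_with_decay(w, grad_loss, lr, lam)
print(a, b)  # the two forms agree up to floating-point rounding
```

The two functions are algebraically the same update. The second form simply makes the shrink factor `1 - lr * lam` visible.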
We build two models, one with weight_decay and one without. As training proceeds, the loss of the model without the regularization term tends to 0.
```python
import torch
import torch.nn as nn
from torch.utils.tensorboard import SummaryWriter

# Assumed hyperparameters: the excerpt uses these names without defining them.
n_hidden = 200
max_iter = 2000
lr_init = 0.01

# ============================ step 1/5 data ============================
def gen_data(num_data=10, x_range=(-1, 1)):
    w = 1.5
    train_x = torch.linspace(*x_range, num_data).unsqueeze_(1)
    train_y = w*train_x + torch.normal(0, 0.5, size=train_x.size())
    test_x = torch.linspace(*x_range, num_data).unsqueeze_(1)
    test_y = w*test_x + torch.normal(0, 0.3, size=test_x.size())
    return train_x, train_y, test_x, test_y

train_x, train_y, test_x, test_y = gen_data(x_range=(-1, 1))

# ============================ step 2/5 model ============================
class MLP(nn.Module):
    def __init__(self, neural_num):
        super(MLP, self).__init__()
        self.linears = nn.Sequential(
            nn.Linear(1, neural_num),
            nn.ReLU(inplace=True),
            nn.Linear(neural_num, neural_num),
            nn.ReLU(inplace=True),
            nn.Linear(neural_num, neural_num),
            nn.ReLU(inplace=True),
            nn.Linear(neural_num, 1),
        )

    def forward(self, x):
        return self.linears(x)

net_normal = MLP(neural_num=n_hidden)
net_weight_decay = MLP(neural_num=n_hidden)

# ============================ step 3/5 optimizer ============================
optim_normal = torch.optim.SGD(net_normal.parameters(), lr=lr_init, momentum=0.9)
# This optimizer applies weight_decay.
optim_wdecay = torch.optim.SGD(net_weight_decay.parameters(), lr=lr_init, momentum=0.9, weight_decay=1e-2)

# ============================ step 4/5 loss function ============================
loss_func = torch.nn.MSELoss()

# ============================ step 5/5 training loop ============================
writer = SummaryWriter(comment='_test_tensorboard', filename_suffix="12345678")
for epoch in range(max_iter):
    # forward
    pred_normal, pred_wdecay = net_normal(train_x), net_weight_decay(train_x)
    loss_normal, loss_wdecay = loss_func(pred_normal, train_y), loss_func(pred_wdecay, train_y)

    optim_normal.zero_grad()
    optim_wdecay.zero_grad()

    loss_normal.backward()
    loss_wdecay.backward()

    optim_normal.step()
    optim_wdecay.step()
    ...
```
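Inspecting the parameters in TensorBoard shows clearly that the weights of the model trained with L2 are more concentrated:
Deactivation: weight = 0
This means that every neuron in the layer has probability prob of being deactivated, not that a fraction prob of the neurons is deactivated.
Suppose prob = 0.3. Dropout is not applied at test time, so to cancel the resulting change of scale, the weights are divided by \((1-p)\) during training.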
Test: \(100 = \sum_{100} W_x\)
Train: \(70 = \sum_{70} W_x \Longrightarrow 100 = \sum_{70} W_x/(1-p)\)
As a result, the layer produces approximately the same output in both modes:
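```python
# Net, input_num and x are defined elsewhere in the original script.
net = Net(input_num, d_prob=0.5)
net.linears[1].weight.detach().fill_(1.)

net.train()   # switch back to training mode once testing is finished
y = net(x)
print("output in training mode", y)

net.eval()    # switch to evaluation mode before testing starts
y = net(x)
print("output in eval mode", y)
```

```
output in training mode tensor([9942.], grad_fn=<ReluBackward1>)
output in eval mode tensor([10000.], grad_fn=<ReluBackward1>)
```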
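The rescaling argument can also be reproduced without a network. The following is a minimal plain-Python sketch. The helper `inverted_dropout` is hypothetical, not the original `Net`, and it rescales the surviving activations rather than the weights, which has the same effect on the layer's summed output.

```python
import random


def inverted_dropout(values, p, training, rng):
    """Zero each value with probability p; in training, rescale survivors by 1 / (1 - p)."""
    if not training or p == 0.0:
        return list(values)                # eval mode: identity, no rescaling
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]


rng = random.Random(0)                     # fixed seed so the run is repeatable
inputs = [1.0] * 10000                     # 10000 unit inputs, all weights taken as 1
p = 0.5

train_sum = sum(inverted_dropout(inputs, p, training=True, rng=rng))
eval_sum = sum(inverted_dropout(inputs, p, training=False, rng=rng))
print("training mode:", train_sum)
print("eval mode:", eval_sum)
```

In eval mode the sum is exactly 10000. In training mode it fluctuates around 10000, because roughly half the inputs survive and each survivor is doubled.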
We build two models, one with dropout and one without. As training proceeds, the model with dropout gives a smoother fit.
```python
import torch.nn as nn


class MLP(nn.Module):
    def __init__(self, neural_num, d_prob=0.5):
        super(MLP, self).__init__()
        self.linears = nn.Sequential(
            nn.Linear(1, neural_num),
            nn.ReLU(inplace=True),

            nn.Dropout(d_prob),
            nn.Linear(neural_num, neural_num),
            nn.ReLU(inplace=True),

            nn.Dropout(d_prob),
            nn.Linear(neural_num, neural_num),
            nn.ReLU(inplace=True),

            nn.Dropout(d_prob),
            nn.Linear(neural_num, 1),
        )

    def forward(self, x):
        return self.linears(x)


# n_hidden is the hidden-layer width defined elsewhere in the original script.
net_prob_0 = MLP(neural_num=n_hidden, d_prob=0.)
net_prob_05 = MLP(neural_num=n_hidden, d_prob=0.5)
```
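Looking at the weight distribution of the linear layers, the parameters of the model with dropout are clearly more concentrated, with a higher peak.
Batch: a batch of data, usually a mini-batch
Normalization: zero mean, unit variance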
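Advantages:
Note that \(\gamma, \beta\) are learnable parameters. If the layer is better off without BN, training can arrive at \(\gamma = \sigma_{\mathcal{B}}, \beta = \mu_{\mathcal{B}}\), which turns the layer into an identity transform.
For details, see the reading notes and implementation for "Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift".
BN prevents uneven data scales or distributions from causing vanishing or exploding gradients, which make training difficult.
The other normalization methods covered in Section 4 likewise aim to avoid ICS (Internal Covariate Shift).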
In PyTorch, nn.BatchNorm1d, nn.BatchNorm2d and nn.BatchNorm3d all inherit from _BatchNorm and take the following parameters:
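```python
__init__(self,
         num_features,               # number of features per sample (the most important argument)
         eps=1e-5,                   # term added to the denominator for numerical stability
         momentum=0.1,               # factor of the exponentially weighted estimate of mean/var
         affine=True,                # whether to apply the learnable affine transform
         track_running_stats=True)   # whether to keep running mean/var for use in eval mode
```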
Main parameters of the BatchNorm layer:
Training: the running mean and variance are maintained as an exponentially weighted average:
running_mean = (1 - momentum) * running_mean + momentum * mean_t
running_var = (1 - momentum) * running_var + momentum * var_t
Testing: the currently stored running statistics are used.
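The update rule is easy to trace by hand. Below is a small plain-Python sketch of it. The helper name `update_running` and the batch statistics are made up for illustration, while the starting values 0 and 1 match the defaults PyTorch uses for `running_mean` and `running_var`.

```python
def update_running(running, batch_stat, momentum=0.1):
    """Exponentially weighted update, as in the two formulas above."""
    return (1 - momentum) * running + momentum * batch_stat


running_mean, running_var = 0.0, 1.0       # initial values before any batch is seen

# Hypothetical per-batch statistics for one feature over three training steps.
batch_means = [2.0, 2.0, 2.0]
batch_vars = [4.0, 4.0, 4.0]

for mean_t, var_t in zip(batch_means, batch_vars):
    running_mean = update_running(running_mean, mean_t)
    running_var = update_running(running_var, var_t)
    print(round(running_mean, 4), round(running_var, 4))
```

With momentum = 0.1 the running values move only 10% of the way toward each batch statistic, so after three steps they are still well short of 2 and 4. This is why the stored statistics need many iterations to settle.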
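As shown in the figure above, in the 1D, 2D and 3D cases the number of features is the number of scalar features, feature maps and feature volumes respectively. BN computes the mean and variance of each feature over all samples, so in the three cases in the figure the feature counts are 5, 3 and 3, and \(\gamma, \beta\) have the matching dimensions 5, 3 and 3.
We add BN layers after the convolutional and linear layers of our original network. Note that a convolutional layer is followed by 2D BN and a linear layer by 1D BN.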
```python
import torch.nn as nn
import torch.nn.functional as F


class LeNet_bn(nn.Module):
    def __init__(self, classes):
        super(LeNet_bn, self).__init__()
        self.conv1 = nn.Conv2d(3, 6, 5)
        self.bn1 = nn.BatchNorm2d(num_features=6)

        self.conv2 = nn.Conv2d(6, 16, 5)
        self.bn2 = nn.BatchNorm2d(num_features=16)

        self.fc1 = nn.Linear(16 * 5 * 5, 120)
        self.bn3 = nn.BatchNorm1d(num_features=120)

        self.fc2 = nn.Linear(120, 84)
        self.fc3 = nn.Linear(84, classes)

    def forward(self, x):
        # conv -> 2D BN -> ReLU -> pool
        out = self.conv1(x)
        out = self.bn1(out)
        out = F.relu(out)
        out = F.max_pool2d(out, 2)

        out = self.conv2(out)
        out = self.bn2(out)
        out = F.relu(out)
        out = F.max_pool2d(out, 2)

        # flatten, then linear -> 1D BN -> ReLU
        out = out.view(out.size(0), -1)
        out = self.fc1(out)
        out = self.bn3(out)
        out = F.relu(out)

        out = F.relu(self.fc2(out))
        out = self.fc3(out)
        return out

    def initialize_weights(self):
        for m in self.modules():
            if isinstance(m, nn.Conv2d):
                nn.init.xavier_normal_(m.weight.data)
                if m.bias is not None:
                    m.bias.data.zero_()
            elif isinstance(m, nn.BatchNorm2d):
                m.weight.data.fill_(1)
                m.bias.data.zero_()
            elif isinstance(m, nn.Linear):
                nn.init.normal_(m.weight.data, 0, 1)
                m.bias.data.zero_()
```
Three setups are compared:
Without initialization: net = LeNet(classes=2)
With explicit initialization: net.initialize_weights()
With BN: net = LeNet_bn(classes=2)
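The results are as follows: even though the loss goes through unstable stretches, its maximum never exceeds 1.5 as it does in the first two setups.
Motivation: BN does not suit variable-length networks such as RNNs.
Idea: compute the mean and variance layer by layer.
Notes: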
```python
nn.LayerNorm(normalized_shape,          # shape of the features normalized by this layer
             eps=1e-05,
             elementwise_affine=True)   # whether to apply the learnable affine transform
```
Note that normalized_shape can be any number of trailing dimensions of the input. For example, for a batched input of shape [8, 6, 3, 4], it can be [6, 3, 4], [3, 4] or [4], but not [6, 3].
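The rule for normalized_shape amounts to a suffix check on the input shape. The sketch below expresses it in plain Python. Both helper functions are hypothetical illustrations, not part of PyTorch, which performs its own validation when the layer is called.

```python
from math import prod


def is_valid_normalized_shape(input_shape, normalized_shape):
    """True if normalized_shape equals the trailing dimensions of input_shape."""
    n = len(normalized_shape)
    if n == 0 or n > len(input_shape):
        return False
    return list(input_shape[-n:]) == list(normalized_shape)


def elements_per_statistic(normalized_shape):
    """How many values each mean/variance pair is computed over."""
    return prod(normalized_shape)


batch_shape = [8, 6, 3, 4]
for candidate in ([6, 3, 4], [3, 4], [4], [6, 3]):
    ok = is_valid_normalized_shape(batch_shape, candidate)
    print(candidate, ok, elements_per_statistic(candidate) if ok else None)
```

For the shape [6, 3, 4], each sample in the batch is normalized over its 72 values. The candidate [6, 3] is rejected because it does not match the last two dimensions [3, 4].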
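Motivation: BN does not suit image generation. The images in a batch all differ from one another, so they cannot be normalized across the batch.
Idea: compute the mean and variance per instance (per channel).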
```python
nn.InstanceNorm2d(num_features, eps=1e-05, momentum=0.1,
                  affine=False, track_running_stats=False)
# 1d and 3d variants exist as well
```
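Image style transfer is one application where BN cannot be used: the input images all differ, so the mean and variance can only be computed per channel.
Motivation: with small batches, the statistics estimated by BN are inaccurate.
Idea: when there is not enough data, make up for it with channels.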
```python
nn.GroupNorm(num_groups,      # number of groups; must be a divisor of num_channels
             num_channels,
             eps=1e-05,
             affine=True)
```
Notes:
Use case: large-model tasks, where the batch size is small.
When num_groups = num_channels, GN is equivalent to IN.
When num_groups = 1, GN is equivalent to LN.
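These two limiting cases can be verified on a toy sample. The following is a plain-Python sketch of group normalization for a single sample, without the affine \(\gamma, \beta\) step. The helper names and the sample values are illustrative assumptions, not PyTorch code.

```python
import math


def normalize(values, eps=1e-5):
    """Standardise a flat list to zero mean and unit (biased) variance."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return [(v - mean) / math.sqrt(var + eps) for v in values]


def group_norm(sample, num_groups, eps=1e-5):
    """Group-normalise one sample given as a list of equally sized channels."""
    num_channels = len(sample)
    if num_channels % num_groups != 0:
        raise ValueError("num_groups must divide num_channels")
    per_group = num_channels // num_groups
    width = len(sample[0])
    out = []
    for g in range(num_groups):
        channels = sample[g * per_group:(g + 1) * per_group]
        flat = [v for ch in channels for v in ch]      # pool all values in the group
        normed = normalize(flat, eps)
        # split the normalised values back into channels
        out.extend(normed[i * width:(i + 1) * width] for i in range(per_group))
    return out


sample = [[1.0, 2.0], [3.0, 4.0], [10.0, 20.0], [30.0, 40.0]]   # 4 channels, 2 values each

one_group = group_norm(sample, num_groups=1)       # statistics over every channel: LN-like
per_channel = group_norm(sample, num_groups=4)     # statistics per channel: IN-like
two_groups = group_norm(sample, num_groups=2)      # the intermediate GN case

layer_norm_ref = normalize([v for ch in sample for v in ch])
instance_norm_ref = [normalize(ch) for ch in sample]
print([v for ch in one_group for v in ch] == layer_norm_ref)
print(per_channel == instance_norm_ref)
```

Both comparisons hold because the limiting cases pool exactly the same values as the reference computations. Only the intermediate setting, `num_groups=2` here, gives statistics specific to group normalization.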