Torch deep learning (3)

Date: 2019-11-25




Loss functions and model training

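In the previous parts we finished preprocessing the data and building the model. To train the model, we now need to define its loss function and then use backpropagation (BP) to adjust the model parameters.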

Loss function (Criterion)

Load the packages

require 'torch'
require 'nn'        -- the loss functions (criterions) also live in the 'nn' package

Set up the command-line options

if not opt then 
    print "==> processing options:"
    cmd = torch.CmdLine()
    cmd:text()
    cmd:text('Options:')
    cmd:text()
    cmd:option('-loss','nll','type of loss function to minimize: nll | mse | margin')
    -- nll: negative log-likelihood; mse: mean-square error; margin: margin loss (a max-margin criterion similar to an SVM)
    cmd:text()
    opt=cmd:parse(arg or {})
    
    model = nn.Sequential()
    -- this model only exists so that this loss-function file can be run on its own; when the whole project is run, this branch is never executed
end

Define the loss function

noutputs = 10 -- mainly needed by the mse loss
if opt.loss == 'margin' then 
    criterion = nn.MultiMarginCriterion()
elseif opt.loss == 'nll' then
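    -- the negative log-likelihood expects log-probabilities as input, so the model output
    -- has to be normalized accordingly; a LogSoftMax layer is the usual choice
    model:add(nn.LogSoftMax()) -- note: the output is a vector of log-probabilities
    criterion = nn.ClassNLLCriterion()
elseif opt.loss == 'mse' then
    -- This loss is meant for fitting data (regression), not for classification: in classification
    -- it is enough to pick the right class, the output does not have to match the label value.
    -- For a two-class problem the labels could just as well be 1, 2 as 3, 4, so fitting them has no real meaning.
    -- It is included here only to show how it is defined; it will not actually be used.
    criterion = nn.MSECriterion()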
    
    -- Compared to the other losses, MSE criterion needs a distribution as a target, instead of an index.
    -- So we need to transform the entire label vectors:
    
    if trainData then
        -- convert training labels
        local trsize = (#trainData.labels)[1] 
        local trlabels = torch.Tensor(trsize,noutputs)
        trlabels:fill(-1)
        for i=1,trsize do
            trlabels[{i,trainData.labels[i]}] = 1 -- 1 marks the class the sample belongs to
        end
        trainData.labels=trlabels
        
        -- convert test labels
        local tesize = testData.labels:size()[1]
        local telabels = torch.Tensor(tesize,noutputs):fill(-1)
        for i=1,tesize do
            telabels[{{i},{testData.labels[i]}}]=1
        end
        testData.labels=telabels
    end
else
    error('unknown -loss')
end

print ('==> the loss function is:')
print (criterion)

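As you can see, defining a loss function is simple: one line in each case. The thing to watch when calling a criterion is the form of its input and its target. For more loss functions and how to use them, see torch/nn/Criterions.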
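To make the point about input and target forms concrete, here is a minimal sketch in plain Lua (no Torch required) of what a log-softmax followed by a negative log-likelihood computes for one sample, next to the +1/-1 target vector that the mse branch builds. The helper names `log_softmax`, `nll` and `one_hot_pm1` are made up for this illustration; they are not part of the nn package.

```lua
-- Plain Lua illustration; all function names here are hypothetical helpers.

-- log_softmax: turn raw scores into log-probabilities.
local function log_softmax(scores)
    -- subtract the maximum first for numerical stability
    local max = scores[1]
    for i = 2, #scores do
        if scores[i] > max then max = scores[i] end
    end
    local sum = 0
    for i = 1, #scores do
        sum = sum + math.exp(scores[i] - max)
    end
    local log_sum = max + math.log(sum)
    local out = {}
    for i = 1, #scores do
        out[i] = scores[i] - log_sum
    end
    return out
end

-- nll: the target is a 1-based class index, not a vector.
local function nll(log_probs, target)
    return -log_probs[target]
end

-- one_hot_pm1: the target is a vector with +1 at the class and -1 elsewhere,
-- which is the form the mse branch converts the labels to.
local function one_hot_pm1(target, n)
    local t = {}
    for i = 1, n do t[i] = -1 end
    t[target] = 1
    return t
end

local lp = log_softmax({1.0, 2.0, 3.0})
print(nll(lp, 3))   -- lowest loss: class 3 has the highest score
print(nll(lp, 1))   -- higher loss: class 1 has the lowest score
```

The index-versus-vector difference is why the label tensors have to be rewritten before the mse criterion can be used, while nll and margin can consume the class indices directly.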

Training the model

Load the modules

require 'torch'
require 'xlua'          -- mainly used to display the progress bar
require 'optim'         -- optimization algorithms, plus the confusion matrix

Predefine the command-line options

if not opt then
    print '==> processing options:'
    cmd=torch.CmdLine()
    cmd:text()
    cmd:text('options:')
    cmd:text()
    cmd:option('-save','results','subdirectory to save/log experiments in') -- where results are saved
    cmd:option('-visualize',false,'visualize input data and weights during training')
    cmd:option('-plot',false,'live plot') -- for these two options, see the usage of optim/Logger
    -- the following options select the optimization method and its parameters
    cmd:option('-optimization','SGD','optimization method: SGD | ASGD | CG | LBFGS')
    -- respectively: stochastic gradient descent, averaged stochastic gradient descent, conjugate gradient, limited-memory BFGS
    cmd:option('-learningRate',1e-3,'learning rate at t=0') -- step size
    cmd:option('-batchSize',1,'mini-batch size (1 = pure stochastic)') -- mini-batch size; with size 1 this is pure stochastic gradient descent
    cmd:option('-weightDecay',0,'weight decay (SGD only)') -- coefficient of the L2 regularization term
    cmd:option('-momentum',0,'momentum (SGD only)')  -- inertia (momentum) coefficient
    cmd:option('-t0',1,'start averaging at t0 (ASGD only), in nb of epochs')
    cmd:option('-maxIter',2,'maximum nb of iterations for CG and LBFGS') -- maximum number of iterations, used by CG and LBFGS
    cmd:text()
    opt = cmd:parse(arg or {})
end

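A note on these parameters. Plain stochastic gradient descent usually updates the weights as $w_{t+1} = w_t - \eta \nabla E(w_t)$, where $\nabla E(w_t)$ is the gradient from the previous step and $\eta$ is the learning rate, i.e. the step size. A step size that is too large tends to cause oscillation; one that is too small makes convergence slow and makes it easier to get stuck in a local optimum. For this reason training usually starts with a relatively large step size that is then gradually decayed.
To give BP better convergence behaviour, an "inertia term" can be added to the weight update: the new search direction is a combination of the previous update direction and the current gradient direction, commonly written as $\Delta w_t = \mu \Delta w_{t-1} - \eta \nabla E(w_t)$, $w_{t+1} = w_t + \Delta w_t$. The inertia coefficient $\mu$ is the momentum parameter.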

The regularization term is there mainly to prevent overfitting by controlling the complexity of the model.
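The roles of the step size, the inertia term and the regularization term can be sketched in a few lines of plain Lua. This is an illustration of the textbook momentum update under stated assumptions, not the actual code of optim.sgd (which differs in details such as dampening and learning-rate decay); the function `sgd_step` and the field `state.velocity` are hypothetical names.

```lua
-- sgd_step: one in-place update of the weight table w.
-- state.velocity stores the previous update (the "inertia" term).
local function sgd_step(w, grad, cfg, state)
    local lr = cfg.learningRate
    local mom = cfg.momentum or 0
    local wd = cfg.weightDecay or 0
    state.velocity = state.velocity or {}
    for i = 1, #w do
        -- L2 weight decay adds wd * w to the gradient
        local g = grad[i] + wd * w[i]
        -- new step = momentum * previous step - learning rate * gradient
        local v = mom * (state.velocity[i] or 0) - lr * g
        state.velocity[i] = v
        w[i] = w[i] + v
    end
    return w
end

-- Minimize E(w) = 0.5 * w^2, whose gradient is simply w.
local w, state = {1.0}, {}
local cfg = {learningRate = 0.1, momentum = 0.5, weightDecay = 0}
for step = 1, 100 do
    sgd_step(w, {w[1]}, cfg, state)
end
print(w[1])  -- very close to 0 after 100 steps
```

With momentum set to 0 the function reduces to the plain gradient step, and a non-zero weightDecay pulls every weight toward zero even when the data gradient is zero.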

Define some analysis tools

classes = {'1','2','3','4','5','6','7','8','9','0'}

confusion = optim.ConfusionMatrix(classes) -- confusion matrix for evaluating the model; used later to compute accuracy, recall, etc.
trainLogger = optim.Logger(paths.concat(opt.save,'train.log'))
testLogger = optim.Logger(paths.concat(opt.save,'test.log'))
-- two loggers, one for the training log and one for the test log

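For background on the confusion matrix, see the article on confusion matrices. The ConfusionMatrix in optim mainly exposes three quantities. The first is valid, the per-class recall, or TPR (True Positive Rate). The second is unionValid, which combines recall and precision: unionValid = M(t,t)/(row sum + column sum - M(t,t)), where M(t,t) is the t-th diagonal entry of the matrix and the sums are taken over row t and column t.
The last one is the overall score, totalValid = sum(diag(M))/sum(M(:)).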
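The three formulas can be checked by hand on a small matrix. The following plain Lua sketch recomputes them from a table of counts; it only mirrors the formulas stated above and is not the optim implementation. The function name `confusion_stats` is invented for this example, and rows are assumed to index the true class, columns the predicted class.

```lua
-- M[t][p] counts samples whose true class is t and predicted class is p.
-- Assumes every class occurs at least once, so no row sum is zero.
local function confusion_stats(M)
    local n = #M
    local row, col = {}, {}
    local total, diag = 0, 0
    for i = 1, n do row[i], col[i] = 0, 0 end
    for t = 1, n do
        for p = 1, n do
            row[t] = row[t] + M[t][p]
            col[p] = col[p] + M[t][p]
            total = total + M[t][p]
        end
        diag = diag + M[t][t]
    end
    local valid, unionValid = {}, {}
    for t = 1, n do
        valid[t] = M[t][t] / row[t]                            -- recall of class t
        unionValid[t] = M[t][t] / (row[t] + col[t] - M[t][t])  -- intersection over union
    end
    return valid, unionValid, diag / total                     -- last value: overall accuracy
end

local M = { {8, 2}, {1, 9} }
local valid, unionValid, totalValid = confusion_stats(M)
print(valid[1], unionValid[1], totalValid)
```

For this matrix, class 1 has recall 8/10, unionValid 8/(10 + 9 - 8) = 8/11, and the overall accuracy is 17/20.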

Start training

if model then 
    parameters,gradParameters = model:getParameters()
end

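Note that Torch offers two ways to update the model parameters. One is to call updateParameters(learningRate) on the model; the other is to update by hand, i.e. parameters:add(-learningRate,gradParameters). For details see torch/nn/overview.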

Next, define the training function

function train()
    epoch = epoch or 1
    -- epoch counts the passes over the whole training set
    local time = sys.clock() -- current time
    shuffle = torch.randperm(trsize) -- random permutation of the sample order
    for t=1,trsize,opt.batchSize do -- loop over mini-batches
        xlua.progress(t,trainData:size()) -- progress bar
        local inputs = {} -- inputs of this batch
        local targets = {} -- ground-truth labels of this batch
        for i=t,math.min(t+opt.batchSize-1,trainData:size()) do -- min handles a last batch that is not full
            local input = trainData.data[shuffle[i]]:double()
            local target = trainData.labels[shuffle[i]]
            table.insert(inputs,input)
            table.insert(targets,target)
        end
        
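        -- local closure that serves as the interface to the optimizer
        local feval = function(x)
            if x~=parameters then
                parameters:copy(x)
            end
            
            gradParameters:zero() -- reset the gradients before every evaluation
            local f=0 -- accumulated error
            for i=1,#inputs do
                local output = model:forward(inputs[i])
                local err = criterion:forward(output,targets[i]) -- forward pass
                f=f+err -- accumulate the error
                
                local df_do = criterion:backward(output,targets[i])  -- gradient of the loss with respect to the output
                model:backward(inputs[i],df_do)         -- backpropagate; the gradients end up in gradParameters (explained below)
                
                local _, indices = torch.sort(output,true)
                confusion:add(indices[1],targets[i])
                -- update the confusion matrix; the arguments are the predicted and the true class, and add increments the entry at [true][predicted]
                -- ==Note== the tutorial code is wrong here: it does not sort output but passes it to confusion:add directly. Since output is a vector, only one row of the matrix ever got updated. It took me a long time to track this down.
            end
            
            gradParameters:div(#inputs)
            f=f/#inputs
            -- this is a batch, so take the mean
            return f, gradParameters
        end
        -- for the expected form of feval, see the definition of the optimization methods (link below)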
        -- run the optimizer
        if opt.optimization == 'CG' then 
            config = config or {maxIter = opt.maxIter}
            optim.cg(feval,parameters,config)
        elseif opt.optimization == 'SGD' then 
            config = config or {learningRate = opt.learningRate,
                            weightDecay = opt.weightDecay,
                            momentum = opt.momentum,
                            learningRateDecay = 5e-7}   -- the last field is the decay rate of the step size
            optim.sgd(feval,parameters,config)
        elseif opt.optimization=='LBFGS' then
            config = config or {learningRate = opt.learningRate,
                            maxIter = opt.maxIter,
                            nCorrection = 10}
            optim.lbfgs(feval,parameters,config)
        elseif opt.optimization=='ASGD' then
            config = config or {eta0 = opt.learningRate, t0 = trsize*opt.t0}
            _,_,average = optim.asgd(feval,parameters,config)
        else
            error ('unknown -optimization method')
        end
    end
    -- for the prototypes of the optimization functions, see [1]
    
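    -- log once per pass over the data
    time = sys.clock()-time -- elapsed time
    time = time/trainData:size() -- average time per sample
    
    print(confusion) -- print the confusion matrix
    -- confusion:zero() -- Note: the tutorial resets the confusion matrix here, which is too early; it has not been logged yet, so the reset belongs further down
    
    trainLogger:add{['% mean class accuracy (train set)'] = confusion.totalValid*100} -- what gets logged here is the accuracy
    if opt.plot then
        trainLogger:style{['% mean class accuracy (train set)']='-'}
        trainLogger:plot()  -- plot how the result changes as training proceeds
    end
    confusion:zero() -- reset the confusion matrix for the next pass; this is where it belongs
    local filename = paths.concat(opt.save,'model.net')
    os.execute('mkdir -p ' .. sys.dirname(filename)) -- create the output directory
    torch.save(filename,model) -- save the model to the file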
    
    epoch =epoch+1
    
end

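One thing that is a little hard to follow here: how do the parameters actually get updated each time a gradient is computed? We never see the update written out explicitly.
The reason is that 'parameters,gradParameters = model:getParameters()' returns references: the two tensors share their storage with the model rather than being copies. The optimizer then updates the parameters through them. For example, sgd contains the following code: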

...
        x:add(-clr,state.deltaParameters)
    else
        x:add(-clr,dfdx)
    end

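Here x is the parameters reference we passed in, and dfdx is the gradParameters reference returned by feval.
Likewise, 'model:backward(inputs[i],df_do)' modifies the values in gradParameters internally. Because everything is passed by reference, it does not need to return them.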
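The same mechanism can be reproduced without Torch, since Lua tables are also passed by reference. The sketch below is a toy stand-in for the optimizer interface: `toy_sgd` is a hypothetical function written for this illustration (it is not optim.sgd), and plain tables play the role of the parameter and gradient tensors.

```lua
-- toy_sgd mimics the calling convention "optimizer(feval, x, config)":
-- it calls feval(x) and then updates x in place.
local function toy_sgd(feval, x, config)
    local fx, dfdx = feval(x)
    for i = 1, #x do
        x[i] = x[i] - config.learningRate * dfdx[i]
    end
    return x, fx
end

local parameters = {3.0, -2.0}
local gradParameters = {0, 0}

-- feval returns the loss f(x) = sum(x_i^2) and its gradient 2 * x.
-- It fills the existing gradParameters table instead of creating a new one,
-- just as model:backward writes into the shared gradient storage.
local function feval(x)
    local f = 0
    for i = 1, #x do
        f = f + x[i] * x[i]
        gradParameters[i] = 2 * x[i]
    end
    return f, gradParameters
end

for iter = 1, 50 do
    toy_sgd(feval, parameters, {learningRate = 0.1})
end
print(parameters[1], parameters[2])  -- both have shrunk toward 0
```

The caller never assigns the result of `toy_sgd` back to `parameters`, yet the values change, because the optimizer and the caller hold the same table.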

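A supplementary note on how epoch, batchSize and iteration relate
Stochastic gradient descent feeds the samples to the model one after another. Once later samples have adjusted the model, there is no guarantee that the results for earlier samples are still good, so the samples have to be presented repeatedly, letting the system converge slowly to a state that works reasonably well for all of them.
1 epoch means training once on every sample in the training set.
batchSize is the number of samples used for each gradient update. With batchSize=1 this is the simplest stochastic gradient descent; with batchSize equal to the size of the training set it is (full-batch) gradient descent.
1 iteration means training once on batchSize samples.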
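As a small worked example of this relationship, the plain Lua sketch below counts the iterations in one epoch and lists the sample ranges of each mini-batch, using the same loop pattern as the train() function. The function names `iterations_per_epoch` and `batch_ranges` are invented for this illustration.

```lua
-- Number of parameter updates (iterations) needed to see every sample once.
local function iterations_per_epoch(n_samples, batch_size)
    return math.ceil(n_samples / batch_size)
end

-- First and last sample index of every mini-batch in one epoch,
-- using the same "for t = 1, n, batchSize" pattern as train().
local function batch_ranges(n_samples, batch_size)
    local ranges = {}
    for t = 1, n_samples, batch_size do
        -- min handles a last batch that is smaller than batch_size
        ranges[#ranges + 1] = {t, math.min(t + batch_size - 1, n_samples)}
    end
    return ranges
end

-- 10 samples with batchSize 4: batches 1-4, 5-8 and 9-10.
for _, r in ipairs(batch_ranges(10, 4)) do
    print(r[1], r[2])
end
```

With batchSize=1 the iteration count equals the number of samples, and with batchSize equal to the training-set size there is exactly one iteration per epoch.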