Caffe's SGDSolver class implements gradient descent with momentum. Its update rule is as follows, where \(lr\) (local_rate in the code) is the learning rate and \(m\) (momentum) is the momentum coefficient.
1. history_data = local_rate * param_diff + momentum * history_data
2. param_diff = history_data
3. param_data = param_data - param_diff
Steps 1 and 2 are implemented in the ComputeUpdateValue() function of the SGDSolver class; step 3 is identical for every optimization method. For that code, see the earlier post: Caffe源碼-SGDSolver類.
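As a quick illustration, the three steps can also be written out on plain arrays. This is only a minimal sketch with made-up names and data (param, diff, history), not Caffe code:

```cpp
#include <cstdio>

// Minimal sketch of the three momentum-SGD steps above on plain arrays.
// All names and values here are hypothetical; only the arithmetic matches.
int main() {
  const int n = 3;
  double param[n]   = {1.0, -2.0, 0.5};   // param_data
  double diff[n]    = {0.2, -0.1, 0.4};   // param_diff (gradient from the backward pass)
  double history[n] = {0.0, 0.0, 0.0};    // history_data (accumulated momentum)
  const double local_rate = 0.01, momentum = 0.9;

  for (int i = 0; i < n; ++i) {
    history[i] = local_rate * diff[i] + momentum * history[i];  // step 1
    diff[i] = history[i];                                       // step 2
    param[i] -= diff[i];                                        // step 3
    printf("param[%d] = %f\n", i, param[i]);
  }
  return 0;
}
```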
The NAG algorithm is implemented in the NesterovSolver class. Compared with SGD, the only difference in NAG is where the gradient is evaluated. As above, SGD uses the gradient \(\nabla_{\theta_{t}}\) of the parameters \(\theta_{t}\) at their current position, whereas NAG uses the gradient at the position reached after applying the momentum, \(\nabla_{(\theta_{t}-m*\nu_{t})}\). The update rule is therefore the same as above except for the gradient term: \(\nu_{t+1}=m*\nu_{t}+lr*\nabla_{(\theta_{t}-m*\nu_{t})}\), \(\theta_{t+1}=\theta_{t}-\nu_{t+1}\).
A figure commonly seen online illustrates the SGD and NAG update processes.
[Figure: SGD and NAG update paths]
For SGD, the blue vector \(p_{1}\) is the gradient step \(lr*\nabla_{\theta_{t}}\) of the parameters \(\theta_{t}\) at their current position, the blue vector \(p_{2}\) is the momentum \(m*\nu_{t}\), and \(p_{1}+p_{2}\) is the update \(\Delta\theta_{t+1}\) applied to the parameters in one iteration.
For NAG, \(O_{1}\) is the initial position of the parameters \(\theta_{t}\), and the brown vector \(p_{3}=p_{2}\). First the position \(O_{2}\) of the momentum-shifted parameters \(\tilde{\theta}_{t+1}\) is computed, then the gradient at that position, \(\nabla_{\tilde{\theta}_{t+1}}\), which is the red vector \(p_{4}\). The sum \(p_{5}=p_{3}+p_{4}\) is the update applied in one iteration, \(\Delta\theta_{t+1}=\nu_{t+1}\). The next iteration repeats the same procedure: the momentum \(m*\nu_{t+1}\) (brown vector \(p_{6}\)) and the gradient \(\nabla_{\tilde{\theta}_{t+2}}\) (red vector \(p_{7}\)) are computed, yielding the update \(p_{8}\).
The idea behind NAG is easy enough to understand, but its implementation has one point that is quite hard to grasp: how is the gradient at the temporarily updated position, \(\nabla_{\tilde{\theta}_{t+1}}\), obtained? In a system as complex as a neural network, it is practically impossible to estimate the gradient at another position \(\nabla_{\tilde{\theta}_{t+1}}\) from the gradient at the current position \(\nabla_{\theta_{t}}\). Very little is written online about this implementation detail, but from the Caffe code and other open-source implementations one can conclude that NAG stores, at every iteration, the temporary parameters \(\tilde{\theta}_{t+1}\) (position \(O_{2}\)) rather than the parameters \(\theta_{t}\) at the original position \(O_{1}\). The gradient computed by each backward pass is therefore exactly the red vector \(p_{4}\). At each update the parameters are first stepped back from \(O_{2}\) to \(O_{1}\) according to the momentum \(p_{3}\); the update \(p_{5}\) for this iteration is then computed, the parameters are advanced to \(\theta_{t+1}\) (position \(O_{3}\)), and the temporary parameters \(\tilde{\theta}_{t+2}\) (position \(O_{4}\)) needed by the next iteration are stored.
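To make the "store the look-ahead parameters" trick concrete, here is a small self-contained check (my own sketch on a 1-D quadratic, not Caffe code). One branch runs the textbook NAG update and keeps \(\theta\); the other keeps only \(\tilde{\theta}\) and applies the "step back then over-step" update used by Caffe. The stored \(\tilde{\theta}\) stays equal to \(\theta-m*\nu\) at every iteration:

```cpp
#include <cstdio>

// Toy 1-D objective f(theta) = theta^2, gradient 2 * theta.
static double grad(double theta) { return 2.0 * theta; }

int main() {
  const double lr = 0.1, m = 0.9;
  const int steps = 20;

  // "Textbook" NAG: keep theta, evaluate the gradient at the look-ahead point.
  double theta = 1.0, v = 0.0;
  // Caffe-style NAG: keep only the look-ahead parameters theta_tilde = theta - m*v.
  double theta_tilde = 1.0, v_c = 0.0;

  for (int t = 0; t < steps; ++t) {
    // textbook form
    double v_new = m * v + lr * grad(theta - m * v);
    theta -= v_new;
    v = v_new;

    // caffe form: the gradient is taken at the stored (look-ahead) parameters,
    // then "step back then over step": update = (1+m)*v_new - m*v_old
    double g = grad(theta_tilde);
    double vc_new = m * v_c + lr * g;
    theta_tilde -= (1.0 + m) * vc_new - m * v_c;
    v_c = vc_new;

    // the stored caffe parameters should equal theta - m*v at every step
    printf("t=%2d  theta=%.6f  theta_tilde=%.6f  theta-m*v=%.6f\n",
           t, theta, theta_tilde, theta - m * v);
  }
  return 0;
}
```

With this in mind, the NesterovSolver code below follows directly.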
```cpp
//Given the learning rate `rate` for the current iteration, compute the gradient used
//to update the param_id-th learnable parameter of the net
template <typename Dtype>
void NesterovSolver<Dtype>::ComputeUpdateValue(int param_id, Dtype rate) {
  const vector<Blob<Dtype>*>& net_params = this->net_->learnable_params();  //all learnable parameters of the net
  const vector<float>& net_params_lr = this->net_->params_lr();  //per-parameter learning-rate multipliers
  Dtype momentum = this->param_.momentum();            //momentum set in the solver
  Dtype local_rate = rate * net_params_lr[param_id];   //effective learning rate for this parameter
  switch (Caffe::mode()) {
  case Caffe::CPU: {    //CPU mode
    // save history momentum for stepping back
    caffe_copy(net_params[param_id]->count(),
        this->history_[param_id]->cpu_data(),
        this->update_[param_id]->mutable_cpu_data());  //copy history_ into update_, update_data = history_data

    // update history
    //history_data = local_rate * net_params_diff + momentum * history_data
    caffe_cpu_axpby(net_params[param_id]->count(), local_rate,
        net_params[param_id]->cpu_diff(), momentum,
        this->history_[param_id]->mutable_cpu_data());

    // compute update: step back then over step
    //update_data = (1 + momentum) * history_data + (-momentum) * update_data
    caffe_cpu_axpby(net_params[param_id]->count(), Dtype(1) + momentum,
        this->history_[param_id]->cpu_data(), -momentum,
        this->update_[param_id]->mutable_cpu_data());

    // copy
    //net_params_diff = update_data
    caffe_copy(net_params[param_id]->count(),
        this->update_[param_id]->cpu_data(),
        net_params[param_id]->mutable_cpu_diff());
    break;
  }
  case Caffe::GPU: {
#ifndef CPU_ONLY
    // the GPU path performs the same operations
    // h_temp = history_data
    // history_data = momentum * h_temp + local_rate * net_params_diff
    // net_params_diff = (1+momentum) * history_data - momentum * h_temp
    nesterov_update_gpu(net_params[param_id]->count(),
        net_params[param_id]->mutable_gpu_diff(),
        this->history_[param_id]->mutable_gpu_data(),
        momentum, local_rate);
#else
    NO_GPU;
#endif
    break;
  }
  default:
    LOG(FATAL) << "Unknown caffe mode: " << Caffe::mode();
  }
}
```
Matching the description above, the steps in the code are:

1. Save the previous momentum so that it can be stepped back later: update_data = history_data
2. net_params_diff is the gradient \(\nabla_{\tilde{\theta}_{t+1}}\) at the temporary position; compute the new momentum: history_data = local_rate * net_params_diff + momentum * history_data
3. Step back and then over-step: update_data = (1 + momentum) * history_data + (-momentum) * update_data
4. Copy the result into the parameter gradient: net_params_diff = update_data
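The two caffe_cpu_axpby calls compute \(y \leftarrow a*x + b*y\). As a sanity check of the four steps above, here is a hand-rolled axpby applied to plain arrays (my own sketch with made-up data, not Caffe's actual buffers):

```cpp
#include <vector>
#include <cstdio>

// y = a * x + b * y, the semantics of caffe_cpu_axpby, on plain vectors.
void axpby(double a, const std::vector<double>& x, double b, std::vector<double>& y) {
  for (size_t i = 0; i < y.size(); ++i) y[i] = a * x[i] + b * y[i];
}

int main() {
  const double local_rate = 0.01, momentum = 0.9;
  std::vector<double> diff    = {0.3, -0.2};   // gradient at the stored (look-ahead) parameters
  std::vector<double> history = {0.05, 0.01};  // previous momentum v_t
  std::vector<double> update(2);

  update = history;                                   // 1. update_data = history_data
  axpby(local_rate, diff, momentum, history);         // 2. history_data = lr*diff + m*history_data
  axpby(1.0 + momentum, history, -momentum, update);  // 3. update_data = (1+m)*history_data - m*update_data
  diff = update;                                      // 4. net_params_diff = update_data

  for (size_t i = 0; i < diff.size(); ++i) printf("diff[%zu] = %f\n", i, diff[i]);
  return 0;
}
```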
AdaGrad scales the update of each parameter inversely proportionally to the square root of the sum of all of its historical squared gradients, so that the effective learning rate of parameters with large gradients falls off quickly while that of parameters with small partial derivatives falls off slowly.
Its update rule is as follows (matching the comments in the GPU code below). The accumulation variable is initialized to \(r=0\), and \(\delta\) is a small constant that prevents instability when the divisor becomes too small.

\(r_{t+1}=r_{t}+\nabla_{\theta_{t}}\odot\nabla_{\theta_{t}}\)
\(\Delta\theta_{t+1}=lr*\frac{\nabla_{\theta_{t}}}{\sqrt{r_{t+1}}+\delta}\)
\(\theta_{t+1}=\theta_{t}-\Delta\theta_{t+1}\)
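Before the Caffe code, here is a minimal per-element sketch of this rule on made-up data (hypothetical names, not Caffe code):

```cpp
#include <cmath>
#include <cstdio>

// Per-element AdaGrad step: history accumulates squared gradients,
// and the step is g / (sqrt(history) + delta), scaled by the learning rate.
int main() {
  const int n = 2;
  double param[n]   = {1.0, -1.0};
  double grad[n]    = {0.5, 0.02};     // kept constant here just for illustration
  double history[n] = {0.0, 0.0};      // r: accumulated squared gradients
  const double lr = 0.1, delta = 1e-8;

  for (int step = 0; step < 3; ++step) {
    for (int i = 0; i < n; ++i) {
      history[i] += grad[i] * grad[i];                               // r += g*g
      double update = lr * grad[i] / (std::sqrt(history[i]) + delta);
      param[i] -= update;
      printf("step %d: param[%d] = %f (update %f)\n", step, i, param[i], update);
    }
  }
  return 0;
}
```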
```cpp
template <typename Dtype>
void AdaGradSolver<Dtype>::ComputeUpdateValue(int param_id, Dtype rate) {
  const vector<Blob<Dtype>*>& net_params = this->net_->learnable_params();
  const vector<float>& net_params_lr = this->net_->params_lr();
  Dtype delta = this->param_.delta();
  Dtype local_rate = rate * net_params_lr[param_id];
  switch (Caffe::mode()) {
  case Caffe::CPU: {
    // compute square of gradient in update
    caffe_powx(net_params[param_id]->count(),
        net_params[param_id]->cpu_diff(), Dtype(2),
        this->update_[param_id]->mutable_cpu_data());  //update_data = net_params_diff ^ 2

    // update history
    caffe_add(net_params[param_id]->count(),
        this->update_[param_id]->cpu_data(),
        this->history_[param_id]->cpu_data(),
        this->history_[param_id]->mutable_cpu_data());  //history_data = update_data + history_data

    // prepare update
    caffe_powx(net_params[param_id]->count(),
        this->history_[param_id]->cpu_data(), Dtype(0.5),
        this->update_[param_id]->mutable_cpu_data());   //update_data = history_data ^ 0.5

    caffe_add_scalar(net_params[param_id]->count(),
        delta, this->update_[param_id]->mutable_cpu_data());  //update_data += delta

    caffe_div(net_params[param_id]->count(),
        net_params[param_id]->cpu_diff(),
        this->update_[param_id]->cpu_data(),
        this->update_[param_id]->mutable_cpu_data());   //update_data = net_params_diff / update_data

    // scale and copy
    caffe_cpu_axpby(net_params[param_id]->count(), local_rate,
        this->update_[param_id]->cpu_data(), Dtype(0),
        net_params[param_id]->mutable_cpu_diff());      //net_params_diff = local_rate * update_data + 0 * net_params_diff
    break;
  }
  case Caffe::GPU: {    //the GPU path is analogous
#ifndef CPU_ONLY
    // gi = net_params_diff;
    // hi = history_data = history_data + gi*gi;
    // net_params_diff = local_rate * gi / (sqrt(hi) + delta);
    adagrad_update_gpu(net_params[param_id]->count(),
        net_params[param_id]->mutable_gpu_diff(),
        this->history_[param_id]->mutable_gpu_data(), delta, local_rate);
#else
    NO_GPU;
#endif
    break;
  }
  default:
    LOG(FATAL) << "Unknown caffe mode: " << Caffe::mode();
  }
}
```
For the AdaGrad/RMSProp/AdaDelta/Adam solvers, the Caffe code maps directly onto the corresponding formulas, so it is not discussed in further detail.
RMSProp adds a decay coefficient \(\rho\) on top of AdaGrad so that gradient history from long ago is gradually discarded.
Its update rule is as follows (\(\rho\) is rms_decay in the code). The accumulation variable is initialized to \(r=0\), and \(\delta\) is again a small constant.

\(r_{t+1}=\rho*r_{t}+(1-\rho)*\nabla_{\theta_{t}}\odot\nabla_{\theta_{t}}\)
\(\Delta\theta_{t+1}=lr*\frac{\nabla_{\theta_{t}}}{\sqrt{r_{t+1}}+\delta}\)
\(\theta_{t+1}=\theta_{t}-\Delta\theta_{t+1}\)
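A per-element sketch of this rule (hypothetical names and data, not Caffe code); the only change from the AdaGrad sketch is that the squared-gradient history decays:

```cpp
#include <cmath>
#include <cstdio>

// Per-element RMSProp step: the squared-gradient history decays with rms_decay,
// so old gradients gradually fade out of the denominator.
void rmsprop_step(int n, double* param, const double* grad, double* history,
                  double rms_decay, double delta, double lr) {
  for (int i = 0; i < n; ++i) {
    history[i] = rms_decay * history[i] + (1.0 - rms_decay) * grad[i] * grad[i];
    param[i] -= lr * grad[i] / (std::sqrt(history[i]) + delta);
  }
}

int main() {
  double param[2] = {1.0, -1.0}, grad[2] = {0.5, 0.02}, history[2] = {0.0, 0.0};
  rmsprop_step(2, param, grad, history, 0.99, 1e-8, 0.1);
  printf("param = {%f, %f}\n", param[0], param[1]);
  return 0;
}
```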
```cpp
template <typename Dtype>
void RMSPropSolver<Dtype>::ComputeUpdateValue(int param_id, Dtype rate) {
  const vector<Blob<Dtype>*>& net_params = this->net_->learnable_params();  //all learnable parameters
  const vector<float>& net_params_lr = this->net_->params_lr();   //per-parameter learning-rate multipliers

  // get the learning rate
  Dtype delta = this->param_.delta();                  //the small constant delta
  Dtype rms_decay = this->param_.rms_decay();          //decay rate
  Dtype local_rate = rate * net_params_lr[param_id];   //effective learning rate for this parameter

  switch (Caffe::mode()) {
  case Caffe::CPU:
    // compute square of gradient in update
    caffe_powx(net_params[param_id]->count(),
        net_params[param_id]->cpu_diff(), Dtype(2),
        this->update_[param_id]->mutable_cpu_data());  //update_data = net_params_diff ^ 2

    // update history
    //history_data = (1-rms_decay) * update_data + rms_decay * history_data
    caffe_cpu_axpby(net_params[param_id] -> count(),
        Dtype(1-rms_decay), this->update_[param_id]->cpu_data(),
        rms_decay, this->history_[param_id]-> mutable_cpu_data());

    // prepare update
    caffe_powx(net_params[param_id]->count(),
        this->history_[param_id]->cpu_data(), Dtype(0.5),
        this->update_[param_id]->mutable_cpu_data());  //update_data = history_data ^ 0.5

    caffe_add_scalar(net_params[param_id]->count(),
        delta, this->update_[param_id]->mutable_cpu_data());  //update_data += delta

    //update_data = net_params_diff / update_data
    caffe_div(net_params[param_id]->count(),
        net_params[param_id]->cpu_diff(), this->update_[param_id]->cpu_data(),
        this->update_[param_id]->mutable_cpu_data());

    // scale and copy
    caffe_cpu_axpby(net_params[param_id]->count(), local_rate,
        this->update_[param_id]->cpu_data(), Dtype(0),
        net_params[param_id]->mutable_cpu_diff());  //net_params_diff = local_rate * update_data + 0 * net_params_diff
    break;
  case Caffe::GPU:
#ifndef CPU_ONLY
    // g = net_params_diff
    // h = history_data
    // gi = g[i];
    // hi = h[i] = rms_decay*h[i] + (1-rms_decay)*gi*gi;
    // g[i] = local_rate * g[i] / (sqrt(hi) + delta);
    rmsprop_update_gpu(net_params[param_id]->count(),
        net_params[param_id]->mutable_gpu_diff(),
        this->history_[param_id]->mutable_gpu_data(),
        rms_decay, delta, local_rate);
#else
    NO_GPU;
#endif
    break;
  default:
    LOG(FATAL) << "Unknown caffe mode: " << Caffe::mode();
  }
}
```
Like RMSProp, AdaDelta adds a decay coefficient \(\rho\) on top of AdaGrad, and it additionally maintains a second state variable \(x\) that accumulates squared updates.
Its update rule is as follows (matching the comments in the GPU code below; \(\rho\) is the momentum parameter in the code, and note that unlike the original AdaDelta paper, Caffe still multiplies the step by the learning rate \(lr\)). The accumulation variables are initialized to \(x=0, r=0\), and \(\delta\) is again a small constant.

\(r_{t+1}=\rho*r_{t}+(1-\rho)*\nabla_{\theta_{t}}\odot\nabla_{\theta_{t}}\)
\(u_{t+1}=\sqrt{\frac{x_{t}+\delta}{r_{t+1}+\delta}}\odot\nabla_{\theta_{t}}\)
\(x_{t+1}=\rho*x_{t}+(1-\rho)*u_{t+1}\odot u_{t+1}\)
\(\theta_{t+1}=\theta_{t}-lr*u_{t+1}\)
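A per-element sketch of this rule with its two accumulators (hypothetical names and data, not Caffe code):

```cpp
#include <cmath>
#include <cstdio>

// Per-element AdaDelta step: r accumulates squared gradients, x accumulates
// squared (unscaled) updates; the applied step is additionally scaled by lr,
// as in the Caffe code.
void adadelta_step(int n, double* param, const double* grad,
                   double* r, double* x,
                   double rho, double delta, double lr) {
  for (int i = 0; i < n; ++i) {
    r[i] = rho * r[i] + (1.0 - rho) * grad[i] * grad[i];
    double u = grad[i] * std::sqrt((x[i] + delta) / (r[i] + delta));  // unscaled update
    x[i] = rho * x[i] + (1.0 - rho) * u * u;
    param[i] -= lr * u;
  }
}

int main() {
  double param[2] = {1.0, -1.0}, grad[2] = {0.5, 0.02};
  double r[2] = {0.0, 0.0}, x[2] = {0.0, 0.0};
  adadelta_step(2, param, grad, r, x, 0.95, 1e-6, 1.0);
  printf("param = {%f, %f}\n", param[0], param[1]);
  return 0;
}
```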
```cpp
template <typename Dtype>
void AdaDeltaSolver<Dtype>::AdaDeltaPreSolve() {   //called when the AdaDeltaSolver is constructed
  // Add the extra history entries for AdaDelta after those from SGDSolver::PreSolve
  const vector<Blob<Dtype>*>& net_params = this->net_->learnable_params();  //all learnable parameters of the net
  for (int i = 0; i < net_params.size(); ++i) {
    const vector<int>& shape = net_params[i]->shape();   //shape of the i-th learnable parameter
    //SGDSolver<Dtype>::PreSolve already pushed one empty blob of the same shape as the
    //parameter into history_; push a second one here
    this->history_.push_back(shared_ptr<Blob<Dtype> >(new Blob<Dtype>(shape)));
  }
}

#ifndef CPU_ONLY
template <typename Dtype>
void adadelta_update_gpu(int N, Dtype* g, Dtype* h, Dtype* h2, Dtype momentum,
    Dtype delta, Dtype local_rate);
#endif

template <typename Dtype>
void AdaDeltaSolver<Dtype>::ComputeUpdateValue(int param_id, Dtype rate) {
  const vector<Blob<Dtype>*>& net_params = this->net_->learnable_params();  //all learnable parameters of the net
  const vector<float>& net_params_lr = this->net_->params_lr();   //per-parameter learning-rate multipliers
  Dtype delta = this->param_.delta();                  //the delta parameter of AdaDelta
  Dtype momentum = this->param_.momentum();            //momentum (decay) coefficient
  Dtype local_rate = rate * net_params_lr[param_id];   //effective learning rate for this parameter
  size_t update_history_offset = net_params.size();    //number of parameters in the net
  //AdaDeltaPreSolve() stored a second empty blob of the same shape as every parameter in history_;
  //below, history_[param_id] is written as history_former and
  //history_[update_history_offset + param_id] as history_latter
  switch (Caffe::mode()) {
  case Caffe::CPU: {
    // compute square of gradient in update
    caffe_powx(net_params[param_id]->count(),
        net_params[param_id]->cpu_diff(), Dtype(2),
        this->update_[param_id]->mutable_cpu_data());  //update_data = net_params_diff ^ 2

    // update history of gradients
    //history_former_data = (1 - momentum) * update_data + momentum * history_former_data
    caffe_cpu_axpby(net_params[param_id]->count(), Dtype(1) - momentum,
        this->update_[param_id]->cpu_data(), momentum,
        this->history_[param_id]->mutable_cpu_data());

    // add delta to history to guard against dividing by zero later
    caffe_set(net_params[param_id]->count(), delta,
        this->temp_[param_id]->mutable_cpu_data());    //set every element of temp_ to delta, temp_data = delta

    caffe_add(net_params[param_id]->count(),
        this->temp_[param_id]->cpu_data(),
        this->history_[update_history_offset + param_id]->cpu_data(),
        this->update_[param_id]->mutable_cpu_data());  //update_data = temp_data + history_latter_data

    caffe_add(net_params[param_id]->count(),
        this->temp_[param_id]->cpu_data(),
        this->history_[param_id]->cpu_data(),
        this->temp_[param_id]->mutable_cpu_data());    //temp_data = temp_data + history_former_data

    // divide history of updates by history of gradients
    caffe_div(net_params[param_id]->count(),
        this->update_[param_id]->cpu_data(),
        this->temp_[param_id]->cpu_data(),
        this->update_[param_id]->mutable_cpu_data());  //update_data = update_data / temp_data

    // jointly compute the RMS of both for update and gradient history
    caffe_powx(net_params[param_id]->count(),
        this->update_[param_id]->cpu_data(), Dtype(0.5),
        this->update_[param_id]->mutable_cpu_data());  //update_data = update_data ^ 0.5

    // compute the update
    caffe_mul(net_params[param_id]->count(),
        net_params[param_id]->cpu_diff(),
        this->update_[param_id]->cpu_data(),
        net_params[param_id]->mutable_cpu_diff());     //net_params_diff = net_params_diff * update_data

    // compute square of update
    caffe_powx(net_params[param_id]->count(),
        net_params[param_id]->cpu_diff(), Dtype(2),
        this->update_[param_id]->mutable_cpu_data());  //update_data = net_params_diff ^ 2

    // update history of updates
    //history_latter_data = (1 - momentum) * update_data + momentum * history_latter_data
    caffe_cpu_axpby(net_params[param_id]->count(), Dtype(1) - momentum,
        this->update_[param_id]->cpu_data(), momentum,
        this->history_[update_history_offset + param_id]->mutable_cpu_data());

    // apply learning rate
    caffe_cpu_scale(net_params[param_id]->count(), local_rate,
        net_params[param_id]->cpu_diff(),
        net_params[param_id]->mutable_cpu_diff());     //net_params_diff = local_rate * net_params_diff
    break;
  }
  case Caffe::GPU: {
#ifndef CPU_ONLY
    // g = net_params_diff; h = history_former_data; h2 = history_latter_data;
    // gi = g[i];
    // hi = h[i] = momentum * h[i] + (1-momentum) * gi * gi;
    // gi = gi * sqrt((h2[i] + delta) / (hi + delta));
    // h2[i] = momentum * h2[i] + (1-momentum) * gi * gi;
    // g[i] = local_rate * gi;
    adadelta_update_gpu(net_params[param_id]->count(),
        net_params[param_id]->mutable_gpu_diff(),
        this->history_[param_id]->mutable_gpu_data(),
        this->history_[update_history_offset + param_id]->mutable_gpu_data(),
        momentum, delta, local_rate);
#else
    NO_GPU;
#endif
    break;
  }
  default:
    LOG(FATAL) << "Unknown caffe mode: " << Caffe::mode();
  }
}
```
Adam uses two decay parameters \(\rho_{1}\) and \(\rho_{2}\), typically \(\rho_{1}=0.9, \rho_{2}=0.999\), together with first- and second-moment variables \(s, r\) and a time step \(t\).
Initially \(s=0, r=0, t=0\), and \(\delta\) is again a small constant. Each iteration increments the time step and proceeds as follows (matching the comments in the GPU code below).
Update the biased moments: \(s_{t+1}=\rho_{1}*s_{t}+(1-\rho_{1})*\nabla_{\theta_{t}}\), \(r_{t+1}=\rho_{2}*r_{t}+(1-\rho_{2})*\nabla_{\theta_{t}}\odot\nabla_{\theta_{t}}\)
Correct their bias: \(\tilde{s}_{t+1}=\frac{s_{t+1}}{1-\rho_{1}^{t+1}}\), \(\tilde{r}_{t+1}=\frac{r_{t+1}}{1-\rho_{2}^{t+1}}\)
Compute the update: \(\Delta\theta_{t+1}=lr*\frac{\tilde{s}_{t+1}}{\sqrt{\tilde{r}_{t+1}}+\delta}\)
Apply the update: \(\theta_{t+1}=\theta_{t}-\Delta\theta_{t+1}\)
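A per-element sketch of this rule (hypothetical names and data, not Caffe code). As in Caffe, the bias correction is folded into a single scalar applied to the learning rate:

```cpp
#include <cmath>
#include <cstdio>

// Per-element Adam step: s and r are the first and second moments, and the
// bias correction is folded into correction = sqrt(1 - rho2^t) / (1 - rho1^t).
void adam_step(int n, double* param, const double* grad,
               double* s, double* r, int t,
               double rho1, double rho2, double delta, double lr) {
  const double correction = std::sqrt(1.0 - std::pow(rho2, t)) / (1.0 - std::pow(rho1, t));
  for (int i = 0; i < n; ++i) {
    s[i] = rho1 * s[i] + (1.0 - rho1) * grad[i];            // first moment
    r[i] = rho2 * r[i] + (1.0 - rho2) * grad[i] * grad[i];  // second moment
    param[i] -= lr * correction * s[i] / (std::sqrt(r[i]) + delta);
  }
}

int main() {
  double param[2] = {1.0, -1.0}, grad[2] = {0.5, 0.02};
  double s[2] = {0.0, 0.0}, r[2] = {0.0, 0.0};
  adam_step(2, param, grad, s, r, /*t=*/1, 0.9, 0.999, 1e-8, 0.001);
  printf("param = {%f, %f}\n", param[0], param[1]);
  return 0;
}
```

Folding the correction into the learning rate is exactly what the Caffe code does; it differs from dividing each moment separately only in how \(\delta\) is scaled.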
```cpp
template <typename Dtype>
void AdamSolver<Dtype>::AdamPreSolve() {
  // Add the extra history entries for Adam after those from SGDSolver::PreSolve
  const vector<Blob<Dtype>*>& net_params = this->net_->learnable_params();  //all learnable parameters
  for (int i = 0; i < net_params.size(); ++i) {
    const vector<int>& shape = net_params[i]->shape();   //shape of the i-th learnable parameter
    this->history_.push_back(shared_ptr<Blob<Dtype> >(new Blob<Dtype>(shape)));  //push a second empty blob of the same size into history_
  }
}

#ifndef CPU_ONLY
template <typename Dtype>
void adam_update_gpu(int N, Dtype* g, Dtype* m, Dtype* v, Dtype beta1,
    Dtype beta2, Dtype eps_hat, Dtype corrected_local_rate);
#endif

template <typename Dtype>
void AdamSolver<Dtype>::ComputeUpdateValue(int param_id, Dtype rate) {
  const vector<Blob<Dtype>*>& net_params = this->net_->learnable_params();  //all learnable parameters
  const vector<float>& net_params_lr = this->net_->params_lr();  //per-parameter learning-rate multipliers
  Dtype local_rate = rate * net_params_lr[param_id];   //effective learning rate for this parameter
  const Dtype beta1 = this->param_.momentum();         //the two moment decay coefficients
  const Dtype beta2 = this->param_.momentum2();

  // we create aliases for convenience
  size_t update_history_offset = net_params.size();    //history_ has size 2 * update_history_offset
  Blob<Dtype>* val_m = this->history_[param_id].get();
  Blob<Dtype>* val_v = this->history_[param_id + update_history_offset].get();
  Blob<Dtype>* val_t = this->temp_[param_id].get();

  const int t = this->iter_ + 1;   //time step
  const Dtype correction = std::sqrt(Dtype(1) - pow(beta2, t)) /
      (Dtype(1.) - pow(beta1, t));                      //correction = sqrt(1 - beta2 ^ t) / (1 - beta1 ^ t)
  const int N = net_params[param_id]->count();          //number of elements in the parameter
  const Dtype eps_hat = this->param_.delta();           //small constant

  switch (Caffe::mode()) {
  case Caffe::CPU: {
    // update m <- \beta_1 m_{t-1} + (1-\beta_1)g_t
    caffe_cpu_axpby(N, Dtype(1)-beta1,
        net_params[param_id]->cpu_diff(), beta1,
        val_m->mutable_cpu_data());   //val_m = (1 - beta1) * net_params_diff + beta1 * val_m

    // update v <- \beta_2 m_{t-1} + (1-\beta_2)g_t^2
    caffe_mul(N,
        net_params[param_id]->cpu_diff(),
        net_params[param_id]->cpu_diff(),
        val_t->mutable_cpu_data());   //val_t = net_params_diff * net_params_diff
    caffe_cpu_axpby(N, Dtype(1)-beta2,
        val_t->cpu_data(), beta2,
        val_v->mutable_cpu_data());   //val_v = (1 - beta2) * val_t + beta2 * val_v

    // set update
    caffe_powx(N,
        val_v->cpu_data(), Dtype(0.5),
        val_t->mutable_cpu_data());   //val_t = val_v ^ 0.5
    caffe_add_scalar(N, eps_hat, val_t->mutable_cpu_data());   //val_t += eps_hat
    caffe_div(N,
        val_m->cpu_data(),
        val_t->cpu_data(),
        val_t->mutable_cpu_data());   //val_t = val_m / val_t

    caffe_cpu_scale(N, local_rate*correction,
        val_t->cpu_data(),
        net_params[param_id]->mutable_cpu_diff());   //net_params_diff = local_rate * correction * val_t
    break;
  }
  case Caffe::GPU: {
#ifndef CPU_ONLY
    // g = net_params_diff; m = val_m; v = val_v
    // gi = g[i];
    // mi = m[i] = m[i]*beta1 + gi*(1-beta1);
    // vi = v[i] = v[i]*beta2 + gi*gi*(1-beta2);
    // g[i] = local_rate * correction * mi / (sqrt(vi) + eps_hat);
    adam_update_gpu(N, net_params[param_id]->mutable_gpu_diff(),
        val_m->mutable_gpu_data(), val_v->mutable_gpu_data(), beta1, beta2,
        eps_hat, local_rate*correction);
#else
    NO_GPU;
#endif
    break;
  }
  default:
    LOG(FATAL) << "Unknown caffe mode: " << Caffe::mode();
  }
}
```
This is my first time reading through the Caffe source code, and I am taking notes as I go, so my understanding and analysis of the code may contain mistakes or omissions. Corrections from readers are very welcome. Thank you for your support!