Reading the Caffe Source Code

Date: 2019-11-16

Tags: caffe, source code reading

Structure

There are two main directories:
src: the source implementation
include: the header files

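Within src, most of the code lives in the caffe directory, which contains net.cpp, solver.cpp, blob.cpp, layer.cpp and common.cpp. The layers directory holds the individual layer implementations and is the core of caffe. proto contains a single file, caffe.proto, which uses the protobuf language to describe the member fields of the various objects. solvers provides the different optimizers (sgd, adam, rmsprop, adagrad), test holds the unit tests, and util collects commonly used helper functions: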

├── caffe
│   ├── layers
│   ├── proto
│   ├── solvers
│   ├── test
│   │   └── test_data
│   └── util
└── gtest

First, the .cpp files directly under the caffe directory:

blob.cpp
common.cpp
data_transformer.cpp
internal_thread.cpp
layer.cpp
layer_factory.cpp
net.cpp
parallel.cpp
solver.cpp
syncedmem.cpp

blob.cpp implements Blob, the main data type passed around in caffe.
common.cpp

Starting from tools

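The root directory has a tools directory, which is mainly used to build the caffe executable. The executable exposes a number of command-line arguments, and you use caffe by configuring them.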

caffe.cpp
compute_image_mean.cpp
convert_imageset.cpp
device_query.cpp
extract_features.cpp
finetune_net.cpp
net_speed_benchmark.cpp
test_net.cpp
train_net.cpp
upgrade_net_proto_binary.cpp
upgrade_net_proto_text.cpp
upgrade_solver_proto_text.cpp

caffe.cpp, which contains main(), registers several functions in g_brew_map: train, test, time and device_query.

Start with the train function. It parses the solver settings into a solver_param object:

caffe::SolverParameter solver_param;
  caffe::ReadSolverParamsFromTextFileOrDie(FLAGS_solver, &solver_param);

SolverRegistry::CreateSolver then creates a solver object, which holds a shared_ptr<Net<Dtype> > net_ member:

shared_ptr<caffe::Solver<float> >
      solver(caffe::SolverRegistry<float>::CreateSolver(solver_param));

The Net object is the body of the whole network. What does a Net contain? The three most important members are layers_, params_ and blobs_:

template <typename Dtype>
class Net {
private:
  vector<shared_ptr<Layer<Dtype> > > layers_;
  vector<shared_ptr<Blob<Dtype> > > params_;
  vector<shared_ptr<Blob<Dtype> > > blobs_;
};

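layers_ holds the basic building blocks of the network. params_ holds each layer's filter parameters; it shares data with each layer's own blobs_ member, that is, params_ stores pointers to the layers' blobs_. blobs_ holds the intermediate data between layers.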

The Net constructor takes a NetParameter and simply calls Init:

template <typename Dtype>
Net<Dtype>::Net(const NetParameter& param) {
  Init(param);
}

NetParameter is defined in caffe.proto as follows:

message NetParameter {
  optional string name = 1; 
  repeated string input = 3;
  repeated BlobShape input_shape = 8;
  repeated int32 input_dim = 4;
  optional bool force_backward = 5 [default = false];
  optional NetState state = 6;
  repeated LayerParameter layer = 100;  // ID 100 so layers are printed last.
}

message LayerParameter {
  optional string name = 1; // the layer name
  optional string type = 2; // the layer type
  repeated string bottom = 3; // the name of each bottom blob
  repeated string top = 4; // the name of each top blob
  // The blobs containing the numeric parameters of the layer.
  repeated BlobProto blobs = 7;
  optional TransformationParameter transform_param = 100;
}

The core of NetParameter is LayerParameter.
The core of LayerParameter (the definition above is simplified) is the bottom names, the top names, and the parameter blobs.

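This NetParameter is read and initialized through protobuf from train.prototxt and vgg.caffemodel, and is then used to construct the Net object. Once the Net exists, the whole network is built.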

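After that, solver->Solve() starts training. Solve() calls Step(), which runs the iterations: it contains a loop in which each pass is one iter, and each iter performs iter_size forward/backward passes (ForwardBackward()), averages the loss over them for reporting, and then applies the update.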

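The iter_size parameter exists for the case where GPU memory is too small for a large batch size. Gradients are accumulated over iter_size passes before each update, so the number of samples behind one update is iter_size * batch_size, which has the same effect as using the larger batch size directly. For example, if a network trains well with batch_size = 128 but GPU memory only fits 32 images, you can set batch_size to 32 and iter_size to 4 and get the same effect as batch_size = 128.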

while (iter_ < stop_iter) {
    // ...
    Dtype loss = 0;
    for (int i = 0; i < param_.iter_size(); ++i) {
      loss += net_->ForwardBackward();
    }
    loss /= param_.iter_size();
    // average the loss across iterations for smoothed reporting
    UpdateSmoothedLoss(loss, start_iter, average_loss);
    // ...
    ApplyUpdate();
    // ...
  }
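
To make the iter_size argument concrete, here is a minimal sketch on a toy problem: one scalar weight and a squared loss. The helpers grad_batch and accumulated_grad are hypothetical names invented for this illustration, not Caffe functions, and the sketch assumes the accumulated gradient is divided by iter_size before the update.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Mean gradient of the toy loss 0.5 * (w * x - y)^2 with respect to w,
// taken over samples [begin, end). Hypothetical helper, not a Caffe API.
double grad_batch(double w, const std::vector<double>& x,
                  const std::vector<double>& y,
                  std::size_t begin, std::size_t end) {
  assert(begin < end && end <= x.size() && x.size() == y.size());
  double g = 0.0;
  for (std::size_t i = begin; i < end; ++i) {
    g += (w * x[i] - y[i]) * x[i];
  }
  return g / static_cast<double>(end - begin);
}

// Accumulates the gradient over iter_size mini-batches of batch_size samples
// each and divides by iter_size (assumed normalization, see the lead-in).
double accumulated_grad(double w, const std::vector<double>& x,
                        const std::vector<double>& y,
                        std::size_t batch_size, std::size_t iter_size) {
  double g = 0.0;
  for (std::size_t i = 0; i < iter_size; ++i) {
    g += grad_batch(w, x, y, i * batch_size, (i + 1) * batch_size);
  }
  return g / static_cast<double>(iter_size);
}
```

With 8 samples, calling accumulated_grad with batch_size = 2 and iter_size = 4 gives the same value as grad_batch over all 8 samples at once, which is the equivalence the paragraph above relies on.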

ForwardBackward() is implemented as follows. It runs Forward and then Backward, and records the loss during the forward pass:

Dtype ForwardBackward() {
    Dtype loss;
    Forward(&loss);
    Backward();
    return loss;
  }

Forward(&loss) in turn calls ForwardFromTo(0, layers_.size() - 1):

template <typename Dtype>
const vector<Blob<Dtype>*>& Net<Dtype>::Forward(Dtype* loss) {
  if (loss != NULL) {
    *loss = ForwardFromTo(0, layers_.size() - 1);
  } else {
    ForwardFromTo(0, layers_.size() - 1);
  }
  return net_output_blobs_;
}

ForwardFromTo(0, layers_.size() - 1) walks over every layer and calls its Forward() function. bottom_vecs_ and top_vecs_ have type vector<vector<Blob<Dtype>*> >, and each layer receives a vector<Blob<Dtype>*>, because a layer may have several inputs or outputs:

template <typename Dtype>
Dtype Net<Dtype>::ForwardFromTo(int start, int end) {
  Dtype loss = 0;
  for (int i = start; i <= end; ++i) {
    Dtype layer_loss = layers_[i]->Forward(bottom_vecs_[i], top_vecs_[i]);
    loss += layer_loss;
  }
  return loss;
}

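So the solver and the net both exist to serve the layers; the core functionality is implemented in the layers. We will use the convolution layer (conv_layer.cpp) as the example when looking at Forward.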

LayerFactory: the factory pattern

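To understand layers well, we first read the objects related to them. The base class of every layer is Layer. Because the concrete classes are all templates, the compiler does not instantiate a template class unless some code uses it statically. Our layers, however, are created dynamically from the configuration file train.prototxt, so the class would not be found. To avoid this, every class definition is followed by an explicit instantiation, which guarantees the class exists when it is needed. This is done with a macro: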

INSTANTIATE_CLASS(ConvolutionLayer);

The macro is defined as:

#define INSTANTIATE_CLASS(classname) \
  char gInstantiationGuard##classname; \
  template class classname<float>; \
  template class classname<double>

In effect it explicitly instantiates ConvolutionLayer<float> and ConvolutionLayer<double>:

char gInstantiationGuardConvolutionLayer; 
  template class ConvolutionLayer<float>; 
  template class ConvolutionLayer<double>;

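Beyond that, because there are so many layers, caffe implements a factory (layer_factory.cpp) to manage them uniformly. Every Layer is registered in a map whose key is the layer type name and whose value is the function that creates that Layer, so the right Layer object can be instantiated from its type at run time. Two macros are provided: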

#define REGISTER_LAYER_CREATOR(type, creator) \
  static LayerRegisterer<float> g_creator_f_##type(#type, creator<float>); \
  static LayerRegisterer<double> g_creator_d_##type(#type, creator<double>)


#define REGISTER_LAYER_CLASS(type) \
  template <typename Dtype> \
  shared_ptr<Layer<Dtype> > Creator_##type##Layer(const LayerParameter& param) \
  { \
    return shared_ptr<Layer<Dtype> >(new type##Layer<Dtype>(param)); \
  } \
  REGISTER_LAYER_CREATOR(type, Creator_##type##Layer)

Take the first macro. It receives two arguments: the type (Convolution) and the creator function. For example, layer_factory.cpp contains the following code (simplified):

template <typename Dtype>
shared_ptr<Layer<Dtype> > GetConvolutionLayer(const LayerParameter& param) {
 // simplified...
 return shared_ptr<Layer<Dtype> >(new ConvolutionLayer<Dtype>(param));
 // simplified...
}

REGISTER_LAYER_CREATOR(Convolution, GetConvolutionLayer);

The macro then expands to:

static LayerRegisterer<float> g_creator_f_Convolution("Convolution", GetConvolutionLayer<float>);
static LayerRegisterer<double> g_creator_d_Convolution("Convolution", GetConvolutionLayer<double>);

So let us see what the LayerRegisterer class does. Its constructor contains a single call:

LayerRegistry<Dtype>::AddCreator(type, creator);

It calls the static function LayerRegistry<Dtype>::AddCreator. Looking further:

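template <typename Dtype>
class LayerRegistry {
 public:
  typedef shared_ptr<Layer<Dtype> > (*Creator)(const LayerParameter&);
  typedef std::map<string, Creator> CreatorRegistry;

  // Function-local static: the map is created on first use.
  static CreatorRegistry& Registry() {
    static CreatorRegistry* g_registry_ = new CreatorRegistry();
    return *g_registry_;
  }

  // Adds a creator; registering the same type twice is a fatal error.
  static void AddCreator(const string& type, Creator creator) {
    CreatorRegistry& registry = Registry();
    CHECK_EQ(registry.count(type), 0) << "Layer type " << type << " already registered.";
    registry[type] = creator;
  }
};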

As you can see, it maintains a singleton map object, g_registry_, which stores each type together with its creator function.

As for the second macro, a call such as REGISTER_LAYER_CLASS(Convolution) expands to:

template <typename Dtype>                                                    
  shared_ptr<Layer<Dtype> > Creator_ConvolutionLayer(const LayerParameter& param) 
  {                                                                            
    return shared_ptr<Layer<Dtype> >(new ConvolutionLayer<Dtype>(param));           
  }                                                                            
  REGISTER_LAYER_CREATOR(Convolution, Creator_ConvolutionLayer)

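In other words, the class needs no special construction and can simply use this default creator (Creator_ConvolutionLayer). Some special cases, such as Convolution, need extra handling and therefore have a hand-written creator (GetConvolutionLayer); most layers, though, can be created through the default function.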
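
The registration mechanism can be reproduced in a few dozen lines. The sketch below is a toy version under simplifying assumptions: no templates, no LayerParameter, and every name in it (ToyLayer, ToyRegistry, ToyRegisterer, REGISTER_TOY_LAYER) is made up for illustration rather than taken from caffe.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Toy stand-ins for the Layer hierarchy (hypothetical, heavily simplified).
struct ToyLayer {
  virtual ~ToyLayer() {}
  virtual std::string type() const = 0;
};
struct ToyReLULayer : ToyLayer {
  std::string type() const override { return "ReLU"; }
};

class ToyRegistry {
 public:
  typedef std::shared_ptr<ToyLayer> (*Creator)();

  // Function-local static: constructed on first use, so it is safe to call
  // from the constructors of other static objects.
  static std::map<std::string, Creator>& Registry() {
    static std::map<std::string, Creator> registry;
    return registry;
  }
  static void AddCreator(const std::string& type, Creator creator) {
    assert(Registry().count(type) == 0);  // each type registers only once
    Registry()[type] = creator;
  }
  // Looks up the creator by type name; returns an empty pointer if unknown.
  static std::shared_ptr<ToyLayer> Create(const std::string& type) {
    std::map<std::string, Creator>::iterator it = Registry().find(type);
    if (it == Registry().end()) return std::shared_ptr<ToyLayer>();
    return it->second();
  }
};

// Registers a creator as a side effect of constructing a static object.
struct ToyRegisterer {
  ToyRegisterer(const std::string& type, ToyRegistry::Creator creator) {
    ToyRegistry::AddCreator(type, creator);
  }
};

#define REGISTER_TOY_LAYER(type)                                 \
  std::shared_ptr<ToyLayer> Creator_##type##Layer() {            \
    return std::shared_ptr<ToyLayer>(new Toy##type##Layer());    \
  }                                                              \
  static ToyRegisterer g_creator_##type(#type, Creator_##type##Layer)

REGISTER_TOY_LAYER(ReLU);
```

After this, ToyRegistry::Create("ReLU") returns a ToyReLULayer even though no calling code names that class, which is the property that lets a text configuration file choose the layer type.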

Data: Blob

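Blob is the basic object for storing and operating on data in caffe; it also synchronizes data between CPU and GPU.
Underneath, a Blob stores its data in a flat array in row-major order.
A Blob mainly holds two pieces of data, data_ and diff_: the values and the gradients.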

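A blob is a four-dimensional array. From the outermost to the innermost dimension, the axes are (num_, channels_, height_, width_). For image data these are the number of images, the number of color channels, the height and the width. For example, 10 three-channel color images that are 512 wide and 256 high give (10, 3, 256, 512):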

template <typename Dtype>
 class Blob {
 public:
  inline int num() const { return LegacyShape(0); }
  inline int channels() const { return LegacyShape(1); }
  inline int height() const { return LegacyShape(2); }
  inline int width() const { return LegacyShape(3); }
  inline const shared_ptr<SyncedMemory>& data() const {
    return data_;
  }
  inline const shared_ptr<SyncedMemory>& diff() const {
    return diff_;
  }
  void Update() {
      caffe_axpy<Dtype>(count_, Dtype(-1), static_cast<const Dtype*>(diff_->cpu_data()), static_cast<Dtype*>(data_->mutable_cpu_data()));
  };                   // apply the update: subtract the computed gradient from the data
  void FromProto(const BlobProto& proto, bool reshape = true);   // deserialize: load a previously stored blob from disk
  void ToProto(BlobProto* proto, bool write_diff = false) const; // serialize the blob so it can be stored


 protected:
  shared_ptr<SyncedMemory> data_;
  shared_ptr<SyncedMemory> diff_;
  shared_ptr<SyncedMemory> shape_data_;
  vector<int> shape_;
  int count_;
  int capacity_;

  DISABLE_COPY_AND_ASSIGN(Blob);
};  // class Blob
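
Row-major storage means the width index varies fastest. The following sketch shows how a 4D index (n, c, h, w) maps to a position in the flat array; blob_offset is a free function written for this note, although as far as I can tell the arithmetic is the same as what Blob's own offset helper does.

```cpp
#include <cassert>
#include <vector>

// Flat index of element (n, c, h, w) in a row-major N x C x H x W array.
// shape holds {num, channels, height, width}; w varies fastest.
int blob_offset(const std::vector<int>& shape, int n, int c, int h, int w) {
  assert(shape.size() == 4);
  assert(n >= 0 && n < shape[0] && c >= 0 && c < shape[1]);
  assert(h >= 0 && h < shape[2] && w >= 0 && w < shape[3]);
  return ((n * shape[1] + c) * shape[2] + h) * shape[3] + w;
}
```

For the (10, 3, 256, 512) example, moving one step along width advances the index by 1, one step along height by 512, one channel by 256 * 512, and one image by 3 * 256 * 512.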

Back to Layer

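Now for the Forward method of the Layer base class. Note that it is not a virtual method, which means subclasses are not expected to change it; every Layer can be assumed to use this same Forward function. Let us look at the steps: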

template <typename Dtype>
class Layer {
public:
  explicit Layer(const LayerParameter& param) : layer_param_(param) {
    phase_ = param.phase();
    if (layer_param_.blobs_size() > 0) {
      blobs_.resize(layer_param_.blobs_size());
      for (int i = 0; i < layer_param_.blobs_size(); ++i) {
        blobs_[i].reset(new Blob<Dtype>());
        blobs_[i]->FromProto(layer_param_.blobs(i));
      }
    }
  }
  virtual ~Layer() {}

  void SetUp(const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top) {
    CheckBlobCounts(bottom, top);
    LayerSetUp(bottom, top);
    Reshape(bottom, top);
    SetLossWeights(top);
  }

  /**
   * @brief Does layer-specific setup: your layer should implement this function
   *        as well as Reshape.
   *
   * @param bottom
   *     the preshaped input blobs, whose data fields store the input data for
   *     this layer
   * @param top
   *     the allocated but unshaped output blobs
   *
   * This method should do one-time layer specific setup. This includes reading
   * and processing relevant parameters from the <code>layer_param_</code>.
   * Setting up the shapes of top blobs and internal buffers should be done in
   * <code>Reshape</code>, which will be called before the forward pass to
   * adjust the top blob sizes.
   */
  virtual void LayerSetUp(const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top) {}

  /**
   * @brief Adjust the shapes of top blobs and internal buffers to accommodate
   *        the shapes of the bottom blobs.
   *
   * @param bottom the input blobs, with the requested input shapes
   * @param top the top blobs, which should be reshaped as needed
   *
   * This method should reshape top blobs as needed according to the shapes
   * of the bottom (input) blobs, as well as reshaping any internal buffers
   * and making any other necessary adjustments so that the layer can
   * accommodate the bottom blobs.
   */
  virtual void Reshape(const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top) = 0;

  /**
   * @brief Given the bottom blobs, compute the top blobs and the loss.
   *
   * @param bottom
   *     the input blobs, whose data fields store the input data for this layer
   * @param top
   *     the preshaped output blobs, whose data fields will store this layers'
   *     outputs
   * \return The total loss from the layer.
   *
   * The Forward wrapper calls the relevant device wrapper function
   * (Forward_cpu or Forward_gpu) to compute the top blob values given the
   * bottom blobs. If the layer has any non-zero loss_weights, the wrapper
   * then computes and returns the loss.
   *
   * Your layer should implement Forward_cpu and (optionally) Forward_gpu.
   */
  inline Dtype Forward(const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top);

  /**
   * @brief Given the top blob error gradients, compute the bottom blob error
   *        gradients.
   *
   * @param top
   *     the output blobs, whose diff fields store the gradient of the error
   *     with respect to themselves
   * @param propagate_down
   *     a vector with equal length to bottom, with each index indicating
   *     whether to propagate the error gradients down to the bottom blob at
   *     the corresponding index
   * @param bottom
   *     the input blobs, whose diff fields will store the gradient of the error
   *     with respect to themselves after Backward is run
   *
   * The Backward wrapper calls the relevant device wrapper function
   * (Backward_cpu or Backward_gpu) to compute the bottom blob diffs given the
   * top blob diffs.
   *
   * Your layer should implement Backward_cpu and (optionally) Backward_gpu.
   */
  inline void Backward(const vector<Blob<Dtype>*>& top, const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom);

  vector<shared_ptr<Blob<Dtype> > >& blobs() {
    return blobs_;
  }
  const LayerParameter& layer_param() const { return layer_param_; }

 protected:
  /** The protobuf that stores the layer parameters */
  LayerParameter layer_param_;  // layer settings such as kernel size and stride
  Phase phase_;
  /** The vector that stores the learnable parameters as a set of blobs. */
  vector<shared_ptr<Blob<Dtype> > > blobs_;  // filter parameters
  vector<bool> param_propagate_down_;
  vector<Dtype> loss_;

  virtual void Forward_cpu(const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top) = 0;

  virtual void Forward_gpu(const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top) {
    // LOG(WARNING) << "Using CPU code as backup.";
    return Forward_cpu(bottom, top);
  }

  virtual void Backward_cpu(const vector<Blob<Dtype>*>& top, const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom) = 0;

  virtual void Backward_gpu(const vector<Blob<Dtype>*>& top, const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom) {
    // LOG(WARNING) << "Using CPU code as backup.";
    Backward_cpu(top, propagate_down, bottom);
  }
  
 private:
  DISABLE_COPY_AND_ASSIGN(Layer);
};



template <typename Dtype>
inline Dtype Layer<Dtype>::Forward(const vector<Blob<Dtype>*>& bottom,
    const vector<Blob<Dtype>*>& top) {
  Dtype loss = 0;
  Reshape(bottom, top);
  switch (Caffe::mode()) {
  case Caffe::CPU:
    Forward_cpu(bottom, top);
    for (int top_id = 0; top_id < top.size(); ++top_id) {
      if (!this->loss(top_id)) { continue; }
      const int count = top[top_id]->count();
      const Dtype* data = top[top_id]->cpu_data();
      const Dtype* loss_weights = top[top_id]->cpu_diff();
      loss += caffe_cpu_dot(count, data, loss_weights);
    }
    break;
  case Caffe::GPU:
    Forward_gpu(bottom, top);
#ifndef CPU_ONLY
    for (int top_id = 0; top_id < top.size(); ++top_id) {
      if (!this->loss(top_id)) { continue; }
      const int count = top[top_id]->count();
      const Dtype* data = top[top_id]->gpu_data();
      const Dtype* loss_weights = top[top_id]->gpu_diff();
      Dtype blob_loss = 0;
      caffe_gpu_dot(count, data, loss_weights, &blob_loss);
      loss += blob_loss;
    }
#endif
    break;
  default:
    LOG(FATAL) << "Unknown caffe mode.";
  }
  return loss;
}

The important functions in Layer are SetUp, LayerSetUp, Reshape, Forward, Backward, Forward_cpu, Forward_gpu, Backward_cpu and Backward_gpu.

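Reshape, Forward_cpu and Backward_cpu are pure virtual functions; subclasses must implement them.
LayerSetUp, Forward_gpu and Backward_gpu are virtual functions that can be overridden as needed.
SetUp, Forward and Backward are ordinary functions and should not be overridden.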
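
This split between a fixed non-virtual wrapper and virtual hooks is easier to see in miniature. The sketch below is a CPU-only toy in which blobs are replaced by std::vector<double> and the loss weights are passed in explicitly; MiniLayer and SquareLayer are hypothetical names, not caffe classes.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-in for Layer: Forward is non-virtual and
// fixes the sequence of steps; subclasses only fill in the virtual hooks.
class MiniLayer {
 public:
  virtual ~MiniLayer() {}

  // Non-virtual wrapper: reshape, compute, then weight the output into a loss.
  double Forward(const std::vector<double>& bottom, std::vector<double>* top,
                 const std::vector<double>& loss_weights) {
    Reshape(bottom, top);
    Forward_cpu(bottom, top);
    assert(loss_weights.size() == top->size());
    double loss = 0.0;
    for (std::size_t i = 0; i < top->size(); ++i) {
      loss += (*top)[i] * loss_weights[i];  // dot(top data, loss weights)
    }
    return loss;
  }

 protected:
  virtual void Reshape(const std::vector<double>& bottom,
                       std::vector<double>* top) = 0;
  virtual void Forward_cpu(const std::vector<double>& bottom,
                           std::vector<double>* top) = 0;
};

// Example subclass: squares every input element.
class SquareLayer : public MiniLayer {
 protected:
  void Reshape(const std::vector<double>& bottom,
               std::vector<double>* top) override {
    top->resize(bottom.size());
  }
  void Forward_cpu(const std::vector<double>& bottom,
                   std::vector<double>* top) override {
    for (std::size_t i = 0; i < bottom.size(); ++i) {
      (*top)[i] = bottom[i] * bottom[i];
    }
  }
};
```

A caller only ever invokes Forward; the subclass cannot skip the reshape step or change how the loss is accumulated, which is the guarantee the base class is after.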

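Because there are several kinds of convolution, a BaseConvolutionLayer class sits in between as the base class of all convolution layers. It implements the following functions and, by implementing Reshape, turns it from a pure virtual function into an ordinary virtual one: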

LayerSetUp(const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top)
Reshape(const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top)
forward_cpu_gemm(const Dtype* input, const Dtype* weights, Dtype* output, bool skip_im2col)
forward_cpu_bias(Dtype* output, const Dtype* bias)
backward_cpu_gemm(const Dtype* output, const Dtype* weights, Dtype* input)
weight_cpu_gemm(const Dtype* input, const Dtype* output, Dtype* weights)
backward_cpu_bias(Dtype* bias, const Dtype* input)
forward_gpu_gemm(const Dtype* input, const Dtype* weights, Dtype* output, bool skip_im2col)
forward_gpu_bias(Dtype* output, const Dtype* bias)
backward_gpu_gemm(const Dtype* output, const Dtype* weights, Dtype* input)
weight_gpu_gemm(const Dtype* input, const Dtype* output, Dtype* weights)
backward_gpu_bias(Dtype* bias, const Dtype* input)

ConvolutionLayer inherits from BaseConvolutionLayer and implements:

Forward_cpu(const vector<Blob<Dtype>*>& bottom, const vector<Blob<Dtype>*>& top)
Backward_cpu(const vector<Blob<Dtype>*>& top, const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom)

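Layer's Forward function first calls Reshape, which at this point resolves to BaseConvolutionLayer::Reshape. Caffe organizes its data as Blobs, and once the input (bottom) size and the convolution parameters are known, the shape of the output (top) Blob can be computed: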

// Shape the tops.
bottom_shape_ = &bottom[0]->shape();
compute_output_shape();
vector<int> top_shape(bottom[0]->shape().begin(),
  bottom[0]->shape().begin() + channel_axis_);
top_shape.push_back(num_output_);
for (int i = 0; i < num_spatial_axes_; ++i) {
  top_shape.push_back(output_shape_[i]);
}
for (int top_id = 0; top_id < top.size(); ++top_id) {
  top[top_id]->Reshape(top_shape);
}
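
The listing leaves out what compute_output_shape() actually calculates. A minimal sketch of the per-axis arithmetic, assuming the usual convolution formula with padding, stride and dilation, looks like this; conv_output_dim and its parameter names are illustrative rather than caffe identifiers.

```cpp
#include <cassert>

// Output extent along one spatial axis of a convolution.
// kernel_extent is the span the (possibly dilated) kernel covers.
int conv_output_dim(int input, int kernel, int pad, int stride, int dilation) {
  assert(input > 0 && kernel > 0 && pad >= 0 && stride > 0 && dilation > 0);
  const int kernel_extent = dilation * (kernel - 1) + 1;
  assert(input + 2 * pad >= kernel_extent);
  return (input + 2 * pad - kernel_extent) / stride + 1;
}
```

For instance, a 3x3 kernel with pad 1 and stride 1 keeps a 224-pixel axis at 224, while an 11x11 kernel with stride 4 and no padding maps 227 to 55.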

For each output top[i] it calls the member function Reshape. Blob's Reshape looks like this:

template <typename Dtype>
void Blob<Dtype>::Reshape(const vector<int>& shape) {
  CHECK_LE(shape.size(), kMaxBlobAxes);
  count_ = 1;
  shape_.resize(shape.size());
  if (!shape_data_ || shape_data_->size() < shape.size() * sizeof(int)) {
    shape_data_.reset(new SyncedMemory(shape.size() * sizeof(int)));
  }
  int* shape_data = static_cast<int*>(shape_data_->mutable_cpu_data());
  for (int i = 0; i < shape.size(); ++i) {
    CHECK_GE(shape[i], 0);
    if (count_ != 0) {
      CHECK_LE(shape[i], INT_MAX / count_) << "blob size exceeds INT_MAX";
    }
    count_ *= shape[i];
    shape_[i] = shape[i];  // copy into the member
    shape_data[i] = shape[i];
  }
  if (count_ > capacity_) {  // not enough memory
    capacity_ = count_;
    data_.reset(new SyncedMemory(capacity_ * sizeof(Dtype)));  // reallocate
    diff_.reset(new SyncedMemory(capacity_ * sizeof(Dtype)));
  }
}

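In effect it copies the given shape into the Blob's internal shape_ member and checks whether the allocated memory is large enough, allocating new memory if it is not.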
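
The grow-only behaviour is worth seeing in isolation: shrinking a blob changes count_ but keeps the buffer, and only a larger shape triggers a new allocation. The class below is a hypothetical miniature that uses a std::vector<float> in place of SyncedMemory and counts allocations so the effect is observable.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical miniature of Blob's shape/capacity bookkeeping.
class MiniBlob {
 public:
  MiniBlob() : count_(0), capacity_(0), allocations_(0) {}

  void Reshape(const std::vector<int>& shape) {
    std::size_t count = 1;
    for (std::size_t i = 0; i < shape.size(); ++i) {
      assert(shape[i] >= 0);
      count *= static_cast<std::size_t>(shape[i]);
    }
    shape_ = shape;
    count_ = count;
    if (count_ > capacity_) {  // grow only; never shrink the buffer
      capacity_ = count_;
      data_.assign(capacity_, 0.0f);
      ++allocations_;
    }
  }

  std::size_t count() const { return count_; }
  std::size_t capacity() const { return capacity_; }
  int allocations() const { return allocations_; }

 private:
  std::vector<int> shape_;
  std::vector<float> data_;
  std::size_t count_;
  std::size_t capacity_;
  int allocations_;
};
```

Reshaping from (2, 3, 4, 5) down to (1, 3, 4, 5) leaves the capacity at 120 elements; going up to (4, 3, 4, 5) allocates a second time. Since Forward calls Reshape on every pass, avoiding reallocation in the common unchanged-shape case matters.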

For the forward pass we analyze the CPU case. After Reshape comes Forward_cpu, which now resolves to ConvolutionLayer::Forward_cpu:

template <typename Dtype>
void ConvolutionLayer<Dtype>::Forward_cpu(const vector<Blob<Dtype>*>& bottom,
      const vector<Blob<Dtype>*>& top) {
  const Dtype* weight = this->blobs_[0]->cpu_data();
  for (int i = 0; i < bottom.size(); ++i) {
    const Dtype* bottom_data = bottom[i]->cpu_data();
    Dtype* top_data = top[i]->mutable_cpu_data();
    for (int n = 0; n < this->num_; ++n) {
      this->forward_cpu_gemm(bottom_data + n * this->bottom_dim_, weight,
          top_data + n * this->top_dim_);
      if (this->bias_term_) {
        const Dtype* bias = this->blobs_[1]->cpu_data();
        this->forward_cpu_bias(top_data + n * this->top_dim_, bias);
      }
    }
  }
}

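For each bottom/top pair, the code performs num_ (batch_size) matrix multiplications (forward_cpu_gemm), multiplying bottom_data by weight and storing the result in top_data. mutable_cpu_data signals that data will be written to this address. The matrix multiplication itself: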

template <typename Dtype>
void BaseConvolutionLayer<Dtype>::forward_cpu_gemm(const Dtype* input,
    const Dtype* weights, Dtype* output, bool skip_im2col) {
  const Dtype* col_buff = input;
  if (!is_1x1_) {
    if (!skip_im2col) {
      conv_im2col_cpu(input, col_buffer_.mutable_cpu_data());
    }
    col_buff = col_buffer_.cpu_data();
  }
  for (int g = 0; g < group_; ++g) {
    caffe_cpu_gemm<Dtype>(CblasNoTrans, CblasNoTrans, conv_out_channels_ /
        group_, conv_out_spatial_dim_, kernel_dim_,
        (Dtype)1., weights + weight_offset_ * g, col_buff + col_offset_ * g,
        (Dtype)0., output + output_offset_ * g);
  }
}

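Note the conv_im2col_cpu function. Without this conversion we would have to loop over many small matrix multiplications. The function flattens every patch (k x k x C) into a vector and stacks the patches together, so that a single matrix multiplication produces all the results. caffe_cpu_gemm wraps the cblas matrix multiplication output = weights * col_buff.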
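
To show the idea end to end, here is a small self-contained sketch of im2col followed by one matrix multiplication. It assumes a single image, a single group, a square kernel and no dilation, and it uses naive loops in place of cblas; the functions im2col and gemm are written for this note and are not caffe's implementations.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Unrolls a C x H x W image into a (C*k*k) x (out_h*out_w) row-major matrix:
// each column holds one k x k x C patch. Padding positions stay zero.
std::vector<float> im2col(const std::vector<float>& img, int C, int H, int W,
                          int k, int pad, int stride) {
  assert(img.size() == static_cast<std::size_t>(C) * H * W);
  const int out_h = (H + 2 * pad - k) / stride + 1;
  const int out_w = (W + 2 * pad - k) / stride + 1;
  std::vector<float> col(
      static_cast<std::size_t>(C) * k * k * out_h * out_w, 0.0f);
  for (int c = 0; c < C; ++c) {
    for (int kh = 0; kh < k; ++kh) {
      for (int kw = 0; kw < k; ++kw) {
        const int row = (c * k + kh) * k + kw;
        for (int oh = 0; oh < out_h; ++oh) {
          for (int ow = 0; ow < out_w; ++ow) {
            const int h = oh * stride - pad + kh;
            const int w = ow * stride - pad + kw;
            if (h < 0 || h >= H || w < 0 || w >= W) continue;
            col[(static_cast<std::size_t>(row) * out_h + oh) * out_w + ow] =
                img[(static_cast<std::size_t>(c) * H + h) * W + w];
          }
        }
      }
    }
  }
  return col;
}

// Plain row-major GEMM: out (M x N) = a (M x K) * b (K x N).
std::vector<float> gemm(const std::vector<float>& a,
                        const std::vector<float>& b, int M, int N, int K) {
  assert(a.size() == static_cast<std::size_t>(M) * K);
  assert(b.size() == static_cast<std::size_t>(K) * N);
  std::vector<float> out(static_cast<std::size_t>(M) * N, 0.0f);
  for (int m = 0; m < M; ++m) {
    for (int kk = 0; kk < K; ++kk) {
      for (int n = 0; n < N; ++n) {
        out[m * N + n] += a[m * K + kk] * b[kk * N + n];
      }
    }
  }
  return out;
}
```

With weights stored as an (output channels) x (C*k*k) matrix, gemm(weights, im2col(...), out_channels, out_h * out_w, C * k * k) yields every output pixel of every output channel in one call, at the cost of the extra memory the column buffer takes.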

[Figure: im2col]

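Back in the Forward function: after Forward_cpu, it iterates over all top blobs, checks whether each one carries a loss weight and, if so, computes the loss from cpu_diff(), which holds the loss weights: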

inline Dtype Layer<Dtype>::Forward(const vector<Blob<Dtype>*>& bottom,
    const vector<Blob<Dtype>*>& top) {
  Dtype loss = 0;
  Reshape(bottom, top);
  Forward_cpu(bottom, top);
  for (int top_id = 0; top_id < top.size(); ++top_id) {
    if (!this->loss(top_id)) { continue; }
    const int count = top[top_id]->count();
    const Dtype* data = top[top_id]->cpu_data();
    const Dtype* loss_weights = top[top_id]->cpu_diff();
    loss += caffe_cpu_dot(count, data, loss_weights);
  }
  return loss;
}

That completes Forward. Next comes Backward; here is Layer's Backward function:

template <typename Dtype>
inline void Layer<Dtype>::Backward(const vector<Blob<Dtype>*>& top,
    const vector<bool>& propagate_down,
    const vector<Blob<Dtype>*>& bottom) {
  switch (Caffe::mode()) {
  case Caffe::CPU:
    Backward_cpu(top, propagate_down, bottom);
    break;
  case Caffe::GPU:
    Backward_gpu(top, propagate_down, bottom);
    break;
  default:
    LOG(FATAL) << "Unknown caffe mode.";
  }
}

It simply calls Backward_cpu. Here is ConvolutionLayer's Backward_cpu:

template <typename Dtype>
void ConvolutionLayer<Dtype>::Backward_cpu(const vector<Blob<Dtype>*>& top,
      const vector<bool>& propagate_down, const vector<Blob<Dtype>*>& bottom) {
  const Dtype* weight = this->blobs_[0]->cpu_data();
  Dtype* weight_diff = this->blobs_[0]->mutable_cpu_diff();
  for (int i = 0; i < top.size(); ++i) {
    const Dtype* top_diff = top[i]->cpu_diff();
    const Dtype* bottom_data = bottom[i]->cpu_data();
    Dtype* bottom_diff = bottom[i]->mutable_cpu_diff();
    // Bias gradient, if necessary.
    if (this->bias_term_ && this->param_propagate_down_[1]) {
      Dtype* bias_diff = this->blobs_[1]->mutable_cpu_diff();
      for (int n = 0; n < this->num_; ++n) {
        this->backward_cpu_bias(bias_diff, top_diff + n * this->top_dim_);
      }
    }
    if (this->param_propagate_down_[0] || propagate_down[i]) {
      for (int n = 0; n < this->num_; ++n) {
        // gradient w.r.t. weight. Note that we will accumulate diffs.
        if (this->param_propagate_down_[0]) {
          this->weight_cpu_gemm(bottom_data + n * this->bottom_dim_,
              top_diff + n * this->top_dim_, weight_diff);
        }
        // gradient w.r.t. bottom data, if necessary.
        if (propagate_down[i]) {
          this->backward_cpu_gemm(top_diff + n * this->top_dim_, weight,
              bottom_diff + n * this->bottom_dim_);
        }
      }
    }
  }
}

From top_diff it computes this layer's weight_diff (weight_cpu_gemm) and bottom_diff (backward_cpu_gemm). bottom_diff is computed so that the layer below can in turn compute its own weight_diff.

That completes Backward. It has computed the gradient of each layer's weights (weight_diff) and the gradient of each layer's blobs (bottom_diff).

Back in the solver's Step function, the next call is ApplyUpdate(), which is where the parameters are actually updated. ApplyUpdate() ends up calling Net::Update():

template <typename Dtype>
void Net<Dtype>::Update() {
  for (int i = 0; i < learnable_params_.size(); ++i) {
    learnable_params_[i]->Update();
  }
}

learnable_params_ holds the trainable parameters of each layer, that is, each layer's weight Blobs. We updated the diff values of these Blobs earlier, so let us see what Blob::Update() does:

template <typename Dtype>
void Blob<Dtype>::Update() {
  // We will perform update based on where the data is located.
  switch (data_->head()) {
  case SyncedMemory::HEAD_AT_CPU:
    // perform computation on CPU
    caffe_axpy<Dtype>(count_, Dtype(-1),
        static_cast<const Dtype*>(diff_->cpu_data()),
        static_cast<Dtype*>(data_->mutable_cpu_data()));
    break;
    //...
  }
}

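It computes data_ = data_ - diff_. caffe_axpy wraps the cblas axpy routine, which adds a scaled vector to another vector (Y = alpha * X + Y); since the coefficient passed in is Dtype(-1), diff_ is subtracted from data_. At this point every layer's weights have been updated and one iteration is complete. The process is then repeated many times until training yields a good set of weights.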
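
The axpy step is small enough to write out by hand. In the sketch below, axpy and update are plain-loop stand-ins written for this note (they are not caffe_axpy or Blob::Update themselves), operating on std::vector<double> instead of blob memory.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// y <- alpha * x + y, the operation BLAS calls axpy.
void axpy(double alpha, const std::vector<double>& x, std::vector<double>* y) {
  assert(x.size() == y->size());
  for (std::size_t i = 0; i < x.size(); ++i) {
    (*y)[i] += alpha * x[i];
  }
}

// data <- data - diff, expressed as axpy with alpha = -1.
void update(const std::vector<double>& diff, std::vector<double>* data) {
  axpy(-1.0, diff, data);
}
```

Note that nothing here multiplies by a learning rate: the update only makes sense if diff already contains the final step to subtract, which is why the solver has to prepare diff before Update runs.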

Test phase

The test phase needs no solver; running Forward on a Net directly gives the result:

Net<float> caffe_net(FLAGS_model, caffe::TEST, FLAGS_level, &stages);
const vector<Blob<float>*>& result = caffe_net.Forward(&iter_loss);

pycaffe

First there is a _caffe.cpp file, which compiles the whole caffe framework into _caffe.so; pycaffe.py acts as a wrapper that adds some Python interfaces. pycaffe imports the objects from _caffe.so and uses them as Python objects:

from ._caffe import Net, SGDSolver, NesterovSolver, AdaGradSolver, \
        RMSPropSolver, AdaDeltaSolver, AdamSolver, NCCL, Timer

They can be imported and used directly because _caffe.cpp exports them with BOOST_PYTHON_MODULE:

BOOST_PYTHON_MODULE(_caffe) {
...
}

Here is how a class is exported:

#include<string>
#include<boost/python.hpp>

using namespace std;
using namespace boost::python;

struct World {
    void set(string msg) { this->msg = msg; }
    string greet() { return msg; }

    string msg;
};

BOOST_PYTHON_MODULE(hello) // name of the exported module
{
    class_<World>("World")
        .def("greet", &World::greet)
        .def("set", &World::set);
}

And here is how the exported class is called from Python:

import hello
planet = hello.World()  # call the default constructor to create an object
planet.set("howdy")     # call a method on the object
print(planet.greet())   # call a method on the object

If you do not want to export any constructor, use no_init:

class_<Abstract>("Abstract",no_init)

Finally, the caffe directory provides an __init__.py file, which turns the whole caffe directory into a Python package:

from .pycaffe import Net, SGDSolver, NesterovSolver, AdaGradSolver, RMSPropSolver, AdaDeltaSolver, AdamSolver, NCCL, Timer
from ._caffe import init_log, log, set_mode_cpu, set_mode_gpu, set_device, Layer, get_solver, layer_type_list, set_random_seed, solver_count, set_solver_count, solver_rank, set_solver_rank, set_multiprocess, has_nccl
from ._caffe import __version__
from .proto.caffe_pb2 import TRAIN, TEST
from .classifier import Classifier
from .detector import Detector
from . import io
from .net_spec import layers, params, NetSpec, to_proto