Building a Knock-off Caffe from Scratch, Part 9: BlobFlow

Date: 2019-11-06


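They say Google released TensorFlow. So what should Caffe be called?

        -- BlobFlow

Propagation data structures in the neural network era

My own code

When I first hand-wrote a neural network, the Flow structure looked like this: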

struct Data
{
    vector<double> feature;
    int y;
    Data(vector<double> feature,int y):feature(feature),y(y) {}
};
vector<double> u_i,v_i,u_j,v_j;

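A very crude structure; its main job is to use vector to hold each layer's forward-propagation values.

Word2Vec

Later I read the Word2Vec source code by Mikolov at Google, whose Flow structure looks like this: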

real *neu1 = (real *)calloc(doc->layer1_size, sizeof(real));

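I grumbled at the time that this is even weaker than mine: a vector at least offers the basic STL facilities.

(Note: the Word2Vec source is known for its CPU multithreading and fast memory operations; it is crude but fast.)

Theano

Later still I learned Theano, whose Flow structure looks like this: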

input=theano.tensor.matrix('x')
class DataLayer(object):
    def __init__(self,input,batch_size,size):
        self.batch_size=batch_size
        self.size=size
        self.input=input
        self.params=None
    def get_output(self):
        output=self.input
        if type(self.size) is tuple: #Mode: 2D
            output=output.reshape((self.batch_size,self.size[2],self.size[1],self.size[0]))
        else: #Mode: 1D
            output=output.reshape((self.batch_size,self.size))
        return output

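Bengio's group borrowed the concept of the tensor from physics and built Theano's Tensor system on it.

Dim 0 is called a scalar, Dim 1 a vector, Dim 2 a matrix; beyond Dim 2 there is no special name, and Dim can grow without limit.

The arrival of Tensor neatly sidesteps the problem that machine learning researchers are often not strong programmers (see the crude structures in the previous sections).

At the same time, with mini-batch, conv, and similar methods used at scale in deep learning, the Flow structure clearly has to become multi-dimensional.

Because we are working in a multi-dimensional space and frequently need to switch between dimensions, reshape naturally becomes a core Tensor function.

(The reshape concept probably first came from numpy, the Python scientific computing library; Theano's Tensor system is to a large extent a rewrite of numpy.)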

TensorFlow

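Later, Google replaced DistBelief, its first-generation deep learning framework that Andrew Ng helped build, with a second generation called TensorFlow.

According to the official explanation, the name TensorFlow (2015) comes from the fact that what mainly flows through the system is Tensors.

Comparing the publication dates of DistBelief (2011) and Theano (NIPS 2012), we can guess that DistBelief's Flow structure was probably rather crude.

According to Yangqing Jia, the author of Caffe (2013), he took part in the core development of TensorFlow.

So it is not hard to see that the Tensor structure in TensorFlow draws on both Theano (2012) and Caffe (2013).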

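Symbolic systems

Although Caffe (2013) has a Tensor-like Blob structure, it is still fairly weak compared with the Tensor of Theano (2012) and TensorFlow (2015). The core reason is that Tensor is built on top of a symbolic system, whereas Caffe (2013) just executes code in the most brute-force way.

According to a talk that Tianqi Chen of MXNet gave inside Microsoft Research:

Caffe (2013) is an imperative program.

Theano (2012), TensorFlow (2015), and MXNet (2015) are declarative programs.

A symbolic system needs a built-in structure for parsing mathematical expressions, wrapping the raw imperative statements deeply and turning a white box into a black box.

That takes real effort and a fair amount of code. By comparison, Blob is much simpler to read than Tensor.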
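
To make the imperative/declarative distinction concrete, here is a minimal C++ sketch. It is not how Theano, TensorFlow, or MXNet are implemented; Expr, var, constant, add, mul, and eval are hypothetical names invented for this illustration. The imperative function computes its result statement by statement, while the declarative part first builds an expression tree and only evaluates it later against a set of variable bindings.

```cpp
#include <map>
#include <memory>
#include <stdexcept>
#include <string>
#include <utility>

// Imperative style: every statement runs immediately on concrete values.
inline double imperative_affine(double x, double w, double b) {
    double y = x * w;  // computed right now
    y = y + b;         // computed right now
    return y;
}

// Declarative style: describe the computation first, evaluate it later.
// Expr is a hypothetical node type used only in this sketch.
struct Expr {
    enum Kind { Var, Const, Add, Mul };
    Kind kind = Const;
    std::string name;                // used when kind == Var
    double value = 0.0;              // used when kind == Const
    std::shared_ptr<Expr> lhs, rhs;  // used when kind == Add or Mul
};
using ExprPtr = std::shared_ptr<Expr>;

inline ExprPtr var(const std::string& name) {
    auto e = std::make_shared<Expr>();
    e->kind = Expr::Var;
    e->name = name;
    return e;
}

inline ExprPtr constant(double value) {
    auto e = std::make_shared<Expr>();
    e->kind = Expr::Const;
    e->value = value;
    return e;
}

inline ExprPtr binary(Expr::Kind kind, ExprPtr lhs, ExprPtr rhs) {
    auto e = std::make_shared<Expr>();
    e->kind = kind;
    e->lhs = std::move(lhs);
    e->rhs = std::move(rhs);
    return e;
}

inline ExprPtr add(ExprPtr a, ExprPtr b) { return binary(Expr::Add, std::move(a), std::move(b)); }
inline ExprPtr mul(ExprPtr a, ExprPtr b) { return binary(Expr::Mul, std::move(a), std::move(b)); }

// Walk the tree and compute its value for the given variable bindings.
inline double eval(const ExprPtr& e, const std::map<std::string, double>& env) {
    switch (e->kind) {
        case Expr::Var: {
            auto it = env.find(e->name);
            if (it == env.end()) throw std::runtime_error("unbound variable: " + e->name);
            return it->second;
        }
        case Expr::Const: return e->value;
        case Expr::Add:   return eval(e->lhs, env) + eval(e->rhs, env);
        case Expr::Mul:   return eval(e->lhs, env) * eval(e->rhs, env);
    }
    throw std::logic_error("unknown expression kind");
}
```

Building add(mul(var("x"), var("w")), var("b")) computes nothing by itself; the same tree can be evaluated many times with different bindings. That extra layer is where a symbolic framework gets room to analyze or rewrite the computation, and it is also the extra code Blob does not need.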

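A brief look at Blob's design

Storage

Forward-propagation outputs, back-propagation residuals, and neuron parameters would, naively, each need their own storage structure.

Blob deliberately avoids designing several structures, and in this respect it follows Tensor.

You are free to lay out 1D, 2D, 3D, 4D, or even nD array storage, which makes this kind of storage quite flexible.

Functionality

Unfortunately, working with multi-dimensional arrays is a chore in programming.

The multi-dimensional arrays of plain C are weak; even finding out the size is hard.

Using the STL is a good idea: nested STL containers become, in data-structure terms, a generalized list.

Although a generalized list does more than a plain C multi-dimensional array, it is still not fully satisfactory.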
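
The contrast between a plain C array and nested STL containers can be sketched as follows. The function names are made up for this example. It shows that a C array handed to a function carries no size information, while nested vectors know their sizes but give no guarantee that the rows have equal length.

```cpp
#include <cstddef>
#include <vector>

// A plain C array decays to a pointer when passed to a function, so the
// callee cannot recover its extents; the caller must pass them by hand.
inline double sum_c_array(const double* data, std::size_t rows, std::size_t cols) {
    double total = 0.0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            total += data[r * cols + c];
    return total;
}

// Nested vectors (a "generalized list") carry their own sizes ...
using Matrix = std::vector<std::vector<double>>;

inline double sum_nested(const Matrix& m) {
    double total = 0.0;
    for (const auto& row : m)
        for (double v : row) total += v;
    return total;
}

// ... but nothing forces every row to have the same length.
inline bool is_rectangular(const Matrix& m) {
    for (const auto& row : m)
        if (row.size() != m.front().size()) return false;
    return true;
}
```

Each inner vector also owns its own heap allocation, so a nested container is not one contiguous block of memory.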

——————————————————————————————————————————————————

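The most annoying part is that CUDA does not encourage GPU code to work on multi-dimensional arrays; its aligned allocation helpers go up to 3D arrays at most.

Without the memory-alignment optimization CUDA provides for multi-dimensional arrays, address computations for loads and stores become very frequent and IO throughput degrades badly.

From a memory point of view, linear memory is clearly convenient to access, while nD memory is far worse.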

——————————————————————————————————————————————————

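From the design of SyncedMemory one can more or less infer that Caffe, for the sake of speed, uses linear host/GPU memory throughout.

So, to make linear memory emulate nD memory, some offset arithmetic is needed on each access.

Most of what Blob does is extend the linear SyncedMemory logically, so that it behaves like a multi-dimensional array.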
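
The offset arithmetic can be sketched for an arbitrary number of axes as below. This assumes a row-major layout, in which the last axis varies fastest; element_count and flat_index are hypothetical helper names, not part of Caffe.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Total number of elements described by a shape (product of all axis lengths).
inline std::size_t element_count(const std::vector<int>& shape) {
    std::size_t n = 1;
    for (int len : shape) n *= static_cast<std::size_t>(len);
    return n;
}

// Row-major flat offset of a multi-dimensional index: the last axis varies fastest.
inline std::size_t flat_index(const std::vector<int>& shape,
                              const std::vector<int>& index) {
    if (shape.size() != index.size()) throw std::invalid_argument("rank mismatch");
    std::size_t offset = 0;
    for (std::size_t axis = 0; axis < shape.size(); ++axis) {
        if (index[axis] < 0 || index[axis] >= shape[axis])
            throw std::out_of_range("index out of range");
        // Horner-style accumulation: scale by this axis length, then add the index.
        offset = offset * static_cast<std::size_t>(shape[axis])
               + static_cast<std::size_t>(index[axis]);
    }
    return offset;
}
```

For a shape of {2, 3, 4, 5}, the index {1, 2, 3, 4} maps to flat offset 119, the last of the 120 elements, so a single linear buffer is enough to back the whole 4D space.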

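Tensors and axis design

Early neural network code usually worked in a 1D space: each sample had one input vector.

At the end of the last century, LeCun and others advocated replacing single samples with mini-batches in SGD, and that is when axis design became useful.

axis=0 is used for batch_size, and each sample's vector within the batch moves to axis=1.

This layout is still the dominant one in today's neural-network NLP (NNNLP) tasks.

In the early 1990s, LeCun combined Fukushima's Neocognitron with the BP algorithm of his mentor Hinton and turned it into a trainable CNN, which extended the axes further.

The axes added by CNN are called spatial axes and are placed at axis=2 and beyond.

The axis=1 of the original network, combined with the channel concept of image files and the feature-map concept of CNN, becomes the channels axis.

This gives Blob its most frequently used 4-axis space: (batch_size, channels, height, width).

In Caffe, batch_size is called num, a name that reads a little more generally.

Every axis has a length; describing the axis space needs a shape facility, and changing it needs a reshape facility.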
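
As a small illustration of shape and reshape on the 4-axis space, the sketch below collapses (num, channels, height, width) into (num, channels*height*width), the layout a fully connected layer typically expects. The helper names are hypothetical. The check is_pure_view_change captures the case where a reshape only relabels the axes of the same linear buffer; the Blob::reshape shown later in this chapter is more permissive and may also grow the buffer.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Number of elements described by a shape.
inline long long count_of(const std::vector<int>& shape) {
    long long n = 1;
    for (int len : shape) n *= len;
    return n;
}

// A reshape that keeps the element count only relabels the axes of the same
// linear buffer; no data has to move.
inline bool is_pure_view_change(const std::vector<int>& from,
                                const std::vector<int>& to) {
    return count_of(from) == count_of(to);
}

// Collapse every axis after the batch axis: (N, C, H, W) -> (N, C*H*W).
inline std::vector<int> flatten_per_sample(const std::vector<int>& shape) {
    if (shape.empty()) throw std::invalid_argument("shape needs at least one axis");
    int features = 1;
    for (std::size_t i = 1; i < shape.size(); ++i) features *= shape[i];
    return {shape[0], features};
}
```

With a batch of 64 RGB images of 28x28 pixels, the shape {64, 3, 28, 28} flattens to {64, 2352}, and both shapes describe the same 150528 elements.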

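Hands-on code

Starting with Blob, and to keep things readable, the code is extended step by step across chapters; only the trimmed-down code relevant to this chapter is given below.

The complete code is at the GitHub link at the end of this chapter.

Create blob.hpp

Data structure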

template <typename Dtype>
class Blob{
public:
    Blob():data_(),diff_(),count_(0), capacity_(0) {}
    Blob(const vector<int>& shape) :count_(0),capacity_(0) { reshape(shape); }
    void reshape(int num, int channels, int height, int width);
    void reshape(vector<int> shape);
    void reshape(const BlobShape& blob_shape);
    void reshapeLike(const Blob& blob);
    const Dtype* cpu_data() const;
    const Dtype *gpu_data() const;
    const Dtype* cpu_diff() const;
    const Dtype* gpu_diff() const;
    Dtype *mutable_cpu_data();
    Dtype *mutable_gpu_data();
    Dtype *mutable_cpu_diff();
    Dtype *mutable_gpu_diff();
    int num() const { return shape(0); }
    int channels() const { return shape(1); }
    int height() const { return shape(2); }
    int width() const { return shape(3); }
    int count() const{ return count_; }
    int count(int start_axis, int end_axis) const {
        CHECK_GE(start_axis, 0);
        CHECK_LE(start_axis, end_axis);
        CHECK_LE(start_axis, num_axes());
        CHECK_LE(end_axis, num_axes());
        int cnt = 1;
        for (int i = start_axis; i < end_axis; i++) cnt *= shape(i);
        return cnt;
    }
    int count(int start_axis) const{ return count(start_axis, num_axes()); }
    const vector<int> &shape() const{ return shape_; }
    int shape(int axis) const{ return shape_[canonicalAxisIndex(axis)]; }
    int offset(const int n, const int c = 0, const int h = 0,
        const int w = 0) const {
        CHECK_GE(n, 0);
        CHECK_LE(n, num());
        CHECK_GE(channels(), 0);
        CHECK_LE(c, channels());
        CHECK_GE(height(), 0);
        CHECK_LE(h, height());
        CHECK_GE(width(), 0);
        CHECK_LE(w, width());
        return ((n * channels() + c) * height() + h) * width() + w;
    }
    int num_axes() const { return shape_.size(); }
    // idx ranges [-axes,axes)
    // idx(-1) means the last axis
    int canonicalAxisIndex(int axis) const{
        CHECK_GE(axis, -num_axes());
        CHECK_LT(axis, num_axes());
        if (axis < 0) return axis + num_axes();
        else return axis;
    }
    const boost::shared_ptr<SyncedMemory>& data() const { return data_; }
    const boost::shared_ptr<SyncedMemory>& diff() const { return diff_; }
    //    rebind the shared_ptr; the old memory is released if nobody else owns it
    void shareData(const Blob& blob) {
        CHECK_EQ(count(), blob.count());
        data_ = blob.data(); 
    }
    void shareDiff(const Blob& blob) {
        CHECK_EQ(count(), blob.count());
        diff_ = blob.diff();
    }
    void FromProto(const BlobProto& proto, bool need_reshape = true);
    void ToProto(BlobProto* proto, bool write_diff = false);
protected:
    boost::shared_ptr<SyncedMemory> data_, diff_;
    vector<int> shape_;
    int count_, capacity_;
};

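First, a few member variables:

count_ and capacity_ are used by the bookkeeping in reshape: the former is the size requested by the latest reshape, the latter the largest size requested so far.

Every Blob constructor must set both to 0; otherwise reshape compares against garbage and may skip the allocation.

The linear memory is held through shared_ptr, so Blob needs no destructor: once the Blob is destroyed, the memory behind the pointers is reclaimed automatically.

There are two linear memory regions by default, data_ and diff_, storing the data and the residuals respectively.

vector<int> shape_ stores the length of each axis.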

——————————————————————————————————————————————————

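Next, the axis-related functions:

num, channels, height, width, count, and shape are all simple wrappers; note that they are declared as const member functions.

A Blob is often passed as a const reference, as in shareData/shareDiff, and only const member functions can be called through a const object, so these accessors must be const.

This was mentioned briefly in Part 1.

count and shape are both overloaded to offer different ways of access.

canonicalAxisIndex borrows Python's negative axis indexing; if you are not used to the Python habit, you can write it more simply.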
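
The two axis helpers can be tried in isolation with the following sketch. It mirrors the logic of Blob::canonicalAxisIndex and Blob::count(start_axis, end_axis) from the class listing, but as free functions with hypothetical names, and with exceptions standing in for the CHECK_* macros.

```cpp
#include <stdexcept>
#include <vector>

// Map an axis in [-num_axes, num_axes) to [0, num_axes); -1 is the last axis.
inline int canonical_axis(int axis, int num_axes) {
    if (axis < -num_axes || axis >= num_axes)
        throw std::out_of_range("axis out of range");
    return axis < 0 ? axis + num_axes : axis;
}

// Product of the axis lengths in the half-open range [start_axis, end_axis).
inline int count_range(const std::vector<int>& shape, int start_axis, int end_axis) {
    const int n = static_cast<int>(shape.size());
    if (start_axis < 0 || start_axis > end_axis || end_axis > n)
        throw std::out_of_range("bad axis range");
    int cnt = 1;
    for (int i = start_axis; i < end_axis; ++i) cnt *= shape[i];
    return cnt;
}
```

For the shape {2, 3, 4, 5}, canonical_axis(-1, 4) is 3 (the width axis), and count_range(shape, 1, 4) is 60, the number of elements in one sample. An empty range such as count_range(shape, 2, 2) yields 1, the empty product.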

——————————————————————————————————————————————————

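The accessors that wrap SyncedMemory mainly exist to turn void* memory into memory of the computation type.

Memory behind a void* is just raw bytes: the addressable unit is 1 byte (8 bits), and standard C++ does not allow subscripting a void* at all, so it cannot be used for computation directly.

On typical platforms an int or float element takes 32 bits (4 bytes) and a double takes 64 bits (8 bytes).

In C/C++, casting the pointer to the first element changes the element size that subscripting steps by.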
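
A minimal sketch of that cast: the same number of raw bytes holds a different number of elements depending on the type the pointer is cast to. fill_as is a hypothetical helper for this example, and the buffer is assumed to come from an allocator such as std::malloc, which returns memory suitably aligned for any fundamental type.

```cpp
#include <cstddef>
#include <cstdlib>

// Interpret an untyped byte buffer as an array of T and fill it with value.
// Returns how many T elements fit into the buffer.
template <typename T>
std::size_t fill_as(void* raw, std::size_t bytes, T value) {
    T* typed = static_cast<T*>(raw);          // subscripting now steps by sizeof(T)
    const std::size_t n = bytes / sizeof(T);  // whole elements that fit
    for (std::size_t i = 0; i < n; ++i) typed[i] = value;
    return n;
}
```

Blob's cpu_data/mutable_cpu_data family plays the same role: SyncedMemory hands back a void*, and Blob casts it to Dtype* so callers can index elements instead of bytes.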

——————————————————————————————————————————————————

reshape looks heavily overloaded, but the real work lives in void reshape(vector<int> shape).

The others are thin wrappers around it.

——————————————————————————————————————————————————

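offset is a very important function: it computes a relative offset, which is what gives the linear memory its logical multi-dimensional structure.

In DataLayer, filling a Blob from Datum objects looks like this: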

for (int i = 0; i < batch_size; i++){
    // bind by reference ('&') so the datum stays valid (important!)
    Datum &datum = *(reader.full().pop("Waiting for Datum data"));
    int offset = batch->data.offset(i);
    // point transformed_data at this sample's slice of the batch blob
    transformed_data.set_cpu_data(base_data + offset);
    // record the label, then transform the datum and write it into that slice
    if (has_labels) base_label[i] = datum.label();
    ptr_transformer->transform(datum, &transformed_data);
    // hand the datum back so the reader can refill it
    reader.free().push(&datum);
}

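Here, for each sample in the batch, the pointer advances by channels*height*width elements and lands directly on the first element of the next image.

More generally, base_data += data.offset(0,1) jumps to the first element of the next channel.

Because the linear space is contiguous, stepping by such an offset costs a single addition, which makes emulating a multi-dimensional space very cheap.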
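
The per-sample stepping can be reproduced without any of the DataLayer machinery. In the sketch below, Shape4 and label_samples are hypothetical stand-ins; only the offset formula is taken from Blob::offset. Each sample's slice of one contiguous batch buffer is filled with that sample's index.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical 4-axis shape for this sketch: (num, channels, height, width).
struct Shape4 {
    int num, channels, height, width;

    // Same row-major formula as Blob::offset in the class listing.
    std::size_t offset(int n, int c = 0, int h = 0, int w = 0) const {
        return ((static_cast<std::size_t>(n) * channels + c) * height + h) * width + w;
    }
    // One past the last sample is the total element count.
    std::size_t count() const { return offset(num); }
};

// Write each sample's index into its own slice of one contiguous buffer.
inline std::vector<float> label_samples(const Shape4& s) {
    std::vector<float> batch(s.count(), 0.0f);
    float* base = batch.data();
    const std::size_t per_sample = s.offset(1);  // channels * height * width
    for (int i = 0; i < s.num; ++i) {
        float* sample = base + s.offset(i);      // first element of sample i
        for (std::size_t k = 0; k < per_sample; ++k)
            sample[k] = static_cast<float>(i);
    }
    return batch;
}
```

For Shape4{2, 3, 4, 5}, offset(1) is 60 and offset(0, 1) is 20, so sample 1 starts at element 60 and channel 1 of sample 0 starts at element 20.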

——————————————————————————————————————————————————

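The two share functions simply replace data_ and diff_; because shared_ptr is used, the SyncedMemory that was held before is released automatically once nothing else refers to it.

When the network needs cross-validation, there is no need to copy parameters from the training net to the test net.

It is enough to share each parameter Blob of the training net with its counterpart in the test net.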
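
The sharing behaviour can be observed with a small stand-in. ParamBlob below is a hypothetical type that models only the shared storage, and it uses std::shared_ptr where the article's code uses boost::shared_ptr; the ownership semantics relevant here are the same.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Hypothetical stand-in for a parameter Blob: only the shared storage is modeled.
struct ParamBlob {
    std::shared_ptr<std::vector<float>> data;

    explicit ParamBlob(std::size_t count)
        : data(std::make_shared<std::vector<float>>(count, 0.0f)) {}

    // Point at the other blob's storage. Our previous storage is freed
    // automatically as soon as no other owner remains.
    void share_data(const ParamBlob& other) { data = other.data; }
};
```

After test_blob.share_data(train_blob), both objects refer to one buffer, so an update written through the training blob is immediately visible through the test blob and no copy is made.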

——————————————————————————————————————————————————

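FromProto and ToProto deserialize from and serialize to the protobuf format.

Their only use is to snapshot the network's parameter Blobs so that training can be resumed or the model tested offline.

Implementation

Here are a few of the more important implementations.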

template<typename Dtype>
void Blob<Dtype>::reshape(vector<int> shape){
    count_ = 1;
    shape_.resize(shape.size());
    for (int i = 0; i < shape.size(); ++i) {
        count_ *= shape[i];
        shape_[i] = shape[i];
    }
    if (count_ > capacity_) {
        capacity_ = count_;
        data_.reset(new SyncedMemory(capacity_ * sizeof(Dtype)));
        diff_.reset(new SyncedMemory(capacity_ * sizeof(Dtype)));
    }
}

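As you can see, reshape sizes the SyncedMemory at capacity_ * sizeof(Dtype) bytes.

Recall also that SyncedMemory(size) does not immediately start its state-transition automaton to allocate host or GPU memory.

Allocation happens only when one of Blob::cpu_data/gpu_data/mutable_cpu_data/mutable_gpu_data is called.

This resembles lazy evaluation in functional programming: declaring Blobs carelessly does little harm, because a Blob that is never used costs no memory.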
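
The interplay of count_, capacity_, and lazy allocation can be sketched with two small classes. LazyBuffer is a hypothetical, CPU-only stand-in for SyncedMemory that allocates on first access, and MiniBlob is a hypothetical miniature of Blob that keeps only the reshape logic from the listing above. Neither is the real Caffe implementation.

```cpp
#include <cstddef>
#include <cstdlib>
#include <memory>
#include <vector>

// Stand-in for SyncedMemory: remembers the requested size, allocates on first use.
class LazyBuffer {
public:
    explicit LazyBuffer(std::size_t bytes) : bytes_(bytes), storage_(nullptr, &std::free) {}
    bool allocated() const { return storage_ != nullptr; }
    void* mutable_data() {
        // Zero-filled allocation happens here, not in the constructor.
        if (!storage_) storage_.reset(std::calloc(bytes_ ? bytes_ : 1, 1));
        return storage_.get();
    }
private:
    std::size_t bytes_;
    std::unique_ptr<void, void (*)(void*)> storage_;
};

// Miniature of Blob: same count_/capacity_ bookkeeping as the reshape above.
template <typename Dtype>
class MiniBlob {
public:
    MiniBlob() : count_(0), capacity_(0) {}  // both must start at 0
    void reshape(const std::vector<int>& shape) {
        count_ = 1;
        for (int len : shape) count_ *= len;
        if (count_ > capacity_) {            // grow only when the request is larger
            capacity_ = count_;
            data_.reset(new LazyBuffer(capacity_ * sizeof(Dtype)));
        }
    }
    Dtype* mutable_cpu_data() {
        if (!data_) return nullptr;          // reshape has not been called yet
        return static_cast<Dtype*>(data_->mutable_data());
    }
    bool allocated() const { return data_ && data_->allocated(); }
    int count() const { return count_; }
    int capacity() const { return capacity_; }
private:
    std::shared_ptr<LazyBuffer> data_;
    int count_, capacity_;
};
```

After reshape({2, 3, 4, 5}) the blob reports a count of 120 but has allocated nothing; the first call to mutable_cpu_data triggers the allocation. A later reshape({2, 3}) lowers the count to 6 while the capacity stays at 120, so the existing buffer is reused.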

template<typename Dtype>
void Blob<Dtype>::ToProto(BlobProto* proto, bool write_diff){
    proto->clear_shape();
    proto->clear_data();
    proto->clear_diff();
    //do not use proto->shape() cause it is a const method
    for (int i = 0; i < shape_.size(); i++)  proto->mutable_shape()->add_dim(shape_[i]);
    const Dtype *data = cpu_data();
    const Dtype *diff = cpu_diff();
    for (int i = 0; i < count_; i++)  proto->add_data(data[i]);
    if (write_diff)
        for (int i = 0; i < count_; i++)  proto->add_diff(diff[i]);
}