Protocol Buffers (1): Serialization, Compilation, and Usage

Date: 2019-11-07

Protocol Buffers docs: https://developers.google.com/protocol-buffers/docs/overview
GitHub: https://github.com/protocolbuffers/protobuf

Serialization and Deserialization

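Sometimes we want to take a "snapshot" of a data structure or object, either to save it to a file or to send it to another application. For example, while training a neural network we save the network weights at different stages as model files. If training stops unexpectedly, we can reload a model file, restore the model, and continue training.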

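Converting a data structure or object into a byte stream in some format is called serialization. Its purpose is to save the current state so that the data structure or object can be restored when needed. (Serialization does not include the functions associated with an object, so from here on only data structures are mentioned.) Deserialization is the reverse process: it reads the byte stream and restores the data structure according to the agreed format. The figure illustrating this is from GeeksforGeeks.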

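Before introducing any specific technology, let us think through what serialization and deserialization involve:

While code is running, data structures and objects live in memory. Their data members may sit next to each other, or they may be scattered across non-contiguous memory regions, such as blocks referenced through pointers.
Bytes in a file are stored sequentially, so saving a data structure to a file means flattening all of its data members and concatenating them.
Plain concatenation is usually not enough, because a byte stream has no natural boundaries. Serialization therefore has to follow an agreed format (a protocol), so that deserialization knows which bytes belong to which data member. The format may need to specify identifiers for data members, start and end positions, lengths, delimiters, and so on.

As the points above show, the format matters most: it directly determines the efficiency of serialization and deserialization, the size of the byte stream, and its readability.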
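The points above can be made concrete with a minimal sketch. It assumes a made-up format, unrelated to Protobuf, in which a string is written as a 4-byte little-endian length followed by its bytes, and an integer as 4 little-endian bytes. The names Record, put_u32, get_u32, serialize, and deserialize are hypothetical and exist only for this illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical structure: `name` owns heap memory, so the members are not
// contiguous in memory and cannot simply be dumped byte for byte.
struct Record {
    std::string name;
    std::int32_t id = 0;
};

// Append a 32-bit value as 4 little-endian bytes.
void put_u32(std::vector<std::uint8_t>& out, std::uint32_t v) {
    for (int i = 0; i < 4; ++i) {
        out.push_back(static_cast<std::uint8_t>((v >> (8 * i)) & 0xFF));
    }
}

// Read 4 little-endian bytes starting at `pos`, then advance `pos`.
std::uint32_t get_u32(const std::vector<std::uint8_t>& in, std::size_t& pos) {
    assert(pos + 4 <= in.size());
    std::uint32_t v = 0;
    for (int i = 0; i < 4; ++i) {
        v |= static_cast<std::uint32_t>(in[pos + i]) << (8 * i);
    }
    pos += 4;
    return v;
}

// Assumed layout: [name length][name bytes][id].
// The length prefix is what marks the boundary between the two members.
std::vector<std::uint8_t> serialize(const Record& r) {
    std::vector<std::uint8_t> out;
    put_u32(out, static_cast<std::uint32_t>(r.name.size()));
    out.insert(out.end(), r.name.begin(), r.name.end());
    put_u32(out, static_cast<std::uint32_t>(r.id));
    return out;
}

// Walk the byte stream in the same order the writer used.
Record deserialize(const std::vector<std::uint8_t>& in) {
    std::size_t pos = 0;
    Record r;
    const std::uint32_t len = get_u32(in, pos);
    assert(pos + len <= in.size());
    r.name.assign(in.begin() + pos, in.begin() + pos + len);
    pos += len;
    r.id = static_cast<std::int32_t>(get_u32(in, pos));
    return r;
}
```

Calling deserialize(serialize(r)) gives back a Record equal to r. Note that the reader only works because it knows the writer's layout in advance, which is exactly the role of the format.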

Protocol Buffers Overview

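The subject of this article, Protocol Buffers (Protobuf for short), is a serialization technology open-sourced by Google. In the words of the official documentation: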

What are protocol buffers?
Protocol buffers are Google's language-neutral, platform-neutral, extensible mechanism for serializing structured data – think XML, but smaller, faster, and simpler.
You define how you want your data to be structured once, then you can use special generated source code to easily write and read your structured data to and from a variety of data streams and using a variety of languages.

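Protobuf is cross-language and cross-platform, and it is smaller, faster, and simpler than XML and JSON. XML and JSON are designed as character text so that they are human-readable and self-describing, which makes them larger and more cumbersome to encode and decode. Protobuf serializes to a binary stream, which is smaller but gives up readability; as we will see later, readability can be recovered in another way. As for how to understand "You define how you want your data to be structured once", see the figure, whose material comes from the Protocol Buffers homepage.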

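First comes the proto file, in which we define the data structure to be serialized, such as message Person in the figure. The Protobuf compiler, protoc.exe, generates source files for encoding and decoding (.cc and .h for C++). These define a class Person whose member variables match the definition in the proto file. To serialize, create a Person object, assign its members, and call a serialization member function to save the object to a file. To deserialize, read the file, restore the Person object, and read the data members you need.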

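The proto file defines only the structure of the data (name, id, email). The actual values (1234, "John Doe", "jdoe@example.com") are stored in the serialized file. A little thought suggests that the serialized file must also contain some auxiliary information that ties the values to the structure, so that deserialization can assign each value to the right member.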

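The workflow is: write the proto file, run protoc to generate the code, then use the generated class to serialize and deserialize.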

With a general understanding of Protobuf in place, let us look at how to build and use it.

Building Protocol Buffers for C++

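Download the source code for the version you need from the GitHub releases page, and see cmake/README.md for how to build from source. I used VS2015 and built with the following commands: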

# The source is in the protobuf-3.7.1 directory; first run: cd protobuf-3.7.1/cmake
mkdir build
cd build
mkdir solution
cd solution
cmake -G "Visual Studio 14 2015 Win64" -DCMAKE_INSTALL_PREFIX=../../../../install ../.. -Dprotobuf_BUILD_TESTS=OFF

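Running these commands generates a Visual Studio solution in the solution directory. Build the whole solution; its INSTALL project creates the install folder (at protobuf-3.7.1/../install), which contains three subfolders: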

bin - that contains protobuf protoc.exe compiler;

include - that contains C++ headers and protobuf *.proto files;

lib - that contains linking libraries and CMake configuration files for protobuf package.

With these three folders we can do serialization and deserialization.

Using Protocol Buffers in C++

The following example shows how to use Protobuf.

Create a proto file named example.proto with the following content:

syntax = "proto2";

package example;

message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;
}

The format of each field is:
required/optional/repeated FieldType FieldName = FieldNumber (a unique number in the current message)

Field Numbers are used to identify your fields in the message binary format.

required: a well-formed message must have exactly one of this field.

optional: a well-formed message can have zero or one of this field (but not more than one).

repeated: this field can be repeated any number of times (including zero) in a well-formed message. The order of the repeated values will be preserved.

Copy example.proto into the bin directory and run the following command:

protoc.exe example.proto --cpp_out=./

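--cpp_out specifies the directory for the generated C++ files; --java_out, --python_out, and similar options specify output directories for other languages. The command above generates example.pb.cc and example.pb.h in the current directory. They define the class Person in the namespace example, which derives publicly from ::google::protobuf::Message. Person has the data members name_, id_, and email_, along with the corresponding set, has, and other member functions.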
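To give a feel for the shape of such a class, here is a hand-written and heavily simplified stand-in. It is not the code protoc emits: the real class derives from ::google::protobuf::Message and contains far more machinery. The namespace sketch, the has_bits_ bitmask layout, and example_usage are assumptions made for illustration; only the accessor names follow the pattern described above.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

namespace sketch {  // hypothetical namespace, not the generated `example`

// Simplified illustration of per-field accessors with presence tracking.
class Person {
public:
    // name (required)
    bool has_name() const { return (has_bits_ & 0x1u) != 0; }
    const std::string& name() const { return name_; }
    void set_name(const std::string& value) { has_bits_ |= 0x1u; name_ = value; }

    // id (required)
    bool has_id() const { return (has_bits_ & 0x2u) != 0; }
    std::int32_t id() const { return id_; }
    void set_id(std::int32_t value) { has_bits_ |= 0x2u; id_ = value; }

    // email (optional)
    bool has_email() const { return (has_bits_ & 0x4u) != 0; }
    const std::string& email() const { return email_; }
    void set_email(const std::string& value) { has_bits_ |= 0x4u; email_ = value; }
    void clear_email() { email_.clear(); has_bits_ &= ~0x4u; }

    // A message is well formed only when every required field has been set.
    bool IsInitialized() const { return has_name() && has_id(); }

private:
    std::string name_;
    std::int32_t id_ = 0;
    std::string email_;
    std::uint32_t has_bits_ = 0;  // one presence bit per field (assumed layout)
};

// Usage: presence is tracked separately from the stored value.
inline void example_usage() {
    Person p;
    assert(!p.IsInitialized());  // required fields not set yet
    p.set_name("John Doe");
    p.set_id(1234);
    assert(p.IsInitialized());   // email is optional, so this is enough
    assert(!p.has_email());
}

}  // namespace sketch
```

The point of the has functions is that an optional field that was never set can be told apart from one that was set to an empty or zero value.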

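Next, create a test project in Visual Studio:

add the include directory to Additional Include Directories,
add the lib directory to Additional Library Directories and the .lib files to Additional Dependencies,
add the generated example.pb.cc and example.pb.h to the project,
create main.cpp and #include "example.pb.h".

Add the following content: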

#include <fstream>
#include <iostream>

#include "example.pb.h"

using namespace std;

int main()
{
    // Set data
    example::Person msg;
    msg.set_id(1234);
    msg.set_name("John Doe");
    msg.set_email("jdoe@example.com");

    // Serialization
    fstream output("./Person.bin", ios::out | ios::binary);
    // SerializePartialToOstream allows missing required fields;
    // SerializeToOstream requires all of them to be set.
    msg.SerializePartialToOstream(&output);
    output.close();

    // Deserialization
    example::Person msg1;
    fstream input("./Person.bin", ios::in | ios::binary);
    msg1.ParseFromIstream(&input);
    input.close();

    // Get data
    cout << msg1.id() << endl; // 1234
    cout << msg1.name() << endl; // John Doe
    cout << msg1.email() << endl; // jdoe@example.com

    return 0;
}

The code above saves the object to Person.bin and then deserializes the file to restore the object. The contents of Person.bin are as follows:

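Some patterns are visible: the byte before each string, read as an integer, equals the length of that string. Is that a coincidence? If a string is very long, say 600 characters, which is more than one byte can represent, what happens? And what do the other bytes mean?

These questions, such as how Protobuf encodes data and the details of the generated .cc and .h files, are left for later articles.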
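As a brief preview of the length question, the sketch below encodes the same three fields by hand, following the publicly documented Protobuf wire format: each field starts with a key equal to (field number << 3) | wire type, integers are written as base-128 varints, and strings are written as a varint length followed by the bytes. It does not use the Protobuf library, and the helper names (Bytes, put_varint, put_key, put_int32_field, put_string_field, encode_person) are hypothetical.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

using Bytes = std::vector<std::uint8_t>;

// Base-128 varint: 7 payload bits per byte, lowest group first. The high bit
// of a byte is set when more bytes follow.
void put_varint(Bytes& out, std::uint64_t v) {
    while (v >= 0x80) {
        out.push_back(static_cast<std::uint8_t>((v & 0x7F) | 0x80));
        v >>= 7;
    }
    out.push_back(static_cast<std::uint8_t>(v));
}

// Field key = (field number << 3) | wire type.
void put_key(Bytes& out, std::uint32_t field_number, std::uint32_t wire_type) {
    assert(wire_type <= 5);
    put_varint(out, (static_cast<std::uint64_t>(field_number) << 3) | wire_type);
}

// Wire type 0: varint. A negative int32 is sign-extended to 64 bits first.
void put_int32_field(Bytes& out, std::uint32_t field_number, std::int32_t value) {
    put_key(out, field_number, 0);
    put_varint(out, static_cast<std::uint64_t>(static_cast<std::int64_t>(value)));
}

// Wire type 2: length-delimited. The length itself is a varint, so it grows
// to two or more bytes once the string is longer than 127 bytes.
void put_string_field(Bytes& out, std::uint32_t field_number, const std::string& value) {
    put_key(out, field_number, 2);
    put_varint(out, value.size());
    out.insert(out.end(), value.begin(), value.end());
}

// Encode name = 1, id = 2, email = 3, in field-number order.
Bytes encode_person(const std::string& name, std::int32_t id, const std::string& email) {
    Bytes out;
    put_string_field(out, 1, name);
    put_int32_field(out, 2, id);
    put_string_field(out, 3, email);
    return out;
}
```

For "John Doe", 1234, and "jdoe@example.com" this produces 31 bytes: 0A 08 followed by the 8 name bytes, 10 D2 09 for the id, and 1A 10 followed by the 16 email bytes. So the byte before each string is indeed its length, and a length of 600 would be written as the two bytes D8 04 rather than one.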

Readability of Protocol Buffers

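Binary files are smaller, but their readability is undoubtedly poor. One advantage of XML and JSON is readability, and readable means editable and checkable by hand. Is that out of reach for Protobuf?

Not at all. Let us continue by adding the following code to the main function: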

#include "google/protobuf/io/zero_copy_stream_impl.h"
#include "google/protobuf/text_format.h"

int main()
{
    // ……

    // Serialization to text file
    fstream fw("./Person.txt", ios::out | ios::binary);
    {
        // The adaptor flushes its buffer when destroyed, so it must go
        // out of scope before fw is closed.
        google::protobuf::io::OstreamOutputStream output(&fw);
        google::protobuf::TextFormat::Print(msg, &output);
    }
    fw.close();

    // Deserialization from text file
    example::Person msg2;
    fstream fr("./Person.txt", ios::in | ios::binary);
    google::protobuf::io::IstreamInputStream input(&fr);
    google::protobuf::TextFormat::Parse(&input, &msg2);
    fr.close();

    // Get data
    cout << msg2.id() << endl; // 1234
    cout << msg2.name() << endl; // John Doe
    cout << msg2.email() << endl; // jdoe@example.com

    return 0;
}

This code saves the object as a text file and then restores it. Opening Person.txt shows the following content:

name: "John Doe"
id: 1234
email: "jdoe@example.com"

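It looks a lot like JSON: it is also made of key-value pairs.

With a text file we can read, check, and modify the serialized data directly, and convert freely between the binary file and the text file: for example, edit the text file, restore it to an object, and export a binary file again.

I hope this article has given you a first understanding of Protocol Buffers. Later articles will look in depth at Protobuf's encoding and decoding and at details of its source code.

That's all.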
