Kingdee Suishouji (隨手記) Team Share: Still Using JSON? Protobuf Makes Data Transfer Leaner and Faster (Principles)

Date: 2019-12-08

Author: Ding Tongzhou (丁同舟). Reposted from the WeChat official account "隨手記技術團隊" (Suishouji Tech Team).

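1. Preface

Like mobile IM, which has to care about transfer efficiency and network traffic, the Suishouji client places high demands on the size and speed of some of the data it exchanges with the server. Ordinary formats such as JSON or XML could no longer meet those demands, so we decided to adopt Google's Protocol Buffers for efficient data transfer.

(This article is also published at: http://www.52im.net/thread-1510-1-1.html)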

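2. Series Articles

This is the first article in the series. The full list is:

"Kingdee Suishouji Team Share: Still Using JSON? Protobuf Makes Data Transfer Leaner and Faster (Principles)" (this article)
"Kingdee Suishouji Team Share: Still Using JSON? Protobuf Makes Data Transfer Leaner and Faster (Practice)"

If you also plan to learn IM development systematically, you can read "Getting Started in One Article: Developing Mobile IM from Scratch".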

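3. References

"Protobuf Communication Protocol Explained: Code Demos, Detailed Principles, and More"
"A Java Code Demo Based on Protocol Buffer"
"How to Choose a Data Transfer Format for Instant Messaging Applications"
"Strongly Recommended: Use Protobuf as the Data Transfer Format for Your Instant Messaging Application"
"Full Evaluation: Is Protobuf Really 5x Faster Than JSON?"
"Technical Issues in Mobile IM Development (Including Communication Protocol Selection)"
"A Brief Look at the Pitfalls of Mobile IM Development: Architecture Design, Communication Protocols, and Clients"
"Theory Meets Practice: A Detailed Walkthrough of a Typical IM Communication Protocol Design"
"How to Use Google's Protobuf in NodeJS, Explained"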

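4. Introduction to Protobuf

Protocol Buffers is a cross-platform, multi-language, open-source serialization format proposed by Google. Its syntax currently comes in two versions, proto2 and proto3.

Compared with XML and JSON, its main advantages are that the encoded data is smaller and that encoding and decoding are faster, while the format stays simple to use.

For a custom data structure, Protobuf's code generator can produce source files for different languages, which makes reading and writing the data very convenient.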

Suppose we have the following data in JSON format:

    {
      "id": 1,
      "name": "jojo",
      "email": "123@qq.com"
    }

Encoded as JSON (with whitespace removed), this takes 43 bytes:

7b226964 223a312c 226e616d 65223a22 6a6f6a6f 222c2265 6d61696c 223a2231 32334071 712e636f 6d227d

Encoded with Protobuf, the same data takes only 20 bytes:

0a046a6f 6a6f1001 1a0a3132 33407171 2e636f6d

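5. Encoding Principles

Compared with plain-text formats such as JSON and XML, the main reason Protobuf is compact and fast is its distinctive encoding. "Protobuf Communication Protocol Explained: Code Demos, Detailed Principles, and More" gives a good analysis of Protobuf's encoding.

For example, a small int32 value can be represented in just 1 byte, because Protobuf uses Varint encoding.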

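6. How Varint Works

In a Varint, the most significant bit of each byte is a continuation flag and the remaining 7 bits carry data. A flag of 1 means the next byte also belongs to the number; 0 means this is the last byte.

For example, the number 300 is encoded as the Varint 1010 1100 0000 0010:
(Figure: Varint encoding of 300. Image from "Protobuf Communication Protocol Explained: Code Demos, Detailed Principles, and More")

Note:
The 7-bit groups are stored least-significant group first (little-endian order), so a decoder strips the flag bit from each byte and reverses the order of the groups before joining them: 010 1100 and 000 0010 become 000 0010 010 1100, which is 100101100 in binary, or 300.

Varint works poorly for signed numbers, however. A negative number has its sign bit (the highest bit) set, so none of the high-order bytes can be dropped and the Varint takes the maximum length regardless of magnitude: 5 bytes for a 32-bit value. In Protobuf, negative int32 and int64 values are sign-extended to 64 bits, so -1 always takes 10 bytes (ff ff ff ff ff ff ff ff ff 01).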
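The rules above can be sketched in a few lines of C++. This is a minimal illustration, not Protobuf's own implementation; the function names encode_varint, decode_varint, encode_int32 and check_varint_examples are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Encode an unsigned value as a Varint: 7 data bits per byte, least-significant
// group first, continuation flag (0x80) set on every byte except the last.
std::vector<uint8_t> encode_varint(uint64_t value) {
    std::vector<uint8_t> out;
    while (value >= 0x80) {
        out.push_back(static_cast<uint8_t>((value & 0x7F) | 0x80));
        value >>= 7;
    }
    out.push_back(static_cast<uint8_t>(value));
    return out;
}

// Decode one Varint from the front of `bytes`; `*consumed` receives the number
// of bytes read. This sketch does not report truncated input.
uint64_t decode_varint(const std::vector<uint8_t>& bytes, std::size_t* consumed) {
    assert(consumed != nullptr);
    uint64_t result = 0;
    std::size_t i = 0;
    for (int shift = 0; i < bytes.size() && shift < 64; shift += 7) {
        const uint8_t b = bytes[i++];
        result |= static_cast<uint64_t>(b & 0x7F) << shift;
        if ((b & 0x80) == 0) break;  // flag cleared: this was the last byte
    }
    *consumed = i;
    return result;
}

// A negative int32 is sign-extended to 64 bits before it is written,
// which is why it always occupies the maximum Varint length.
std::vector<uint8_t> encode_int32(int32_t value) {
    return encode_varint(static_cast<uint64_t>(static_cast<int64_t>(value)));
}

// Usage sketch: checks the values quoted in the text.
void check_varint_examples() {
    const std::vector<uint8_t> three_hundred = encode_varint(300);
    assert(three_hundred.size() == 2);
    assert(three_hundred[0] == 0xAC && three_hundred[1] == 0x02);  // 1010 1100 0000 0010
    assert(encode_varint(1).size() == 1);    // small values fit in one byte
    assert(encode_int32(-1).size() == 10);   // negatives take the maximum length
}
```

The decoder never swaps bytes explicitly: shifting each 7-bit group left by 0, 7, 14, ... bits places the groups in the reversed order described in the note.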

Protobuf introduces ZigZag encoding, used by the sint32 and sint64 types, to solve this problem.

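7. ZigZag Encoding

The cnblogs (博客園) post "Integer Compression Encoding: ZigZag" explains the ZigZag encoding in detail.

ZigZag encoding orders integers by ascending absolute value. It maps each signed integer through the function h(n) = (n<<1)^(n>>31) (for sint64, h(n) = (n<<1)^(n>>63)), where >> is an arithmetic shift, to an unsigned 32-bit value: 0 maps to 0, -1 to 1, 1 to 2, -2 to 3, and so on. The result is then written as a Varint.

As for why the ZigZag encoding of 64 is 80 01: h(64) = 128, and the Varint of 128 is 80 01. "Integer Compression Encoding: ZigZag" also explains why the encoding is uniquely decodable.

With ZigZag encoding, any number with a small absolute value is represented in few bytes, which solves the problem of negative numbers producing long Varints.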
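A minimal C++ sketch of the mapping follows. The function names are hypothetical. Because right-shifting a negative signed integer is implementation-defined in older C++ standards, the sketch builds the value that the arithmetic shift n>>31 would produce (all ones for negatives, zero otherwise) explicitly.

```cpp
#include <cassert>
#include <cstdint>

// h(n) = (n << 1) ^ (n >> 31) for sint32, written without shifting a negative value.
uint32_t zigzag_encode32(int32_t n) {
    const uint32_t u = static_cast<uint32_t>(n);
    const uint32_t sign_mask = (n < 0) ? 0xFFFFFFFFu : 0u;  // result of the arithmetic shift n >> 31
    return (u << 1) ^ sign_mask;
}

// Inverse mapping: the lowest bit says whether the original value was negative.
int32_t zigzag_decode32(uint32_t z) {
    return static_cast<int32_t>((z >> 1) ^ (0u - (z & 1u)));
}

// h(n) = (n << 1) ^ (n >> 63) for sint64.
uint64_t zigzag_encode64(int64_t n) {
    const uint64_t u = static_cast<uint64_t>(n);
    const uint64_t sign_mask = (n < 0) ? ~static_cast<uint64_t>(0) : 0;
    return (u << 1) ^ sign_mask;
}

// Usage sketch: checks the mappings quoted in the text.
void check_zigzag_examples() {
    assert(zigzag_encode32(0) == 0u);
    assert(zigzag_encode32(-1) == 1u);
    assert(zigzag_encode32(1) == 2u);
    assert(zigzag_encode32(-2) == 3u);
    assert(zigzag_encode32(64) == 128u);   // 128 as a Varint is 80 01
    assert(zigzag_decode32(zigzag_encode32(-300)) == -300);
}
```

After the mapping, -1 becomes 1 and fits in a single Varint byte instead of ten.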

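8. T-V and T-L-V

A Protobuf message is a series of serialized Tag-Value pairs. The Tag is made up of the field number and the wire type; the Value is the encoded binary form of the source data.

Suppose we have a message like this:

    message Person {
      int32 id = 1;
      string name = 2;
    }

Here the id field has field number 1, and its wire type is the number assigned to the int32 type, which is 0 (Varint). The encoded Tag for id is (field_number << 3) | wire_type = 0000 1000: the low 3 bits identify the wire type and the remaining bits identify the field number.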
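The tag formula can be written out as a small C++ sketch. The names WireType, make_tag, tag_field_number and tag_wire_type are chosen for this example only, and just the two wire types used in this article are listed.

```cpp
#include <cassert>
#include <cstdint>

// The two wire types that appear in this article.
enum WireType : uint32_t {
    kVarint = 0,          // e.g. int32
    kLengthDelimited = 2  // e.g. string
};

// Tag = (field_number << 3) | wire_type
uint32_t make_tag(uint32_t field_number, WireType wire_type) {
    assert(field_number >= 1);  // field numbers start at 1
    return (field_number << 3) | static_cast<uint32_t>(wire_type);
}

// The low 3 bits hold the wire type, the rest hold the field number.
uint32_t tag_field_number(uint32_t tag) { return tag >> 3; }
WireType tag_wire_type(uint32_t tag) { return static_cast<WireType>(tag & 0x7u); }

// Usage sketch for the Person message above.
void check_tag_examples() {
    assert(make_tag(1, kVarint) == 0x08u);           // id:   0000 1000
    assert(make_tag(2, kLengthDelimited) == 0x12u);  // name: 0001 0010
    assert(tag_field_number(0x12u) == 2u);
    assert(tag_wire_type(0x12u) == kLengthDelimited);
}
```

The tag is itself written as a Varint, so any field number from 1 to 15 yields a one-byte tag.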

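The number of each wire type is listed in the following table:

    Type  Meaning           Used for
    0     Varint            int32, int64, uint32, uint64, sint32, sint64, bool, enum
    1     64-bit            fixed64, sfixed64, double
    2     Length-delimited  string, bytes, embedded messages, packed repeated fields
    3     Start group       groups (deprecated)
    4     End group         groups (deprecated)
    5     32-bit            fixed32, sfixed32, float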

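Note that string data (the third row of the table above, wire type 2) has a variable length, so the T-V structure is not enough for it. A Length field giving the number of bytes has to be added, which makes it a T-L-V structure.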
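To make the two layouts concrete, here is a hand-rolled C++ sketch that writes a T-V field and a T-L-V field. It is an illustration under simplifying assumptions, not the code Protobuf generates, and every function name in it is hypothetical. With the field numbering name = 1, id = 2, email = 3 it reproduces the 20-byte dump shown in section 4.

```cpp
#include <cassert>
#include <cstdint>
#include <string>
#include <vector>

typedef std::vector<uint8_t> Bytes;

// Append `value` as a Varint (7 data bits per byte, low group first).
void append_varint(Bytes& out, uint64_t value) {
    while (value >= 0x80) {
        out.push_back(static_cast<uint8_t>((value & 0x7F) | 0x80));
        value >>= 7;
    }
    out.push_back(static_cast<uint8_t>(value));
}

// T-V: tag (wire type 0), then the Varint-encoded value.
void append_int32_field(Bytes& out, uint32_t field_number, int32_t value) {
    append_varint(out, (field_number << 3) | 0u);
    append_varint(out, static_cast<uint64_t>(static_cast<int64_t>(value)));
}

// T-L-V: tag (wire type 2), then the byte length as a Varint, then the bytes.
void append_string_field(Bytes& out, uint32_t field_number, const std::string& value) {
    append_varint(out, (field_number << 3) | 2u);
    append_varint(out, value.size());
    out.insert(out.end(), value.begin(), value.end());
}

// Person from this section: int32 id = 1; string name = 2;
Bytes encode_person(int32_t id, const std::string& name) {
    Bytes out;
    append_int32_field(out, 1, id);
    append_string_field(out, 2, name);
    return out;
}

// Field numbering matching the dump in section 4:
// string name = 1; int32 id = 2; string email = 3;
Bytes encode_person_with_email(const std::string& name, int32_t id,
                               const std::string& email) {
    Bytes out;
    append_string_field(out, 1, name);
    append_int32_field(out, 2, id);
    append_string_field(out, 3, email);
    return out;
}

// Usage sketch.
void check_tlv_examples() {
    const Bytes expected = {0x08, 0x01, 0x12, 0x04, 'j', 'o', 'j', 'o'};
    assert(encode_person(1, "jojo") == expected);
    assert(encode_person_with_email("jojo", 1, "123@qq.com").size() == 20);
}
```

Most of the saving over JSON comes from here: each field name is replaced by a one-byte tag, and the number 1 is a single byte rather than text surrounded by quotes and punctuation.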

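9. Protobuf's Reflection Mechanism

Protobuf has a powerful reflection mechanism of its own: it can construct a concrete Message object from a type name. Chen Shuo's article "A Google Protobuf Network Transmission Scheme with Automatic Message-Type Reflection" analyzes the reflection mechanism of GPB (Google Protocol Buffers) in detail and reads through the source. Here we analyze the reflection mechanism of the protobuf-objectivec version through its source code.

Chen Shuo analyzes the class structure of protobuf in detail. The key class of its reflection mechanism is Descriptor:

Each concrete Message type corresponds to one Descriptor object. Although we never call its functions directly, Descriptor plays an important bridging role in "creating a Message object of a concrete type from a type name".

From the C++ source of GPB, Chen Shuo also works out how reflection proceeds: the DescriptorPool class looks up a Descriptor pointer from the type name, and the MessageFactory class then constructs the concrete Message object from that Descriptor instance.

Sample code: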

 
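    #include <string>
    #include <google/protobuf/descriptor.h>  // Descriptor, DescriptorPool
    #include <google/protobuf/message.h>     // Message, MessageFactory

    using namespace google::protobuf;

    // Create an empty Message of the type named `typeName`, or NULL if unknown.
    Message* createMessage(const std::string& typeName)
    {
        Message* message = NULL;
        // Step 1: type name -> Descriptor
        const Descriptor* descriptor =
            DescriptorPool::generated_pool()->FindMessageTypeByName(typeName);
        if (descriptor)
        {
            // Step 2: Descriptor -> prototype (default instance)
            const Message* prototype =
                MessageFactory::generated_factory()->GetPrototype(descriptor);
            if (prototype)
            {
                // Step 3: prototype -> new, empty object of the same type
                message = prototype->New();
            }
        }
        return message;
    }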

Note:

DescriptorPool contains all protobuf Message types linked into the program at compile time.
MessageFactory can create all protobuf Message types linked into the program at compile time.
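The lookup chain (type name, then Descriptor, then prototype, then new object) can be imitated without the protobuf library. The sketch below is only an analogy: MiniDescriptor, MiniMessage, MiniRegistry, the Person stand-in and the type name "demo.Person" are all hypothetical, and one registry class plays the roles of both DescriptorPool and MessageFactory.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <string>

// Hypothetical stand-in for protobuf's Descriptor.
struct MiniDescriptor {
    std::string full_name;
};

// Hypothetical stand-in for protobuf's Message.
class MiniMessage {
public:
    virtual ~MiniMessage() {}
    virtual MiniMessage* New() const = 0;                  // empty object of the same type
    virtual const MiniDescriptor* descriptor() const = 0;
};

// Plays the roles of DescriptorPool (name -> descriptor)
// and MessageFactory (descriptor -> prototype).
class MiniRegistry {
public:
    static MiniRegistry& instance() {
        static MiniRegistry registry;
        return registry;
    }
    void add(const MiniMessage* prototype) {
        const MiniDescriptor* d = prototype->descriptor();
        by_name_[d->full_name] = d;
        prototypes_[d] = prototype;
    }
    const MiniDescriptor* find_message_type_by_name(const std::string& name) const {
        auto it = by_name_.find(name);
        return it == by_name_.end() ? nullptr : it->second;
    }
    const MiniMessage* get_prototype(const MiniDescriptor* d) const {
        auto it = prototypes_.find(d);
        return it == prototypes_.end() ? nullptr : it->second;
    }
private:
    std::map<std::string, const MiniDescriptor*> by_name_;
    std::map<const MiniDescriptor*, const MiniMessage*> prototypes_;
};

// A concrete message type with one shared descriptor and one prototype.
class Person : public MiniMessage {
public:
    Person() {}
    MiniMessage* New() const override { return new Person(); }
    const MiniDescriptor* descriptor() const override {
        static const MiniDescriptor d = {"demo.Person"};
        return &d;
    }
    static const Person* default_instance() {
        static const Person prototype;
        return &prototype;
    }
};

// Same three steps as createMessage(): name -> descriptor -> prototype -> New().
std::unique_ptr<MiniMessage> create_message(const std::string& type_name) {
    const MiniRegistry& registry = MiniRegistry::instance();
    const MiniDescriptor* descriptor = registry.find_message_type_by_name(type_name);
    if (!descriptor) return nullptr;
    const MiniMessage* prototype = registry.get_prototype(descriptor);
    return std::unique_ptr<MiniMessage>(prototype ? prototype->New() : nullptr);
}

// Usage sketch. Registration is explicit here; this example does not try to
// reproduce how the real generated code registers its types.
void check_reflection_example() {
    MiniRegistry::instance().add(Person::default_instance());
    std::unique_ptr<MiniMessage> m = create_message("demo.Person");
    assert(m && m->descriptor()->full_name == "demo.Person");
    assert(!create_message("demo.Unknown"));
}
```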

10. protobuf-objectivec as an Example

In an Objective-C environment, suppose there is a Message definition as follows:

    message Person {
      string name = 1;
      int32 id = 2;
      string email = 3;
    }

To decode binary data of this message type:

    Person *newP = [[Person alloc] initWithData:data error:nil];

This calls:

    - (instancetype)initWithData:(NSData *)data error:(NSError **)errorPtr {
      return [self initWithData:data extensionRegistry:nil error:errorPtr];
    }

which in turn calls another initializer:

    - (instancetype)initWithData:(NSData *)data
               extensionRegistry:(GPBExtensionRegistry *)extensionRegistry
                           error:(NSError **)errorPtr {
      if ((self = [self init])) {
        @try {
          [self mergeFromData:data extensionRegistry:extensionRegistry];
          //...
        }
        @catch (NSException *exception) {
          //...
        }
      }
      return self;
    }

With some defensive code and error handling removed, you can see that construction is ultimately done by the mergeFromData:extensionRegistry: method:

    - (void)mergeFromData:(NSData *)data extensionRegistry:(GPBExtensionRegistry *)extensionRegistry {
      // Build a stream object from the incoming `data`.
      GPBCodedInputStream *input = [[GPBCodedInputStream alloc] initWithData:data];
      // Merge from the stream object.
      [self mergeFromCodedInputStream:input extensionRegistry:extensionRegistry];
      // Validate.
      [input checkLastTagWas:0];
      [input release];
    }

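This method mainly does two things:

1) It constructs a GPBCodedInputStream instance from the incoming data;
2) It performs the merge through that stream object.

The job of GPBCodedInputStream is simple: it caches the source data and keeps a set of state information alongside it, such as size and lastTag.

Its data structure is very simple: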

 
    typedef struct GPBCodedInputStreamState {
      const uint8_t *bytes;
      size_t bufferSize;
      size_t bufferPos;

      // For parsing subsections of an input stream you can put a hard limit on
      // how much should be read. Normally the limit is the end of the stream,
      // but you can adjust it to anywhere, and if you hit it you will be at the
      // end of the stream, until you adjust the limit.
      size_t currentLimit;
      int32_t lastTag;
      NSUInteger recursionDepth;
    } GPBCodedInputStreamState;

    @interface GPBCodedInputStream () {
     @package
      struct GPBCodedInputStreamState state_;
      NSData *buffer_;
    }
    @end

The merge operation is more complex internally. It first obtains the Descriptor instance of the current Message object. This Descriptor mainly holds the Descriptor of the Message's source file and the Descriptor of each field. The merge then loops over the fields of the Message and assigns a value to each one.

A simplified definition of the Descriptor:

    @interface GPBDescriptor : NSObject <NSCopying>
    @property(nonatomic, readonly, strong, nullable) NSArray<GPBFieldDescriptor*> *fields;
    // Descriptors of the message's oneof groups.
    @property(nonatomic, readonly, strong, nullable) NSArray<GPBOneofDescriptor*> *oneofs;
    @property(nonatomic, readonly, assign) GPBFileDescriptor *file;
    @end

GPBFieldDescriptor is defined as follows:

    @interface GPBFieldDescriptor () {
     @package
      GPBMessageFieldDescription *description_;
      GPB_UNSAFE_UNRETAINED GPBOneofDescriptor *containingOneof_;

      SEL getSel_;
      SEL setSel_;
      SEL hasOrCountSel_;  // *Count for map<>/repeated fields, has* otherwise.
      SEL setHasSel_;
    }
    @end

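GPBMessageFieldDescription holds the information about a field, such as its data type, field type and field number. In addition, getSel_ and setSel_ are the selectors of the getter and setter of the property that corresponds to this field in the message class.

A simplified version of mergeFromCodedInputStream: is: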

 
    - (void)mergeFromCodedInputStream:(GPBCodedInputStream *)input
                    extensionRegistry:(GPBExtensionRegistry *)extensionRegistry {
      // Get the `Descriptor` instance of the current Message.
      GPBDescriptor *descriptor = [self descriptor];
      // syntax identifies the syntax version of the .proto file (proto2/proto3).
      GPBFileSyntax syntax = descriptor.file.syntax;
      // Current position.
      NSUInteger startingIndex = 0;
      // All fields of the current Message.
      NSArray *fields = descriptor->fields_;

      // Decode loop.
      for (NSUInteger i = 0; i < fields.count; ++i) {
        // Get the `FieldDescriptor` at the current position.
        GPBFieldDescriptor *fieldDescriptor = fields[startingIndex];
        // Check the type of the current field.
        GPBFieldType fieldType = fieldDescriptor.fieldType;
        if (fieldType == GPBFieldTypeSingle) {
          // `MergeSingleFieldFromCodedInputStream` decodes the data of a Single field.
          MergeSingleFieldFromCodedInputStream(self, fieldDescriptor, syntax, input, extensionRegistry);
          // Advance the current position by 1.
          startingIndex += 1;
        } else if (fieldType == GPBFieldTypeRepeated) {
          // ...
          // Decoding of Repeated fields.
        } else {
          // ...
          // Decoding of other field types.
        }
      }  // for(i < numFields)
    }

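As you can see, the descriptor here is obtained directly from a method on the Message object rather than constructed through a factory:

    GPBDescriptor *descriptor = [self descriptor];

    // Definition of the `descriptor` method
    - (GPBDescriptor *)descriptor {
      return [[self class] descriptor];
    }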

The descriptor class method used here is actually implemented by each concrete subclass of GPBMessage.

For example, in the Person message, the descriptor method is defined as follows:

    + (GPBDescriptor *)descriptor {
      static GPBDescriptor *descriptor = nil;
      if (!descriptor) {
        static GPBMessageFieldDescription fields[] = {
          {
            .name = "name",
            .dataTypeSpecific.className = NULL,
            .number = Person_FieldNumber_Name,
            .hasIndex = 0,
            .offset = (uint32_t)offsetof(Person__storage_, name),
            .flags = GPBFieldOptional,
            .dataType = GPBDataTypeString,
          },
          //...
          // Every field gets its own `GPBMessageFieldDescription` here.
        };
        // A `Descriptor` object is built here from `fields` and a series of other parameters.
        GPBDescriptor *localDescriptor = ...;
        descriptor = localDescriptor;
      }
      return descriptor;
    }

Next, once the Descriptor of the Message has been built, all of its fields are traversed and decoded. A different decoding function is called depending on the fieldType.

For example, for fieldType == GPBFieldTypeSingle, the decoding function for Single fields is called:

    MergeSingleFieldFromCodedInputStream(self, fieldDescriptor, syntax, input, extensionRegistry);

Internally, MergeSingleFieldFromCodedInputStream provides a series of macros that decode the data for each data type:

    #define CASE_SINGLE_POD(NAME, TYPE, FUNC_TYPE)                             \
        case GPBDataType##NAME: {                                              \
          TYPE val = GPBCodedInputStreamRead##NAME(&input->state_);            \
          GPBSet##FUNC_TYPE##IvarWithFieldInternal(self, field, val, syntax);  \
          break;                                                               \
        }
    #define CASE_SINGLE_OBJECT(NAME)                                           \
        case GPBDataType##NAME: {                                              \
          id val = GPBCodedInputStreamReadRetained##NAME(&input->state_);      \
          GPBSetRetainedObjectIvarWithFieldInternal(self, field, val, syntax); \
          break;                                                               \
        }

        CASE_SINGLE_POD(Int32, int32_t, Int32)
        ...

    #undef CASE_SINGLE_POD
    #undef CASE_SINGLE_OBJECT

For int32 data, for example, the function int32_t GPBCodedInputStreamReadInt32(GPBCodedInputStreamState *state) is eventually called to read the value that will be assigned.

Its implementation is simply the decoding of a Varint:

    int32_t GPBCodedInputStreamReadInt32(GPBCodedInputStreamState *state) {
      int32_t value = ReadRawVarint32(state);
      return value;
    }
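As a rough idea of what a reader built on such a state struct looks like, here is a C++ sketch. It is not the library's ReadRawVarint32: InputState is a hypothetical, trimmed-down counterpart of GPBCodedInputStreamState, and read_raw_byte, read_raw_varint32 and read_tag are names made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>

// Hypothetical, trimmed-down counterpart of GPBCodedInputStreamState.
struct InputState {
    const uint8_t* bytes;
    std::size_t bufferSize;
    std::size_t bufferPos;
    int32_t lastTag;
};

// Read one raw byte and advance the position; fail loudly on truncated input.
uint8_t read_raw_byte(InputState* state) {
    if (state->bufferPos >= state->bufferSize) {
        throw std::out_of_range("truncated input");
    }
    return state->bytes[state->bufferPos++];
}

// Read a Varint and keep its low 32 bits. Up to 10 bytes are accepted because
// negative int32 values are written sign-extended to 64 bits.
int32_t read_raw_varint32(InputState* state) {
    uint64_t result = 0;
    for (int shift = 0; shift < 64; shift += 7) {
        const uint8_t b = read_raw_byte(state);
        result |= static_cast<uint64_t>(b & 0x7F) << shift;
        if ((b & 0x80) == 0) {
            return static_cast<int32_t>(static_cast<uint32_t>(result));
        }
    }
    throw std::runtime_error("malformed varint");
}

// Read the next tag, or 0 at the end of the buffer (0 is never a valid tag).
int32_t read_tag(InputState* state) {
    if (state->bufferPos >= state->bufferSize) {
        state->lastTag = 0;
        return 0;
    }
    state->lastTag = read_raw_varint32(state);
    return state->lastTag;
}

// Usage sketch: one field with number 2, wire type 0 and value 300.
void check_reader_example() {
    const uint8_t data[] = {0x10, 0xAC, 0x02};
    InputState state = {data, sizeof(data), 0, 0};
    assert(read_tag(&state) == 0x10);
    assert(read_raw_varint32(&state) == 300);
    assert(read_tag(&state) == 0 && state.lastTag == 0);  // clean end of input
}
```

The last assertion is this sketch's analogue of the [input checkLastTagWas:0] call in mergeFromData: shown earlier: once the whole buffer has been consumed, the last tag read is 0.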

Once the data has been decoded into an int32_t, GPBSetInt32IvarWithFieldInternal is called to perform the assignment. Its simplified implementation is:

    void GPBSetInt32IvarWithFieldInternal(GPBMessage *self,
                                          GPBFieldDescriptor *field,
                                          int32_t value,
                                          GPBFileSyntax syntax) {
      // The final assignment.
      // `self` here is a `GPBMessage` instance.
      uint8_t *storage = (uint8_t *)self->messageStorage_;
      int32_t *typePtr = (int32_t *)&storage[field->description_->offset];
      *typePtr = value;
    }
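The offset-based assignment can be tried out in isolation. The C++ sketch below is an analogy only: PersonStorage, FieldDescription, kPersonFields and the score field are hypothetical and are not taken from the generated Person code. It shows how a field number can be turned into a write at a byte offset recorded with offsetof.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical storage struct, in the spirit of the generated Person__storage_.
struct PersonStorage {
    int32_t id;
    int32_t score;  // hypothetical second field, added only for illustration
};

// Hypothetical minimal field description: field number plus byte offset.
struct FieldDescription {
    uint32_t number;
    uint32_t offset;
};

static const FieldDescription kPersonFields[] = {
    {2, static_cast<uint32_t>(offsetof(PersonStorage, id))},
    {4, static_cast<uint32_t>(offsetof(PersonStorage, score))},
};

// Look up the description of a field by its number, or nullptr if unknown.
const FieldDescription* find_field(uint32_t number) {
    for (const FieldDescription& f : kPersonFields) {
        if (f.number == number) return &f;
    }
    return nullptr;
}

// Write `value` into the storage block at the offset recorded for the field.
// memcpy is used instead of a pointer cast to stay clear of aliasing problems.
void set_int32_field(void* storage, const FieldDescription& field, int32_t value) {
    uint8_t* base = static_cast<uint8_t*>(storage);
    std::memcpy(base + field.offset, &value, sizeof(value));
}

// Read the value back from the same offset.
int32_t get_int32_field(const void* storage, const FieldDescription& field) {
    int32_t value = 0;
    std::memcpy(&value, static_cast<const uint8_t*>(storage) + field.offset, sizeof(value));
    return value;
}

// Usage sketch: assign by field number without naming the struct member.
void check_offset_store_example() {
    PersonStorage storage = {0, 0};
    const FieldDescription* id_field = find_field(2);
    assert(id_field != nullptr);
    set_int32_field(&storage, *id_field, 1);
    assert(storage.id == 1);
    assert(storage.score == 0);  // the neighbouring field is untouched
    assert(find_field(99) == nullptr);
}
```

Because the decoder only needs a field number and an offset, a single generic routine can fill in any message type, which is what lets the descriptor drive decoding.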