Protocol Buffers (2): Encoding and Decoding

Date: 2019-11-07

Message Structure

In the previous article we noted that, for a serialized byte stream, one important question has to be answered: which bytes belong to which data member?

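Each field in a message is declared in the following format:
required/optional/repeated FieldType FieldName = FieldNumber (a unique number in the current message)
During serialization, each field becomes one key-value pair, and the whole binary file is simply a tightly packed sequence of such key-value pairs. The key is also called the tag. The original post showed a figure from Encoding and Evolution at this point to give an intuitive picture (figure not reproduced here).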

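The key is encoded from two parts, the wire type and the field number: key = (field_number << 3) | wire_type. The field_number part identifies which data member the pair belongs to, and it is how the data members in the generated .cc and .h files are matched to the current key-value pair.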
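To make the key layout concrete, here is a minimal sketch of packing and unpacking a key. The helper names MakeKey, KeyFieldNumber and KeyWireType are hypothetical and chosen for illustration only; they are not the names used inside the Protobuf library.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helpers (not the protobuf library API) that illustrate
// key = (field_number << 3) | wire_type.
constexpr std::uint32_t MakeKey(std::uint32_t field_number,
                                std::uint32_t wire_type) {
  return (field_number << 3) | wire_type;
}

// The field number occupies everything above the low 3 bits.
constexpr std::uint32_t KeyFieldNumber(std::uint32_t key) { return key >> 3; }

// The wire type is the low 3 bits.
constexpr std::uint32_t KeyWireType(std::uint32_t key) { return key & 0x7; }

// Field 1 with wire type 2 (length-delimited) gives key 10 (0x0A).
static_assert(MakeKey(1, 2) == 10, "field 1, wire type 2");
// Field 2 with wire type 0 (varint) gives key 16 (0x10).
static_assert(MakeKey(2, 0) == 16, "field 2, wire type 0");
```

Because the wire type always sits in the low 3 bits, a decoder can recover both parts of a key with one shift and one mask.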

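The lowest 3 bits of the key are the wire type. What is a wire type? It is one of the following:

Wire type  Meaning            Used for
0          Varint             int32, int64, uint32, uint64, sint32, sint64, bool, enum
1          64-bit             fixed64, sfixed64, double
2          Length-delimited   string, bytes, embedded messages, packed repeated fields
3          Start group        groups (deprecated)
4          End group          groups (deprecated)
5          32-bit             fixed32, sfixed32, float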

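The wire type exists mainly to solve one problem: how to know the length (in bytes) of the value that follows. If

wire type = 0, 1, or 5, the encoding is key + data. There is a single value, possibly spanning several bytes, and its end can be found without a length prefix: a varint (wire type 0) marks its own last byte, while wire types 1 and 5 are always 8 and 4 bytes long respectively.
wire type = 2, the encoding is key + length + data. The length gives the size of the data in bytes; the data may hold several values, laid out in order after the length.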
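The rule above can be sketched as a function that, given a wire type and the position just past a key, reports how many bytes the value occupies. ValueSize is a hypothetical helper written for this article, not a Protobuf API, and it deliberately rejects the deprecated group wire types (3 and 4).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper, not a protobuf API. Computes how many bytes the value
// starting at buf[pos] occupies; for wire type 2 the result includes the
// length prefix itself. Returns false on truncated input or for the
// unsupported group wire types.
bool ValueSize(std::uint32_t wire_type, const std::vector<std::uint8_t>& buf,
               std::size_t pos, std::size_t* size) {
  assert(size != nullptr);
  if (pos > buf.size()) return false;
  const std::size_t remaining = buf.size() - pos;
  switch (wire_type) {
    case 0:  // varint: scan for the first byte whose MSB is 0
      for (std::size_t i = 0; i < remaining && i < 10; ++i) {
        if ((buf[pos + i] & 0x80) == 0) {
          *size = i + 1;
          return true;
        }
      }
      return false;
    case 1:  // fixed 64-bit value
      *size = 8;
      return remaining >= 8;
    case 5:  // fixed 32-bit value
      *size = 4;
      return remaining >= 4;
    case 2: {  // length-delimited: a varint length, then that many bytes
      std::uint64_t length = 0;
      for (std::size_t i = 0; i < remaining && i < 10; ++i) {
        length |= static_cast<std::uint64_t>(buf[pos + i] & 0x7F) << (7 * i);
        if ((buf[pos + i] & 0x80) == 0) {
          if (length > remaining - (i + 1)) return false;  // truncated data
          *size = (i + 1) + static_cast<std::size_t>(length);
          return true;
        }
      }
      return false;
    }
    default:
      return false;
  }
}
```

A decoder that does not recognize a field number can use exactly this kind of size calculation to skip the value and continue with the next key.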

A Glimpse of the Decoding Code

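Next, let's look directly at example.pb.cc and the related source code to see how key-value pairs are parsed. Decoding is relatively simple, and once the decoding process is understood, encoding becomes fairly obvious.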

// example.proto
package example;

message Person {
  required string name = 1;
  required int32 id = 2;
  optional string email = 3;
}

// in example.pb.cc
bool Person::MergePartialFromCodedStream(
    ::google::protobuf::io::CodedInputStream* input) {
#define DO_(EXPRESSION) if (!PROTOBUF_PREDICT_TRUE(EXPRESSION)) goto failure
  ::google::protobuf::uint32 tag;
  // @@protoc_insertion_point(parse_start:example.Person)
  for (;;) {
    ::std::pair<::google::protobuf::uint32, bool> p = input->ReadTagWithCutoffNoLastTag(127u);
    tag = p.first;
    if (!p.second) goto handle_unusual;
    switch (::google::protobuf::internal::WireFormatLite::GetTagFieldNumber(tag)) {
      // required string name = 1;
      case 1: {
        if (static_cast< ::google::protobuf::uint8>(tag) == (10 & 0xFF)) { // 10 = (1 << 3) + 2
          DO_(::google::protobuf::internal::WireFormatLite::ReadString(
                input, this->mutable_name()));
          ::google::protobuf::internal::WireFormat::VerifyUTF8StringNamedField(
            this->name().data(), static_cast<int>(this->name().length()),
            ::google::protobuf::internal::WireFormat::PARSE,
            "example.Person.name");
        } else {
          goto handle_unusual;
        }
        break;
      }

      // required int32 id = 2;
      case 2: {
        if (static_cast< ::google::protobuf::uint8>(tag) == (16 & 0xFF)) { // 16 = (2 << 3) + 0
          HasBitSetters::set_has_id(this);
          DO_((::google::protobuf::internal::WireFormatLite::ReadPrimitive<
                   ::google::protobuf::int32, ::google::protobuf::internal::WireFormatLite::TYPE_INT32>(
                 input, &id_)));
        } else {
          goto handle_unusual;
        }
        break;
      }

      // optional string email = 3;
      case 3: {
        if (static_cast< ::google::protobuf::uint8>(tag) == (26 & 0xFF)) { // 26 = (3 << 3) + 2
          DO_(::google::protobuf::internal::WireFormatLite::ReadString(
                input, this->mutable_email()));
          ::google::protobuf::internal::WireFormat::VerifyUTF8StringNamedField(
            this->email().data(), static_cast<int>(this->email().length()),
            ::google::protobuf::internal::WireFormat::PARSE,
            "example.Person.email");
        } else {
          goto handle_unusual;
        }
        break;
      }

      default: {
      handle_unusual:
        if (tag == 0) {
          goto success;
        }
        DO_(::google::protobuf::internal::WireFormat::SkipField(
              input, tag, _internal_metadata_.mutable_unknown_fields()));
        break;
      }
    }
  }
success:
  // @@protoc_insertion_point(parse_success:example.Person)
  return true;
failure:
  // @@protoc_insertion_point(parse_failure:example.Person)
  return false;
#undef DO_
}

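The whole function loops over the input stream. Each time it reads a tag (key), it calls the parsing function that matches the wire type and the field's data type. For a string it calls ReadString, which eventually calls ReadBytesToString; for an int32 it calls ReadPrimitive, which in turn calls ReadVarint32. In other words, the generated example.pb.cc decides which parsing function to call for which tag, extracts the value from the input stream, and assigns it to the corresponding member variable, while the code that actually does the parsing lives in the Protobuf library source, shown below: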

// in wire_format_lite.cc
inline static bool ReadBytesToString(io::CodedInputStream* input,
                                     string* value) {
  uint32 length;
  return input->ReadVarint32(&length) &&
      input->InternalReadStringInline(value, length);
}

// in wire_format_lite.h
template <>
inline bool WireFormatLite::ReadPrimitive<int32, WireFormatLite::TYPE_INT32>(
    io::CodedInputStream* input,
    int32* value) {
  uint32 temp;
  if (!input->ReadVarint32(&temp)) return false;
  *value = static_cast<int32>(temp);
  return true;
}

// in coded_stream.h
inline bool CodedInputStream::ReadVarint32(uint32* value) {
  uint32 v = 0;
  if (PROTOBUF_PREDICT_TRUE(buffer_ < buffer_end_)) {
    v = *buffer_;
    if (v < 0x80) {
      *value = v;
      Advance(1);
      return true;
    }
  }
  int64 result = ReadVarint32Fallback(v);
  *value = static_cast<uint32>(result);
  return result >= 0;
}

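As shown, when the tag is for an int32, the data that follows is read directly as a varint; when the tag is for a string, a Varint32 length is read first, followed by length bytes of data.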
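The same control flow can be reproduced in a few dozen lines without the Protobuf library. The sketch below is a hand-written stand-in for the generated parser of the Person message: PersonSketch, ReadVarint, ReadLengthDelimited and ParsePerson are hypothetical names, and unlike the real generated code it rejects unknown tags instead of skipping them.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for the generated Person class.
struct PersonSketch {
  std::string name;
  std::int32_t id = 0;
  std::string email;
};

// Reads one varint at *pos and advances *pos past it on success.
bool ReadVarint(const std::vector<std::uint8_t>& buf, std::size_t* pos,
                std::uint64_t* value) {
  std::uint64_t result = 0;
  for (int shift = 0; shift < 64 && *pos < buf.size(); shift += 7) {
    const std::uint8_t byte = buf[(*pos)++];
    result |= static_cast<std::uint64_t>(byte & 0x7F) << shift;
    if ((byte & 0x80) == 0) {
      *value = result;
      return true;
    }
  }
  return false;
}

// Reads a varint length, then that many bytes, into *out.
bool ReadLengthDelimited(const std::vector<std::uint8_t>& buf,
                         std::size_t* pos, std::string* out) {
  std::uint64_t length = 0;
  if (!ReadVarint(buf, pos, &length)) return false;
  if (length > buf.size() - *pos) return false;  // truncated payload
  out->assign(reinterpret_cast<const char*>(buf.data() + *pos),
              static_cast<std::size_t>(length));
  *pos += static_cast<std::size_t>(length);
  return true;
}

// Loops over key-value pairs and dispatches on the tag value.
bool ParsePerson(const std::vector<std::uint8_t>& buf, PersonSketch* person) {
  assert(person != nullptr);
  std::size_t pos = 0;
  while (pos < buf.size()) {
    std::uint64_t tag = 0;
    if (!ReadVarint(buf, &pos, &tag)) return false;
    switch (tag) {
      case 10:  // (1 << 3) | 2: string name
        if (!ReadLengthDelimited(buf, &pos, &person->name)) return false;
        break;
      case 16: {  // (2 << 3) | 0: int32 id
        std::uint64_t raw = 0;
        if (!ReadVarint(buf, &pos, &raw)) return false;
        person->id = static_cast<std::int32_t>(raw);
        break;
      }
      case 26:  // (3 << 3) | 2: string email
        if (!ReadLengthDelimited(buf, &pos, &person->email)) return false;
        break;
      default:  // simplification: a real parser would skip unknown fields
        return false;
    }
  }
  return true;
}
```

Feeding it the bytes 0a 03 42 6f 62 10 96 01 fills in name = "Bob" and id = 150.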

Varint shows up repeatedly here: the length is a varint, and the stored int32 value is a varint too. So what is a varint?

varint

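A varint is a variable-length encoding that represents an integer using one or more bytes. In principle it can encode arbitrarily large integers; small integers take fewer bytes and large integers take more. If small integers occur more often, varint encoding yields compact storage.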

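The highest bit of each byte in a varint is called the most significant bit (MSB). If that bit is 0, the byte is the last byte of the current integer; if it is 1, at least one more byte follows. The end of a varint is therefore self-describing.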
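A minimal sketch of this scheme for unsigned 64-bit values is given below. EncodeVarint and DecodeVarint are hypothetical helper names; the Protobuf library implements the same idea inside CodedOutputStream and CodedInputStream with more optimized code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Emits 7 bits per byte, least significant group first. Every byte except
// the last has its MSB set to 1.
std::vector<std::uint8_t> EncodeVarint(std::uint64_t value) {
  std::vector<std::uint8_t> out;
  while (value >= 0x80) {
    out.push_back(static_cast<std::uint8_t>((value & 0x7F) | 0x80));
    value >>= 7;
  }
  out.push_back(static_cast<std::uint8_t>(value));
  return out;
}

// Decodes one varint starting at in[offset]. On success stores the value and
// the number of bytes consumed; fails on truncated or over-long input.
bool DecodeVarint(const std::vector<std::uint8_t>& in, std::size_t offset,
                  std::uint64_t* value, std::size_t* consumed) {
  assert(value != nullptr && consumed != nullptr);
  std::uint64_t result = 0;
  for (std::size_t i = 0; i < 10 && offset + i < in.size(); ++i) {
    const std::uint8_t byte = in[offset + i];
    result |= static_cast<std::uint64_t>(byte & 0x7F) << (7 * i);
    if ((byte & 0x80) == 0) {  // MSB clear: this is the last byte
      *value = result;
      *consumed = i + 1;
      return true;
    }
  }
  return false;
}
```

For example, EncodeVarint(300) yields the two bytes AC 02, and any value below 128 is encoded as a single byte equal to the value itself.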

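In Protobuf, both the tag and the length are varint-encoded. The length and the field_number inside the tag are both positive integers that fit in 32 bits. A note on the tag: its low 3 bits are the wire type, and if the tag fits in a single byte, the highest bit is 0, which leaves only 4 bits for the field_number, covering 1 to 15. A field_number of 16 or above needs at least 2 bytes, so frequently used fields should be given field numbers from 1 to 15.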
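The 1-to-15 rule follows directly from the arithmetic, which the short sketch below checks. VarintSize and TagSize are hypothetical helper names used only for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Number of bytes a value occupies when varint-encoded (7 payload bits each).
int VarintSize(std::uint64_t value) {
  int size = 1;
  while (value >= 0x80) {
    value >>= 7;
    ++size;
  }
  return size;
}

// Number of bytes needed for the tag of a given field number and wire type.
int TagSize(std::uint32_t field_number, std::uint32_t wire_type) {
  assert(field_number >= 1 && wire_type <= 7);
  return VarintSize((static_cast<std::uint64_t>(field_number) << 3) |
                    wire_type);
}
```

TagSize(15, 0) is 1 while TagSize(16, 0) is 2, and by the same arithmetic field numbers up to 2047 still fit in a 2-byte tag.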

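Take the positive integer 150, for example. Its varint encoding (least significant group first, i.e. little-endian) is as follows: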

// proto file
message Test1 {
  optional int32 a = 1;
}

// c++ file
// set a = 150

// binary file, in hex
// 08 96 01

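Here 08 is the key, (1 << 3) | 0, meaning field number 1 with wire type 0, and 96 01 is the varint encoding of 150, which decodes as follows:

96 01 = 1001 0110  0000 0001
     -> 001 0110   000 0001     (drop the MSB of each byte)
     -> 000 0001   001 0110     (reverse the 7-bit groups, since the least significant group comes first)
     -> 1001 0110               (concatenate)
     =  128 + 16 + 4 + 2 = 150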

For more on varints, see the Wikipedia article Variable-length quantity.

At this point half of the key-value encoding is settled; the value part remains. Next, let's see how Protobuf encodes the data itself.

Integers and Floating-Point Numbers in Protobuf

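Integers in Protobuf are also varint-encoded. Stripping the MSB from each byte and concatenating the remaining 7-bit groups (least significant group first) yields a buffer of several bytes; how that buffer is interpreted depends on the specific data type.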

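For int32 and int64, non-negative values are encoded directly as varints. Negative int32 or int64 values are sign-extended to 64 bits and always encoded as a 10-byte varint (two's complement).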
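The 10-byte cost of negative values is easy to see in a small sketch. EncodeInt32 below is a hypothetical helper that mirrors the described behavior (sign-extend to 64 bits, then varint-encode); it is not the library's own function.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Encodes an int32 the way the int32 field type is described above:
// sign-extend to 64 bits, reinterpret as unsigned, then varint-encode.
std::vector<std::uint8_t> EncodeInt32(std::int32_t value) {
  std::uint64_t bits =
      static_cast<std::uint64_t>(static_cast<std::int64_t>(value));
  std::vector<std::uint8_t> out;
  while (bits >= 0x80) {
    out.push_back(static_cast<std::uint8_t>((bits & 0x7F) | 0x80));
    bits >>= 7;
  }
  out.push_back(static_cast<std::uint8_t>(bits));
  return out;
}
```

With this helper, 1 takes a single byte, while -1 becomes nine FF bytes followed by 01, because all 64 bits are set and 64 bits need ten 7-bit groups.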

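sint32 and sint64 use ZigZag encoding instead, as shown in the table below:

Signed original    Encoded as
0                  0
-1                 1
1                  2
-2                 3
2147483647         4294967294
-2147483648        4294967295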

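A sint32 n is encoded as the varint of (n << 1) ^ (n >> 31), and a sint64 n as the varint of (n << 1) ^ (n >> 63), where >> is an arithmetic shift. This way, integers with a small absolute value need only a few bytes.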
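The two formulas translate almost literally into code. The sketch below uses hypothetical function names and assumes that right-shifting a negative signed integer is an arithmetic shift, which holds on mainstream compilers and is guaranteed from C++20 onward.

```cpp
#include <cassert>
#include <cstdint>

// ZigZag maps signed integers to unsigned ones so that values with a small
// absolute value get small codes: 0, -1, 1, -2, ... -> 0, 1, 2, 3, ...
// The left shift is done on the unsigned type to avoid signed overflow.
std::uint32_t ZigZagEncode32(std::int32_t n) {
  return (static_cast<std::uint32_t>(n) << 1) ^
         static_cast<std::uint32_t>(n >> 31);
}

std::uint64_t ZigZagEncode64(std::int64_t n) {
  return (static_cast<std::uint64_t>(n) << 1) ^
         static_cast<std::uint64_t>(n >> 63);
}

// Inverse mapping: the lowest bit carries the sign.
std::int32_t ZigZagDecode32(std::uint32_t n) {
  return static_cast<std::int32_t>((n >> 1) ^ (0u - (n & 1u)));
}

std::int64_t ZigZagDecode64(std::uint64_t n) {
  return static_cast<std::int64_t>((n >> 1) ^ (0ull - (n & 1ull)));
}
```

The result of ZigZagEncode32 or ZigZagEncode64 is then written as an ordinary varint, so -1 costs one byte instead of the ten bytes it would need as an int32.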

Floating-point numbers use wire type 1 (double) or 5 (float) and are stored directly in little-endian byte order.
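As a rough illustration, the sketch below produces the value bytes of a float and a double field. EncodeFloat and EncodeDouble are hypothetical names, and the code assumes the platform uses IEEE 754 binary32 and binary64, which is the case on common hardware.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <vector>

// Value bytes of a float field (wire type 5): 4 bytes, little-endian.
std::vector<std::uint8_t> EncodeFloat(float value) {
  static_assert(sizeof(float) == 4, "assumes a 32-bit float");
  std::uint32_t bits = 0;
  std::memcpy(&bits, &value, sizeof(bits));  // reinterpret without UB
  std::vector<std::uint8_t> out(4);
  for (int i = 0; i < 4; ++i) {
    out[i] = static_cast<std::uint8_t>(bits >> (8 * i));  // low byte first
  }
  return out;
}

// Value bytes of a double field (wire type 1): 8 bytes, little-endian.
std::vector<std::uint8_t> EncodeDouble(double value) {
  static_assert(sizeof(double) == 8, "assumes a 64-bit double");
  std::uint64_t bits = 0;
  std::memcpy(&bits, &value, sizeof(bits));
  std::vector<std::uint8_t> out(8);
  for (int i = 0; i < 8; ++i) {
    out[i] = static_cast<std::uint8_t>(bits >> (8 * i));
  }
  return out;
}
```

Extracting bytes with shifts rather than copying memory directly keeps the output little-endian regardless of the host's byte order.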

Length-Delimited Types

There are three main kinds: string, nested messages, and packed repeated fields. They all share the encoding tag + length + data and differ only in the data part.

A string is encoded as key + length + characters; the figure at the start of the article already makes this clear.

Nested messages are also simple: the encoding of the nested message is placed directly after the length, as shown below:

// proto file
message Test1 {
  optional int32 a = 1;
}
message Test3 {
  optional Test1 c = 3;
}

// cpp file
// set a = 150

// message Test3 binary file, in hex
// 1a 03 08 96 01

Here 1a is the key of c, 03 is the length of c, and the following 08 96 01 is the key + value of a.
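These five bytes can be reproduced by hand. The sketch below serializes the inner message first and then wraps it with a key and a length; AppendVarint, EncodeTest1 and EncodeTest3 are hypothetical helpers written for this example, not generated code.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Appends value as a varint (7 bits per byte, least significant group first).
void AppendVarint(std::vector<std::uint8_t>* out, std::uint64_t value) {
  assert(out != nullptr);
  while (value >= 0x80) {
    out->push_back(static_cast<std::uint8_t>((value & 0x7F) | 0x80));
    value >>= 7;
  }
  out->push_back(static_cast<std::uint8_t>(value));
}

// Test1 { optional int32 a = 1; } -> key 0x08 followed by the varint of a.
std::vector<std::uint8_t> EncodeTest1(std::int32_t a) {
  std::vector<std::uint8_t> out;
  AppendVarint(&out, (1u << 3) | 0u);  // field 1, wire type 0
  AppendVarint(&out, static_cast<std::uint64_t>(static_cast<std::int64_t>(a)));
  return out;
}

// Test3 { optional Test1 c = 3; } -> key 0x1a, length, then the bytes of c.
std::vector<std::uint8_t> EncodeTest3(const std::vector<std::uint8_t>& c) {
  std::vector<std::uint8_t> out;
  AppendVarint(&out, (3u << 3) | 2u);  // field 3, wire type 2
  AppendVarint(&out, c.size());        // length of the nested message
  out.insert(out.end(), c.begin(), c.end());
  return out;
}
```

Calling EncodeTest3(EncodeTest1(150)) yields 1a 03 08 96 01. The inner message has to be sized before the outer one can write its length, which is why serializers typically compute byte sizes before writing.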

Packed repeated fields are repeated fields of varint, 32-bit, or 64-bit types that are declared with [packed=true] in proto2; in proto3, such repeated fields are packed by default, as shown below:

// in proto2
message Test4 {
  repeated int32 d = 4 [packed=true];
}

// in proto3
message Test4 {
  repeated int32 d = 4;
}

// 3, 270, 86942 are stored packed as follows, in hex
22        // key (field number 4, wire type 2), 0x22 = 34 = (4 << 3) + 2
06        // payload size (6 bytes), length
03        // first element (varint 3)
8E 02     // second element (varint 270)
9E A7 05  // third element (varint 86942)

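The 6 bytes split automatically into 3 values according to the varint MSBs. In Protobuf, such packed repeated fields are held in a RepeatedField object, which supports get-by-index, set-by-index, and add (append an element) operations.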
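The splitting step can be sketched as a loop that keeps reading varints until the payload is exhausted. DecodePackedInt32 is a hypothetical helper that works on the payload only (the bytes after the key and the length), and it uses a plain std::vector in place of RepeatedField.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Splits a packed payload into int32 elements by following the varint MSBs.
// Returns false if the payload ends in the middle of a varint.
bool DecodePackedInt32(const std::vector<std::uint8_t>& payload,
                       std::vector<std::int32_t>* out) {
  assert(out != nullptr);
  std::size_t pos = 0;
  while (pos < payload.size()) {
    std::uint64_t value = 0;
    bool terminated = false;
    for (int shift = 0; shift < 64 && pos < payload.size(); shift += 7) {
      const std::uint8_t byte = payload[pos++];
      value |= static_cast<std::uint64_t>(byte & 0x7F) << shift;
      if ((byte & 0x80) == 0) {  // MSB clear: element complete
        terminated = true;
        break;
      }
    }
    if (!terminated) return false;
    out->push_back(static_cast<std::int32_t>(value));
  }
  return true;
}
```

Applied to the payload 03 8E 02 9E A7 05 it produces the three elements 3, 270 and 86942.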