Writing binary-handling code in Erlang is a genuinely pleasant experience: the bit syntax is expressive enough that you forget its occasional inconveniences. In this post, let's talk about binary data handling in Erlang.
In Erlang, a bit string denotes an untyped region of memory and is written with the bit syntax; when the number of bits it contains is a multiple of 8, it is called a binary. Don't underestimate bit strings: being able to address data at an arbitrary number of bits is enormously convenient when parsing protocols. Imagine what parsing a binary protocol would look like without this facility. (Early Erlang apparently did not provide bit strings; fortunately, by the time we needed them, they were there.)
Let's start with a series of demos in the Erlang shell. Note: put spaces around the = sign to avoid syntax errors:
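To see why bit-level addressing matters for protocol parsing, here is a minimal sketch. The layout and the name parse_header are invented for illustration (not any real protocol): a 4-bit version, a 4-bit flag field, an 8-bit payload length, then the payload itself, all pulled apart in a single pattern match:

```erlang
%% Hypothetical layout (for illustration only): 4-bit version, 4-bit flags,
%% 8-bit payload length, then Len bytes of payload, then everything else.
parse_header(<<Version:4, Flags:4, Len:8, Payload:Len/binary, Rest/binary>>) ->
    {Version, Flags, Payload, Rest}.
```

Note how Len, matched earlier in the same pattern, is used to size the Payload segment; `parse_header(<<1:4, 2:4, 3, "abc", "xyz">>)` yields `{1, 2, <<"abc">>, <<"xyz">>}`.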
1> Bin1 = <<1,17,42>>. %% try <<M:8,P:8,Q:8>> = <<1,17,42>>.
<<1,17,42>>
2> Bin2 = <<"abc">>. %嘗試一下 <<"ABC">> == <<"A","B","C">>.
<<97,98,99>> % 嘗試一下 <<"abc"/utf8>> == <<$a/utf8,$b/utf8,$c/utf8>>.
3> Bin3 = <<1,17,42:16>>. %% 42 fits in 8 bits, but :16 forces a 16-bit field, hence the 0 byte
<<1,17,0,42>>
4> <<A,B,C:16>> = <<1,17,42:16>>.
<<1,17,0,42>>
5> C.
42
6> <<D:16,E,F>> = <<1,17,42:16>>. % 256*1+17=273
<<1,17,0,42>>
7> D.
273
8> F.
42
9> <<G,H/binary>> = <<1,17,42:16>>.
<<1,17,0,42>>
10> H.
<<17,0,42>>
%% f() makes the shell forget bound variables
11> <<G,H/bitstring>> = <<1,17,42:12>>.
<<1,17,2,10:4>>
12> H.
<<17,2,10:4>>
13> <<1024/utf8>>.
<<208,128>>
14> << P,Q/bitstring >> = <<1:1,12:7,3:3>>.
<<140,3:3>>
15> << 1:1,0:3>>.
<<8:4>>
16> <<B1/binary,B2/binary>> = << 8,16>>.
* 1: a binary field without size is only allowed at the end of a binary pattern
The error occurs because we did not give B1 a size:
In matching, this default value is only valid for the very last element. All other bit string or binary elements in the matching must have a size specification.
Let's fix the code and retry:
27> <<B1:2/binary,B2/binary>> = << 8,16>>.
<<8,16>>
28> B1.
<<8,16>>
29> B2.
<<>>
30> <<B3:1/binary,B4/binary>> = << 8,16>>.
<<8,16>>
31> B3.
<<"\b">>
32> B4.
<<16>>
We can even try bit string comprehensions; by analogy with list comprehensions, they are not hard to understand:
33> << <<(X*2)>> || <<X>> <= <<1,2,3>> >>.
<<2,4,6>>
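Going the other way, a comprehension with a bitstring generator can also take a binary apart one bit at a time, for example decomposing a byte into its individual bits:

```erlang
34> [B || <<B:1>> <= <<2#10110010>>].
[1,0,1,1,0,0,1,0]
```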
The bits mentioned above come up constantly when parsing network protocols and file formats, for example in mochi's erl_img project: https://github.com/mochi/erl_img/blob/master/src/image_gif.erl
read(Fd, IMG, RowFun, St0) ->
    file:position(Fd, 6),
    case file:read(Fd, 7) of
        {ok, <<_Width:16/little, _Hight:16/little,
               Map:1, _Cr:3, Sort:1, Pix:3,
               Background:8,
               AspectRatio:8>>} ->
            Palette = read_palette(Fd, Map, Pix+1),
            ?dbg("sizeof(palette)=~p Map=~w, Cr=~w, Sort=~w, Pix=~w\n",
                 [length(Palette), Map, _Cr, Sort, Pix]),
            ?dbg("Background=~w, AspectRatio=~w\n",
                 [Background, AspectRatio]),
            As = [{'Background', Background},
                  {'AspectRatio', AspectRatio},
                  {'Sort', Sort} | IMG#erl_image.attributes],
            IMG1 = IMG#erl_image{palette = Palette, attributes = As},
            read_data(Fd, IMG1, RowFun, St0, []);
        Error ->
            Error
    end.
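As a smaller, self-contained variant of the erl_img code (a sketch of ours, not taken from the project): a GIF file begins with a 6-byte signature ("GIF87a" or "GIF89a") followed by the little-endian logical screen width and height, which is exactly why read/4 above seeks to position 6 before reading the screen descriptor:

```erlang
%% Sketch: validate the GIF signature and pull out the screen size.
%% The function name gif_size is ours, chosen for illustration.
gif_size(<<"GIF8", _Ver, $a, Width:16/little, Height:16/little, _/binary>>) ->
    {Width, Height};
gif_size(_) ->
    {error, not_a_gif}.
```

For example, `gif_size(<<"GIF89a", 640:16/little, 480:16/little, 0,0,0>>)` returns `{640,480}`.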
Having worked through these demos, we now have a basic feel for Erlang's expressive power over binary data. Below is the specification of the bit syntax; a few fundamentals in it are worth pinning down:
<<>>
<<E1,...,En>>
Ei = Value |
Value:Size |
Value/TypeSpecifierList |
Value:Size/TypeSpecifierList
Type= integer | float | binary | bytes | bitstring | bits | utf8 | utf16 | utf32
Signedness= signed | unsigned (only meaningful for integer values; the default is unsigned)
Endianness= big | little | native (the default is big)
Unit= unit:IntegerLiteral
unit is the size, in bits, of each unit in a segment; the allowed range is 1..256
Size * unit gives the number of bits the segment occupies, and for binaries it must be evenly divisible by 8
unit is usually used to ensure byte alignment.

The default Type is integer. bytes is shorthand for binary; bits is shorthand for bitstring.

The type states how the binary data is to be used, and how data is used determines what it means.

Unit ranges over 1..256; the default is 1 for integer, float and bitstring, and 8 for binary. The utf8, utf16 and utf32 types must not have a unit.
Size is the number of units the segment occupies. Its default depends on the type: 8 for integers, 64 for floats.
Endianness defaults to big; it only matters when Type is integer, utf16, utf32 or float, since it determines how the bytes are read. There is also the native option, which resolves to big or little at run time depending on the CPU the Erlang VM runs on.
The number of bits a segment occupies is computed as: Size * unit = number of bits
6> <<25:4/unit:8>>.
<<0,0,0,25>>
7> <<25:2/unit:16>>.
<<0,0,0,25>>
8> <<25:1/unit:32>>.
<<0,0,0,25>>
The TypeSpecifierList is built by joining specifiers with a hyphen (-). Any specifier that is omitted gets its default value.
The specification above raises two questions:
[Question 1] Type includes utf8, utf16 and utf32, and the documentation has the following to say about them. How should we read it?
For the utf8, utf16, and utf32 types, Size must not be given. The size of the segment is implicitly determined by the type and value itself.
For utf8, Value will be encoded in 1 through 4 bytes. For utf16, Value will be encoded in 2 or 4bytes. Finally, for utf32, Value will always be encoded in 4 bytes.
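A quick way to see these sizes in the shell (code point 16#4E2D is the CJK character 中; 16#10400 lies outside the Basic Multilingual Plane, so UTF-16 needs a surrogate pair for it):

```erlang
1> byte_size(<<$a/utf8>>).        %% ASCII: 1 byte
1
2> byte_size(<<16#4E2D/utf8>>).   %% CJK character: 3 bytes
3
3> byte_size(<<16#10400/utf8>>).  %% outside the BMP: 4 bytes
4
4> byte_size(<<16#10400/utf16>>). %% surrogate pair: 4 bytes
4
5> byte_size(<<16#4E2D/utf32>>).  %% utf32 is always 4 bytes
4
```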
[Question 2] What is endianness? What do big-endian and little-endian mean?
Native-endian means that the endianness will be resolved at load time to be either big-endian or little-endian,
depending on what is native for the CPU that the Erlang machine is run on. Endianness only matters when the Type is either integer, utf16, utf32, or float. The default is big.
The first question:
Let's work through this from the beginning; thanks to Wikipedia for the thorough background:
A byte has 8 bits and can represent 256 states. The most familiar encoding, ASCII (American Standard Code for Information Interchange), is the most widely used single-byte encoding; it defines the mapping between 128 characters and their bit patterns. Since the lower 7 bits of a byte are enough, the highest bit is always 0. In Erlang we can obtain a character's ASCII value with the $ sign. Wikipedia on ASCII: http://zh.wikipedia.org/wiki/Ascii
Wikipedia's description makes ASCII's limitations clear: its expressive power is limited to modern English and is plainly insufficient for other languages. The obvious first move was to press the unused high bit into service to express more symbols, so that 0-127 mean the same everywhere while 128-255 vary by language. This simple extension scheme is called EASCII, and it just about covers the Western European languages. The EASCII story is here: http://zh.wikipedia.org/wiki/EASCII
EASCII immediately brings to mind the problem of Chinese characters, of which there are far too many for a single byte. The Chinese encoding we know best is GB2312, which uses two bytes to encode 6763 Chinese characters, covering 99.75% of those in daily use in mainland China; rare characters found in personal names and classical Chinese, as well as traditional characters, are not covered, which later gave rise to the GBK and GB18030 encodings. I remember a university classmate named 孟龑; the character 龑 is not in GB2312's repertoire. GB2312 trivia is here: http://zh.wikipedia.org/wiki/Gb2312
The same binary data interpreted under different encodings yields different symbols, and decoding with the wrong encoding produces garbled text; the ideal solution is a single unified encoding standard. Unicode is an industry standard in computer science designed to represent, process and encode most of the world's writing systems in a uniform way. Having read the material above, one question naturally arises: how many bytes does Unicode use per character? Unicode only assigns each symbol a code point; it does not prescribe how that code point is represented. A character's Unicode code point is fixed, but in actual transmission, because different platforms' designs differ and to save space, the code points are realized in different ways. These realizations are called Unicode Transformation Formats (UTF), and UTF-8 is one of them.
UTF-8 stores Unicode code points in a variable number of bytes. If a file containing only basic 7-bit ASCII characters were transmitted with every character in the original 2-byte Unicode encoding, the first byte of each pair would always be 0, which is quite wasteful. UTF-8 addresses this: it is a variable-length encoding that keeps the basic 7-bit ASCII characters in a single byte (high bit 0), while other Unicode characters are transformed by a fixed algorithm into sequences of one to four bytes whose leading bits identify their length. Documents dominated by 7-bit ASCII therefore shrink considerably (see the UTF-8 article for the exact scheme). Unicode: http://zh.wikipedia.org/wiki/Unicode UTF-8: http://zh.wikipedia.org/wiki/UTF-8
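The stdlib's unicode module does these conversions for us; here is a round trip that shows the variable length directly (the input list holds raw code points):

```erlang
1> Utf8 = unicode:characters_to_binary([$a, $b, 16#4E2D]).
<<97,98,228,184,173>>
2> byte_size(Utf8). %% two 1-byte ASCII characters + one 3-byte CJK character
5
3> unicode:characters_to_list(Utf8).
[97,98,20013]
```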
And there we have the answer to the first question: because these are variable-length encodings, the type and the value together determine the number of bytes occupied. While we're at it, a BOM problem once encountered when generating files is worth mentioning:
What is a BOM? The byte-order mark (BOM) is the name of the Unicode character at code point U+FEFF. When a string of UCS/Unicode characters is encoded as UTF-16 or UTF-32, this character is used to mark the byte order, and it is often used as a marker that a file is encoded as UTF-8, UTF-16 or UTF-32. This article discusses handling the BOM in C# (reading, writing, and removing it): http://www.cnblogs.com/mgen/archive/2011/07/13/2105649.html
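In Erlang, stripping a UTF-8 BOM is a one-line pattern match (the UTF-8 encoding of U+FEFF is the bytes EF BB BF). The function name strip_bom is ours; for detecting a BOM and its encoding, the stdlib also offers unicode:bom_to_encoding/1:

```erlang
%% Remove a leading UTF-8 BOM if present; otherwise return the binary as-is.
strip_bom(<<16#EF, 16#BB, 16#BF, Rest/binary>>) -> Rest;
strip_bom(Bin) when is_binary(Bin) -> Bin.
```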
The second question:
Wikipedia's entry on endianness: http://zh.wikipedia.org/wiki/%E5%AD%97%E8%8A%82%E5%BA%8F Amusingly, the term comes from Gulliver's Travels: in the novel, Lilliput argues over whether a boiled egg should be opened from the big end (Big-End) or the little end (Little-End), the two camps being called Big-endians and Little-endians. In 1980, Danny Cohen borrowed the term in his famous paper "On Holy Wars and a Plea for Peace" to settle an argument about the order in which bytes should be transmitted.
Endianness, also called byte order, is the order of the bytes that make up a multi-byte value; the typical cases are how integers are laid out in memory and the order in which they are transmitted over the network. Endianness can occasionally also refer to bit order.
Generally speaking, byte order tells you which byte of, say, a UCS-2 character is stored at the lower address. If the least significant byte (LSByte) comes before the most significant byte (MSByte), i.e. the LSB is at the lower address, the order is little-endian; otherwise it is big-endian. In network programming, byte order must be taken into account, because different processor architectures may use different byte orders, and in cross-platform code it can cause bugs that are hard to notice. Network transmission generally uses big-endian, which is therefore also called network byte order; the IP protocol defines big-endian as the network byte order.
With this, the passage from the Erlang documentation makes sense: network byte order is big-endian, and the default endianness in Erlang's bit syntax is also big, so when implementing network protocols we do not need to specify the option explicitly. Native-endian means the byte order is resolved at run time.
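The effect of the endianness option is easy to see in the shell: the same four bytes read back as very different integers depending on the byte order chosen.

```erlang
1> <<72:32/big>>.
<<0,0,0,72>>
2> <<72:32/little>>.
<<72,0,0,0>>
3> <<N:32/little>> = <<0,0,0,72>>, N. %% 72*16777216
1207959552
```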
Byte order is a general problem; see Lao Zhao's article: 淺談字節序(Byte Order)及其相關操作 (A brief discussion of byte order and related operations)
Finally, here is an answer to a question frequently asked in Erlang chat groups: how to parse a string out of binary data:
read_string(Bin) ->
case Bin of
<<Len:16, Bin1/binary>> ->
case Bin1 of
<<Str:Len/binary-unit:8, Rest/binary>> ->
{binary_to_list(Str), Rest};
_R1 ->
{[],<<>>}
end;
_R1 ->
{[],<<>>}
end.
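For a round trip, the matching writer is a one-liner. This write_string is our own sketch, mirroring read_string/1's 16-bit big-endian length prefix:

```erlang
%% Length-prefixed string writer: the inverse of read_string/1 above.
write_string(Str) ->
    Bin = list_to_binary(Str),
    <<(byte_size(Bin)):16, Bin/binary>>.
```

With both functions in scope, `read_string(write_string("hello"))` returns `{"hello", <<>>}`.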
Binary internals
- binaries and bitstrings share the same internal implementation
- internally, Erlang has four kinds of binary objects: two containers and two references
- the containers are refc binaries and heap binaries
- a refc binary consists of two parts: a ProcBin stored on the process heap, which holds the metadata of the binary (its location and a reference count), and the binary object itself, stored outside all process heaps
- the binary object living outside the process heaps can be referenced by any number of ProcBins from any number of processes; it carries a reference counter and is removed once the counter drops to zero
- all ProcBin objects are part of a linked list, so the GC can track them and decrement the reference counter when a ProcBin disappears
- heap binaries are small binaries, at most 64 bytes, stored directly on the process heap; they are copied on garbage collection and when sent as a message, and need no special treatment from the garbage collector
- there are two kinds of reference objects: sub binaries and match contexts
- a sub binary is created by split_binary/2 and when a binary is matched out in a pattern; it is a reference into part of another binary (refc or heap binary), and since no data is copied, matching a binary out is very cheap
- a match context is similar to a sub binary but optimized for binary matching; for instance, it holds a direct pointer to the binary data, and after matching a field out of the binary, only the pointer position is advanced.
Official documentation: http://www.erlang.org/doc/efficiency_guide/binaryhandling.html
Internally, binaries and bitstrings are implemented in the same way.
There are four types of binary objects internally. Two of them are containers for binary data and two of them are merely references to a part of a binary.
The binary containers are called refc binaries (short for reference-counted binaries) and heap binaries. Refc binaries consist of two parts: an object stored on the process heap, called a ProcBin, and the binary object itself stored outside all process heaps. The binary object can be referenced by any number of ProcBins from any number of processes; the object contains a reference counter to keep track of the number of references, so that it can be removed when the last reference disappears.
All ProcBin objects in a process are part of a linked list, so that the garbage collector can keep track of them and decrement the reference counters in the binary when a ProcBin disappears.
Heap binaries are small binaries, up to 64 bytes, that are stored directly on the process heap. They will be copied when the process is garbage collected and when they are sent as a message. They don't require any special handling by the garbage collector.
There are two types of reference objects that can reference part of a refc binary or heap binary. They are called sub binaries and match contexts.
A sub binary is created by split_binary/2 and when a binary is matched out in a binary pattern. A sub binary is a reference into a part of another binary (refc or heap binary, never into another sub binary). Therefore, matching out a binary is relatively cheap because the actual binary data is never copied.
A match context is similar to a sub binary, but is optimized for binary matching; for instance, it contains a direct pointer to the binary data. For each field that is matched out of a binary, the position in the match context will be incremented.
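In practice this means the common head-first traversal pattern is cheap: the compiler can keep a single match context alive across the recursive calls instead of building a sub binary per step (compile with the bin_opt_info option to check what it did). A small sketch:

```erlang
%% Sum all bytes of a binary. Each clause re-matches on the tail, which the
%% compiler can turn into match-context reuse rather than copying.
sum_bytes(Bin) -> sum_bytes(Bin, 0).

sum_bytes(<<B, Rest/binary>>, Acc) -> sum_bytes(Rest, Acc + B);
sum_bytes(<<>>, Acc) -> Acc.
```

For example, `sum_bytes(<<1,2,3>>)` returns `6`.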
Endianness
Possible values: big | little | native
Endianness only matters when the Type is either integer, utf16, utf32, or float. This has to do with how the system reads binary data. As an example, the BMP image header format holds the size of its file as an integer stored on 4 bytes. For a file that has a size of 72 bytes, a little-endian system would represent this as <<72,0,0,0>> and a big-endian one as <<0,0,0,72>>. One will be read as '72' while the other will be read as '1207959552', so make sure you use the right endianness. There is also the option to use 'native', which will choose at run-time if the CPU uses little-endianness or big-endianness natively. By default, endianness is set to 'big'.
Unit
written unit:Integer
This is the size of each segment, in bits. The allowed range is 1..256 and is set by default to 1 for integers, floats and bit strings and to 8 for binary. The utf8, utf16 and utf32 types require no unit to be defined. The multiplication of Size by Unit is equal to the number of bits the segment will take and must be evenly divisible by 8. The unit size is usually used to ensure byte-alignment.
The TypeSpecifierList is built by separating attributes by a '-'.
UTF-8 reference: http://www.zehnet.de/2005/02/12/unicode-utf-8-tutorial/
An Essay on Endian Order
http://people.cs.umass.edu/~verts/cs32/endian.html
Copyright (C) Dr. William T. Verts, April 19, 1996
Depending on which computing system you use, you will have to consider the byte order in which multibyte numbers are stored, particularly when you are writing those numbers to a file. The two orders are called "Little Endian" and "Big Endian".
The Basics
"Little Endian" means that the low-order byte of the number is stored in memory at the lowest address, and the high-order byte at the highest address. (The little end comes first.) For example, a 4 byte LongIntByte3 Byte2 Byte1 Byte0will be arranged in memory as follows:
Base Address+0 Byte0 Base Address+1 Byte1 Base Address+2 Byte2 Base Address+3 Byte3Intel processors (those used in PC's) use "Little Endian" byte order.
"Big Endian" means that the high-order byte of the number is stored in memory at the lowest address, and the low-order byte at the highest address. (The big end comes first.) Our LongInt, would then be stored as:
Base Address+0 Byte3 Base Address+1 Byte2 Base Address+2 Byte1 Base Address+3 Byte0Motorola processors (those used in Mac's) use "Big Endian" byte order.
Which is Better?
You may see a lot of discussion about the relative merits of the two formats, mostly religious arguments based on the relative merits of the PC versus the Mac. Both formats have their advantages and disadvantages. In "Little Endian" form, assembly language instructions for picking up a 1, 2, 4, or longer byte number proceed in exactly the same way for all formats: first pick up the lowest order byte at offset 0. Also, because of the 1:1 relationship between address offset and byte number (offset 0 is byte 0), multiple precision math routines are correspondingly easy to write.
In "Big Endian" form, by having the high-order byte come first, you can always test whether the number is positive or negative by looking at the byte at offset zero. You don't have to know how long the number is, nor do you have to skip over any bytes to find the byte containing the sign information. The numbers are also stored in the order in which they are printed out, so binary to decimal routines are particularly efficient.
What does that Mean for Us?
What endian order means is that any time numbers are written to a file, you have to know how the file is supposed to be constructed. If you write out a graphics file (such as a .BMP file) on a machine with "Big Endian" integers, you must first reverse the byte order, or a "standard" program to read your file won't work. The Windows .BMP format, since it was developed on a "Little Endian" architecture, insists on the "Little Endian" format. You must write your Save_BMP code this way, regardless of the platform you are using.
Common file formats and their endian order are as follows:
- Adobe Photoshop -- Big Endian
- BMP (Windows and OS/2 Bitmaps) -- Little Endian
- DXF (AutoCad) -- Variable
- GIF -- Little Endian
- IMG (GEM Raster) -- Big Endian
- JPEG -- Big Endian
- FLI (Autodesk Animator) -- Little Endian
- MacPaint -- Big Endian
- PCX (PC Paintbrush) -- Little Endian
- PostScript -- Not Applicable (text!)
- POV (Persistence of Vision ray-tracer) -- Not Applicable (text!)
- QTM (Quicktime Movies) -- Little Endian (on a Mac!)
- Microsoft RIFF (.WAV & .AVI) -- Both
- Microsoft RTF (Rich Text Format) -- Little Endian
- SGI (Silicon Graphics) -- Big Endian
- Sun Raster -- Big Endian
- TGA (Targa) -- Little Endian
- TIFF -- Both, Endian identifier encoded into file
- WPG (WordPerfect Graphics Metafile) -- Big Endian (on a PC!)
- XWD (X Window Dump) -- Both, Endian identifier encoded into file
Correcting for the Non-Native Order
It is pretty easy to reverse a multibyte integer if you find you need the other format. A single function can be used to switch from one to the other, in either direction. A simple and not very efficient version might look as follows:

    Function Reverse (N : LongInt) : LongInt ;
    Var B0, B1, B2, B3 : Byte ;
    Begin
       B0 := N Mod 256 ;  N := N Div 256 ;
       B1 := N Mod 256 ;  N := N Div 256 ;
       B2 := N Mod 256 ;  N := N Div 256 ;
       B3 := N Mod 256 ;
       Reverse := (((B0 * 256 + B1) * 256 + B2) * 256 + B3) ;
    End ;

A more efficient version that depends on the presence of hexadecimal numbers, bit masking operators AND, OR, and NOT, and shift operators SHL and SHR might look as follows:

    Function Reverse (N : LongInt) : LongInt ;
    Var B0, B1, B2, B3 : Byte ;
    Begin
       B0 := (N AND $000000FF) SHR 0 ;
       B1 := (N AND $0000FF00) SHR 8 ;
       B2 := (N AND $00FF0000) SHR 16 ;
       B3 := (N AND $FF000000) SHR 24 ;
       Reverse := (B0 SHL 24) OR (B1 SHL 16) OR (B2 SHL 8) OR (B3 SHL 0) ;
    End ;

There are certainly more efficient methods, some of which are quite machine and platform dependent. Use what works best. For more, see 堅強哥's blog.
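Back in Erlang, the whole Reverse function above collapses into a single bit-syntax expression: build the four bytes with one byte order and reinterpret them with the other. The function name swap32 is ours:

```erlang
%% Byte-swap a 32-bit integer, like the Pascal Reverse function above.
swap32(N) ->
    <<M:32/little>> = <<N:32/big>>,
    M.
```

For example, `swap32(72)` returns `1207959552`, and applying swap32 twice gives back the original value.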