Arithmetic Coding: A Detailed Walkthrough of a High-Quality Implementation

Date: 2019-11-13



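I won't go over every detail of arithmetic coding here. This article has the following three parts.

Two examples that show how to encode and decode with arithmetic coding (source: Arithmetic Coding for Data Compression);
Next, a way to show that the compression achieved by arithmetic coding approaches the entropy of the source (this part follows the HP paper Introduction to Arithmetic Coding - Theory and Practice);
Finally, a detailed explanation of an implementation (code source: ACM 1987, Arithmetic Coding for Data Compression).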

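1. An intuitive look at arithmetic coding

Encoding: map the string to a single number in the interval [0,1).

A brief explanation: at the start, the interval is split into several segments, each representing one symbol. To encode the symbol e, we zoom in on the segment that represents e and subdivide it in the same way, so each sub-interval again represents one symbol. We repeat this for every symbol. The interval we are left with at the end is the result, and any number inside it will do as the code.

The pseudocode looks like this: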
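As a stand-in for the pseudocode, here is a minimal C++ sketch of the narrowing loop described above. It uses plain `double` arithmetic, so it only works for short messages. The `Model` type and the function name `encode_interval` are hypothetical, introduced for illustration; they do not come from the referenced paper.

```cpp
#include <map>
#include <string>
#include <utility>

// Hypothetical model: each symbol owns a fixed slice [first, second) of [0, 1).
using Model = std::map<char, std::pair<double, double>>;

// Narrow [low, high) once per symbol. Any number inside the returned
// interval identifies the message, given the model and the message length.
std::pair<double, double> encode_interval(const std::string& msg, const Model& m) {
    double low = 0.0, high = 1.0;
    for (char c : msg) {
        const double range = high - low;
        const std::pair<double, double>& slice = m.at(c);  // throws if c is unknown
        high = low + range * slice.second;  // uses the old low, so update high first
        low  = low + range * slice.first;
    }
    return {low, high};
}
```

With a model that gives a the slice [0, 0.5), b the slice [0.5, 0.75) and c the slice [0.75, 1), the message "ab" narrows first to [0, 0.5) and then to [0.25, 0.375).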

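Decoding: recover the string from the encoded number.

The rough idea: at each step, check which sub-interval the number falls in and output the symbol that sub-interval represents. Then rescale the interval and the number, and repeat until every encoded symbol has been output.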
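The decoding loop can be sketched the same way. This is a minimal illustration under two assumptions that are not in the original text: the decoder is told how many symbols to produce (the full program later uses an EOF symbol instead), and the hypothetical `Model` type maps each symbol to its slice of [0, 1).

```cpp
#include <cstddef>
#include <map>
#include <string>
#include <utility>

// Hypothetical model: each symbol owns a fixed slice [first, second) of [0, 1).
using Model = std::map<char, std::pair<double, double>>;

// Decode `count` symbols from `value`, a number inside the final interval.
std::string decode_value(double value, std::size_t count, const Model& m) {
    std::string out;
    for (std::size_t i = 0; i < count; ++i) {
        for (const auto& entry : m) {
            const std::pair<double, double>& slice = entry.second;
            if (value >= slice.first && value < slice.second) {
                out.push_back(entry.first);
                // Rescale so the chosen slice becomes [0, 1) again.
                value = (value - slice.first) / (slice.second - slice.first);
                break;
            }
        }
    }
    return out;
}
```

Instead of shrinking the interval, this version stretches the number back to [0, 1) after each symbol, which is equivalent and keeps the lookup simple.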

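2. Proving the compression efficiency of arithmetic coding

First we need to know exactly what the encoder outputs; only then can we prove anything about it.

From the intuitive picture above, we know that encoding ends with a final interval, and that a value inside this interval represents the message. In practice, we output the leading bits shared by the lower and upper bounds of the interval. For example, if the final interval is [0.1010000, 0.1010011), the shared bits are 0.10100. For convenience we store only 10100 and drop the leading "0." part.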
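To make "shared leading bits" concrete, here is a small C++ sketch that extracts them from two bounds stored as fixed-width binary fractions. The function name `common_prefix` and the integer representation are assumptions made for this illustration.

```cpp
#include <cstdint>
#include <string>

// Return the leading bits shared by low and high, both read as `bits`-wide
// binary fractions (bit `bits - 1` is the first digit after the point).
std::string common_prefix(std::uint32_t low, std::uint32_t high, int bits) {
    std::string out;
    for (int i = bits - 1; i >= 0; --i) {
        const unsigned a = (low >> i) & 1u;
        const unsigned b = (high >> i) & 1u;
        if (a != b) break;            // first differing digit ends the prefix
        out.push_back(a ? '1' : '0');
    }
    return out;
}
```

Called with the 7-bit values 1010000 (0x50) and 1010011 (0x53), it returns 10100, matching the example above.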

Next comes the proof.
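The derivation itself is not reproduced here. The following is a sketch of the standard argument, assuming exact arithmetic and a model whose probabilities p(x) match the source; it is not necessarily the exact derivation used in the referenced paper. The final interval has width W, the product of the probabilities of the encoded symbols. Any interval of width W contains an aligned binary interval of length 2^{-L} as soon as 2^{-L} <= W/2, so L bits are enough to identify the message.

```latex
\begin{align*}
W &= \prod_{i=1}^{n} p(s_i)
  && \text{(width of the final interval)} \\
L &= \left\lceil -\log_2 W \right\rceil + 1 \;<\; -\log_2 W + 2
  && \text{(bits needed to name a binary interval inside it)} \\
\mathbb{E}[L] &< n\,H + 2, \qquad H = -\sum_{x} p(x)\,\log_2 p(x)
  && \text{(i.i.d. source with entropy } H\text{)}
\end{align*}
```

Dividing by n, the expected cost per symbol is below H + 2/n, so the overhead over the entropy vanishes as the message grows. A finite-precision implementation such as the one in the next part adds a small extra loss on top of this bound.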

3. The implementation in detail

Let's focus on how a single symbol is encoded. Here is the code first; its job is to encode one symbol.

   1:  static void bit_plus_follow(int);   /* Routine that follows                    */
   2:  static code_value low, high;    /* Ends of the current code region          */
   3:  static long bits_to_follow;     /* Number of opposite bits to output after */
   4:
   5:
   6:  void encode_symbol(int symbol,int cum_freq[])
   7:  {
   8:      long range;                 /* Size of the current code region          */
   9:      range = (long)(high-low)+1;
  10:
  11:      high = low + (range*cum_freq[symbol-1])/cum_freq[0]-1;  /* Narrow the code region to that allotted to this */
  12:      low = low + (range*cum_freq[symbol])/cum_freq[0];       /* symbol.                  */
  13:
  14:      for (;;)
  15:      {                                  /* Loop to output bits.     */
  16:          if (high<Half) {
  17:              bit_plus_follow(0);                 /* Output 0 if in low half. */
  18:          }
  19:          else if (low>=Half) {                   /* Output 1 if in high half.*/
  20:              bit_plus_follow(1);
  21:              low -= Half;
  22:              high -= Half;                       /* Subtract offset to top. */
  23:          }
  24:          else if (low>=First_qtr && high<Third_qtr) {  /* Output an opposite bit later if in middle half. */
  25:              bits_to_follow += 1;
  26:              low -= First_qtr;                   /* Subtract offset to middle*/
  27:              high -= First_qtr;
  28:          }
  29:          else break;                             /* Otherwise exit loop.     */
  30:          low = 2*low;
  31:          high = 2*high+1;                        /* Scale up code range.     */
  32:      }
  33:  }
  34:
  35:  static void bit_plus_follow(int bit)
  36:  {
  37:      output_bit(bit);                           /* Output the bit.           */
  38:      while (bits_to_follow>0) {
  39:          output_bit(!bit);                      /* Output bits_to_follow     */
  40:          bits_to_follow -= 1;                   /* opposite bits. Set        */
  41:      }                                          /* bits_to_follow to zero.   */
  42:  }

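Detailed explanation:

Lines 6-12 are a simple computation: given the symbol being encoded, find the sub-interval we need. In the pseudocode, the symbol would be fully encoded at this point and we would move on to the next one. In practice, though, the interval keeps shrinking with every step, so the number we have to store needs more and more digits. Can the machine hold it? This is a serious problem.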
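To see the shrinking in numbers, the narrowing step can be pulled out into a standalone function. This is a sketch: the `Range` struct and the name `narrow` are hypothetical, while the arithmetic is the same as in `encode_symbol`, with `cum_freq` stored in decreasing order and `cum_freq[0]` holding the total.

```cpp
// Hypothetical helper type: the current code region, both ends inclusive.
struct Range {
    long low;
    long high;
};

// Narrow r to the part allotted to `symbol` (1-based), exactly as in
// encode_symbol. cum_freq[0] is the total; cum_freq[symbol] is the sum
// of the frequencies of all symbols with a larger index.
Range narrow(Range r, int symbol, const int cum_freq[]) {
    const long range = (r.high - r.low) + 1;
    Range out;
    out.high = r.low + (range * cum_freq[symbol - 1]) / cum_freq[0] - 1;
    out.low  = r.low + (range * cum_freq[symbol]) / cum_freq[0];
    return out;
}
```

With three symbols of frequency 2, 1, 1, the table is {4, 2, 1, 0}. Starting from [0, 65535], encoding symbol 2 gives [16384, 32767], and encoding it again gives [20480, 24575]. Each step divides the width by four, so a 16-bit range is used up after only a few symbols unless it is rescaled.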

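The fix is this: once the upper and lower bounds of the interval agree in their leading bits, those bits never change in later encoding steps. For example, if the interval is [0.1101, 0.1111), every later interval still has the form [0.11~, 0.11~), with the same leading bits. So we can output those bits right away, which keeps the remaining range within machine precision. Lines 16-23 do exactly this.

Careful readers will have noticed lines 24-28 as well. What are they for?

Take the case where the interval is stuck around 0.5, say [0.01~, 0.10~). How do we handle that? If the interval stays like this, lines 16-23 can never do anything. This case can be handled too.

At this point the bounds look like the following; we ignore the shared leading bits, which lines 16-23 have already taken care of.

Look at this example first. Suppose the interval is [0.011, 0.110), which is [3/8, 6/8) on the number line. We expand the range [2/8, 6/8) to twice its size so that it fills [0, 1), and the original sub-interval becomes [2/8, 1) (see the figure below).

Notice that after the expansion, if the sub-interval for the next symbol falls in the upper half, that is, in [4/8, 1) on the right of the figure, then it corresponds to [4/8, 6/8) on the left. Numbers in that range start with the bits 10, so we output 10.

This example shows how to handle the general case.

First, count in bits_to_follow how many times [2/8, 6/8) is expanded to [0, 1), repeating until the interval no longer fits inside [2/8, 6/8) (which is guaranteed once its length exceeds 0.5).

Then encode the next symbol. If the interval lands in the upper half, output 10000, where the number of 0s is bits_to_follow.

If it lands in the lower half, output 01111, where the number of 1s is bits_to_follow. If it lies within [2/8, 6/8), keep expanding and keep incrementing bits_to_follow.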
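The [3/8, 6/8) example can be traced with a self-contained copy of the rescaling loop. This is a sketch for experimentation: the `State` struct and the names `emit` and `renormalize` are hypothetical, bits are collected in a string instead of a byte buffer, and the constants are the 16-bit values of First_qtr, Half and Third_qtr.

```cpp
#include <cstddef>
#include <string>

// 16-bit code values, matching First_qtr, Half and Third_qtr.
const long kFirstQtr = 16384, kHalf = 32768, kThirdQtr = 49152;

// Hypothetical state: code region, deferred-bit counter, bits emitted so far.
struct State {
    long low;
    long high;
    long pending;
    std::string bits;
};

// Emit `bit`, then the deferred opposite bits (what bit_plus_follow does).
void emit(State& s, int bit) {
    s.bits.push_back(bit ? '1' : '0');
    s.bits.append(static_cast<std::size_t>(s.pending), bit ? '0' : '1');
    s.pending = 0;
}

// Same three-way test as the loop in encode_symbol.
void renormalize(State& s) {
    for (;;) {
        if (s.high < kHalf) {
            emit(s, 0);                       // region in lower half
        } else if (s.low >= kHalf) {
            emit(s, 1);                       // region in upper half
            s.low -= kHalf;
            s.high -= kHalf;
        } else if (s.low >= kFirstQtr && s.high < kThirdQtr) {
            s.pending += 1;                   // region straddles 0.5: defer the bit
            s.low -= kFirstQtr;
            s.high -= kFirstQtr;
        } else {
            break;
        }
        s.low = 2 * s.low;
        s.high = 2 * s.high + 1;
    }
}
```

Starting from low = 24576 (3/8) and high = 49151 (just below 6/8), one call leaves low = 16384 (2/8), high = 65535 and pending = 1 with no output, which is the expansion to [2/8, 1) described above. If the next symbol then narrows the region to the upper half (low = 32768, high = 65535), a second call emits 10.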

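I recommend drawing the picture yourself to appreciate how neat this code is!

Here is the complete code. Many small details are subtle and worth studying on your own.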

#include <cstdio>
#include <cstdlib>
using namespace std;

#define Code_value_bits 16              /* Number of bits in a code value   */
typedef long code_value;                /* Type of an arithmetic code value */

#define Top_value (((long)1<<Code_value_bits)-1)      /* Largest code value */

#define First_qtr (Top_value/4+1)       /* Point after first quarter        */
#define Half      (2*First_qtr)         /* Point after first half           */
#define Third_qtr (3*First_qtr)         /* Point after third quarter        */

#define No_of_chars 256                 /* Number of character symbols      */
#define EOF_symbol (No_of_chars+1)      /* Index of EOF symbol              */

#define No_of_symbols (No_of_chars+1)   /* Total number of symbols          */

/* TRANSLATION TABLES BETWEEN CHARACTERS AND SYMBOL INDEXES. */

int char_to_index[No_of_chars];         /* To index from character          */
unsigned char index_to_char[No_of_symbols+1]; /* To character from index    */

/* CUMULATIVE FREQUENCY TABLE. */

#define Max_frequency 16383             /* Maximum allowed frequency count  */
                                        /*   2^14 - 1                       */
int cum_freq[No_of_symbols+1];          /* Cumulative symbol frequencies    */

/* Fixed frequency table, used here for simplicity. */
int freq[No_of_symbols+1] = {
    0,
    1,   1,   1,   1,   1,   1,   1,   1,   1,   1, 124,   1,   1,   1,   1,   1,
    1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,

    /*      !    "    #    $    %    &    '    (    )    *    +    ,    -    .    / */
    1236,   1, 21,   9,   3,   1, 25, 15,   2,   2,   2,   1, 79, 19, 60,   1,

    /* 0    1    2    3    4    5    6    7    8    9    :    ;    <    =    >    ? */
    15, 15,   8,   5,   4,   7,   5,   4,   4,   6,   3,   2,   1,   1,   1,   1,

    /* @    A    B    C    D    E    F    G    H    I    J    K    L    M    N    O */
    1, 24, 15, 22, 12, 15, 10,   9, 16, 16,   8,   6, 12, 23, 13, 11,

    /* P    Q    R    S    T    U    V    W    X    Y    Z    [    \    ]    ^    _ */
    14,   1, 14, 28, 29,   6,   3, 11,   1,   3,   1,   1,   1,   1,   1,   3,

    /* `    a    b    c    d    e    f    g    h    i    j    k    l    m    n    o */
    1, 491, 85, 173, 232, 744, 127, 110, 293, 418,   6, 39, 250, 139, 429, 446,

    /* p    q    r    s    t    u    v    w    x    y    z    {    |    }    ~      */
    111,   5, 388, 375, 531, 152, 57, 97, 12, 101,   5,   2,   1,   2,   3,   1,

    1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
    1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
    1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
    1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
    1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
    1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
    1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
    1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,   1,
    1
};

/* Holds the encoded bytes and links the encoder to the decoder.          */
/* The size is fixed at 100 for now; change it for longer inputs.         */
unsigned char code[100];
static int code_index=0;
static int decode_index=0;

/* 8-bit buffer that temporarily holds code bits. */
static int buffer;
/* Number of bits still free in the buffer; starts at 8. */
static int bits_to_go;
/* Number of bits read past the end of the code; these are garbage. */
static int garbage_bits;

/* Set up the frequency model: build the cumulative frequency ranges. */
void start_model(){
    int i;
    for (i = 0; i<No_of_chars; i++) {
        /* Tables for quick lookup in both directions. */
        char_to_index[i] = i+1;
        index_to_char[i+1] = i;
    }

    /* cum_freq[i-1] = freq[i] + ... + freq[257], and cum_freq[257] = 0. */
    cum_freq[No_of_symbols] = 0;
    for (i = No_of_symbols; i>0; i--) {
        cum_freq[i-1] = cum_freq[i] + freq[i];
    }
    /* This check enforces the upper limit on the total frequency.        */
    /* It matters later; it is commented out here.                        */
    //if (cum_freq[0] > Max_frequency);   /* Check counts within limit*/
}


/* Initialize the buffer so it is ready to receive code bits. */
void start_outputing_bits()
{
    buffer = 0;                                /* Buffer starts out empty.  */
    bits_to_go = 8;
}


void output_bit(int bit)
{
    /* For simplicity, each new bit enters at the top (0x80) and earlier  */
    /* bits shift toward the low end. Keep this in mind!                  */
    buffer >>= 1;
    if (bit) buffer |= 0x80;
    bits_to_go -= 1;
    /* When the buffer is full, store the byte. */
    if (bits_to_go==0) {
        code[code_index]=buffer;
        code_index++;

        bits_to_go = 8;                        /* Reset to 8.               */
    }
}


void done_outputing_bits()
{
    /* At the end of encoding, a partly filled buffer is padded with 0s. */
    code[code_index]=buffer>>bits_to_go;
    code_index++;
}



static void bit_plus_follow(int);   /* Routine that follows                    */
static code_value low, high;    /* Ends of the current code region          */
static long bits_to_follow;     /* Number of opposite bits to output after */


void start_encoding()
{
    for(int i=0;i<100;i++)code[i]='\0';

    low = 0;                            /* Full code range.                 */
    high = Top_value;
    bits_to_follow = 0;                 /* No bits to follow                */
}


void encode_symbol(int symbol,int cum_freq[])
{
    long range;                 /* Size of the current code region          */
    range = (long)(high-low)+1;

    high = low + (range*cum_freq[symbol-1])/cum_freq[0]-1;  /* Narrow the code region to that allotted to this */
    low = low + (range*cum_freq[symbol])/cum_freq[0];       /* symbol.                  */

    for (;;)
    {                                  /* Loop to output bits.     */
        if (high<Half) {
            bit_plus_follow(0);                 /* Output 0 if in low half. */
        }
        else if (low>=Half) {                   /* Output 1 if in high half.*/
            bit_plus_follow(1);
            low -= Half;
            high -= Half;                       /* Subtract offset to top. */
        }
        else if (low>=First_qtr && high<Third_qtr) {  /* Output an opposite bit later if in middle half. */
            bits_to_follow += 1;
            low -= First_qtr;                   /* Subtract offset to middle*/
            high -= First_qtr;
        }
        else break;                             /* Otherwise exit loop.     */
        low = 2*low;
        high = 2*high+1;                        /* Scale up code range.     */
    }
}

/* FINISH ENCODING THE STREAM. */

void done_encoding()
{
    bits_to_follow += 1;                       /* Output two bits that      */
    if (low<First_qtr) bit_plus_follow(0);     /* select the quarter that   */
    else bit_plus_follow(1);                   /* the current code range    */
}                                              /* contains.                 */


static void bit_plus_follow(int bit)
{
    output_bit(bit);                           /* Output the bit.           */
    while (bits_to_follow>0) {
        output_bit(!bit);                      /* Output bits_to_follow     */
        bits_to_follow -= 1;                   /* opposite bits. Set        */
    }                                          /* bits_to_follow to zero.   */
}



void encode(){
    start_model();                             /* Set up other modules.     */
    start_outputing_bits();
    start_encoding();
    for (;;) {                                 /* Loop through characters.  */
        int ch;
        int symbol;
        ch = getchar();                        /* Read the next character.  */
        /* For simplicity, a newline ends the input rather than EOF alone. */
        /* This does not change how the algorithm works.                   */
        if (ch=='\n' || ch==EOF) break;
        symbol = char_to_index[ch];            /* Translate to an index.    */
        encode_symbol(symbol,cum_freq);        /* Encode that symbol.       */

    }
    /* Encode the EOF symbol as the terminator. */
    encode_symbol(EOF_symbol,cum_freq);
    done_encoding();                           /* Send the last few bits.   */
    done_outputing_bits();

}


/* DECODING. */

static code_value value;        /* Currently-seen code value                */

void start_inputing_bits()
{
    bits_to_go = 0;                             /* Buffer starts out with   */
    garbage_bits = 0;                           /* no bits in it.           */
}


int input_bit()
{
    int t;

    if (bits_to_go==0) {
        buffer = code[decode_index];
        decode_index++;

        if (decode_index > code_index) {            /* Past the end of code */
            garbage_bits += 1;                      /* Return arbitrary bits*/
            if (garbage_bits>Code_value_bits-2) {   /* after eof, but check */
                fprintf(stderr,"Bad input file\n"); /* for too many such.   */
                // exit(-1);
            }
        }
        bits_to_go = 8;
    }
    /* Bits come back from the low end of the byte, which is the order    */
    /* in which output_bit stored them.                                   */
    t = buffer&1;                               /* Return the next bit from */
    buffer >>= 1;                               /* the bottom of the byte.  */
    bits_to_go -= 1;
    return t;
}

void start_decoding()
{
    int i;
    value = 0;                                  /* Input bits to fill the   */
    for (i = 1; i<=Code_value_bits; i++) {      /* code value.              */
        value = 2*value+input_bit();
    }

    low = 0;                                    /* Full code range.         */
    high = Top_value;
}


int decode_symbol(int cum_freq[])
{
    long range;                 /* Size of current code region              */
    int cum;                    /* Cumulative frequency calculated          */
    int symbol;                 /* Symbol decoded                           */
    range = (long)(high-low)+1;
    cum = (((long)(value-low)+1)*cum_freq[0]-1)/range;    /* Find cum freq for value. */

    for (symbol = 1; cum_freq[symbol]>cum; symbol++) ;    /* Then find symbol. */
    high = low + (range*cum_freq[symbol-1])/cum_freq[0]-1;  /* Narrow the code region to that allotted to this */
    low = low + (range*cum_freq[symbol])/cum_freq[0];       /* symbol.                  */

    for (;;) {                                  /* Loop to get rid of bits. */
        if (high<Half) {
            /* nothing */                       /* Expand low half.         */
        }
        else if (low>=Half) {                   /* Expand high half.        */
            value -= Half;
            low -= Half;                        /* Subtract offset to top. */
            high -= Half;
        }
        else if (low>=First_qtr && high<Third_qtr) {
            value -= First_qtr;
            low -= First_qtr;                   /* Subtract offset to middle*/
            high -= First_qtr;
        }
        else break;                             /* Otherwise exit loop.     */
        low = 2*low;
        high = 2*high+1;                        /* Scale up code range.     */
        value = 2*value+input_bit();            /* Move in next input bit.  */
    }
    return symbol;
}


void decode(){
    start_model();                              /* Set up other modules.    */
    start_inputing_bits();
    start_decoding();
    for (;;) {                                  /* Loop through characters. */
        int ch; int symbol;
        symbol = decode_symbol(cum_freq);       /* Decode next symbol.      */
        if (symbol==EOF_symbol) break;          /* Exit loop if EOF symbol. */
        ch = index_to_char[symbol];             /* Translate to a character.*/
        putc(ch,stdout);                        /* Write that character.    */
    }
}

int main()
{
    encode();
    decode();
    system("pause");                            /* Windows only: keep the console open. */
    return 0;
}