[IR] BWT+MTF+AC

 

BWT (Burrows–Wheeler transform): a data transformation algorithm

MTF (Move-to-front transform): a data transformation

Statistics-based compression algorithms: run-length encoding

A well-made slide deck: bwt_based_compression_verbin.ppt


 

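BWT Idea

Compression mostly works by finding repeated patterns and encoding them compactly.

BWT (Burrows–Wheeler transform) turns the original text into a similar text (a permutation of the original) in which identical characters tend to end up in consecutive or adjacent positions.

Other techniques such as the Move-to-front transform and run-length encoding (RLE) can then be used to compress the text.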

 

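A typical pipeline first applies the Burrows–Wheeler transform to produce a sequence with strong local correlation, then applies MTF to lower the entropy, and only then does the actual compression.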

 

Burrows–Wheeler transform + Run-length coding

Ori:rabcabcababaabacabcabcabcababaa$
BWT:aabbbbccacccrcbaaaaaaaaaabbbbba$
RLE:aab4ccac3rcba10b5a$
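
The RLE line above follows a simple convention: a run of three or more identical characters is written as the character followed by the run length, and shorter runs are copied literally. A minimal Python sketch of that convention follows. The function name `rle` and the `min_run` parameter are assumptions made for illustration, and the format is only unambiguous when the text itself contains no digits.

```python
from itertools import groupby

def rle(text: str, min_run: int = 3) -> str:
    """Run-length encode `text`, writing a run as <char><count>
    only when it has at least `min_run` characters."""
    out = []
    for ch, group in groupby(text):
        n = len(list(group))
        # Short runs are cheaper to copy literally than to count.
        out.append(f"{ch}{n}" if n >= min_run else ch * n)
    return "".join(out)

print(rle("aabbbbccacccrcbaaaaaaaaaabbbbba$"))  # aab4ccac3rcba10b5a$
```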

 

Encoding:

1. All rotations

#BANANAS
S#BANANA
AS#BANAN
NAS#BANA
ANAS#BAN
NANAS#BA
ANANAS#B
BANANAS#

2. Sort the rows and take the last column.

#BANANAS
ANANAS#B
ANAS#BAN
AS#BANAN
BANANAS#
NANAS#BA
NAS#BANA
S#BANANA

The result happens to be a square matrix (as many rows as columns).

ori=[#BANANAS]T

bwt=[SBNN#AAA]T
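
The two encoding steps (build all rotations, sort them, read the last column) can be sketched in a few lines of Python. This is the naive quadratic-space version, meant only to mirror the matrices above. The function name `bwt` is an assumption, and the sketch relies on `#` sorting before the uppercase letters in code-point order.

```python
def bwt(text: str) -> str:
    """Naive Burrows-Wheeler transform: sort all rotations of `text`
    and return the last column of the sorted matrix."""
    rotations = [text[i:] + text[:i] for i in range(len(text))]
    rotations.sort()  # plain lexicographic (code-point) order
    return "".join(row[-1] for row in rotations)

print(bwt("#BANANAS"))  # SBNN#AAA
```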

 

Decoding:

start with len(bwt) empty rows

repeat len(bwt) times {

  "prepend the bwt column to the rows, then sort the rows lexicographically"

}

Result:

#BANANAS
ANANAS#B
ANAS#BAN
AS#BANAN
BANANAS#
NANAS#BA
NAS#BANA
S#BANANA
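
The decoding loop can be written out directly: start from empty rows, prepend the BWT column, sort, and repeat once per character. The Python sketch below rebuilds the whole sorted matrix this way. The name `inverse_bwt_table` is an assumption, and picking the row that starts with `#` assumes `#` is a unique marker for the start of the text, as in the BANANAS example.

```python
def inverse_bwt_table(last: str) -> list[str]:
    """Rebuild the sorted rotation matrix from its last column by
    repeatedly prepending that column and re-sorting the rows."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(ch + row for ch, row in zip(last, table))
    return table

table = inverse_bwt_table("SBNN#AAA")
# The original text is the rotation that begins with the marker.
original = next(row for row in table if row.startswith("#"))
print(original)  # #BANANAS
```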

 


 

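MTF Idea:

As shown above, BWT turns the text into a sequence with strong local correlation.

MTF exploits exactly this locality to lower the entropy.

That is, recently seen characters always sit near the front of the "recently used symbols" list, so if the input has good locality, the output contains many small numbers such as "0" or "1".

In the list below, every character that is used gets moved to the head position, so the indices that are emitted stay small and take less space.

This sets things up for the next compression step.

The main idea is that each symbol in the data is replaced by its index in the stack of "recently used symbols".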

For example, long sequences of identical symbols are replaced by as many zeroes, whereas when a symbol that has not been used in a long time appears, it is replaced with a large number.

Thus at the end the data is transformed into a sequence of integers; if the data exhibits a lot of local correlations, then these integers tend to be small.

 

Encoding:

First build a list the size of the alphabet, the "recently used symbols" list. Only the 26 lowercase letters a~z are considered here.

From: https://en.wikipedia.org/wiki/Move-to-front_transform

Input Stream  Sequence          List                          Operation
[b]ananaaa    1                 (abcdefghijklmnopqrstuvwxyz)  move b to head
b[a]nanaaa    1,1               (bacdefghijklmnopqrstuvwxyz)  move a to head
ba[n]anaaa    1,1,13            (abcdefghijklmnopqrstuvwxyz)  move n to head
ban[a]naaa    1,1,13,1          (nabcdefghijklmopqrstuvwxyz)  move a to head
bana[n]aaa    1,1,13,1,1        (anbcdefghijklmopqrstuvwxyz)  move n to head
banan[a]aa    1,1,13,1,1,1      (nabcdefghijklmopqrstuvwxyz)  move a to head
banana[a]a    1,1,13,1,1,1,0    (anbcdefghijklmopqrstuvwxyz)  move a to head
bananaa[a]    1,1,13,1,1,1,0,0  (anbcdefghijklmopqrstuvwxyz)  move a to head

Final

1,1,13,1,1,1,0,0

(anbcdefghijklmopqrstuvwxyz)

 

ori=[b a n  a n a a a]T

mtf=[1 1 13 1 1 1 0 0]T
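
The encoding table can be reproduced with a short Python sketch: look up the symbol's current index, emit it, then move the symbol to the head of the list. The function name `mtf_encode` and the default lowercase alphabet are assumptions chosen to match the a~z setup described above.

```python
import string

def mtf_encode(text: str, alphabet: str = string.ascii_lowercase) -> list[int]:
    """Move-to-front encode `text` over `alphabet`."""
    symbols = list(alphabet)
    out = []
    for ch in text:
        i = symbols.index(ch)              # current position of the symbol
        out.append(i)
        symbols.insert(0, symbols.pop(i))  # move it to the head
    return out

print(mtf_encode("bananaaa"))  # [1, 1, 13, 1, 1, 1, 0, 0]
```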

 

Decoding:

Input Stream        Sequence  List
[1],1,13,1,1,1,0,0  b         (abcdefghijklmnopqrstuvwxyz)
1,[1],13,1,1,1,0,0  ba        (bacdefghijklmnopqrstuvwxyz)
1,1,[13],1,1,1,0,0  ban       (abcdefghijklmnopqrstuvwxyz)
1,1,13,[1],1,1,0,0  bana      (nabcdefghijklmopqrstuvwxyz)
1,1,13,1,[1],1,0,0  banan     (anbcdefghijklmopqrstuvwxyz)
1,1,13,1,1,[1],0,0  banana    (nabcdefghijklmopqrstuvwxyz)
1,1,13,1,1,1,[0],0  bananaa   (anbcdefghijklmopqrstuvwxyz)
1,1,13,1,1,1,0,[0]  bananaaa  (anbcdefghijklmopqrstuvwxyz)
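
Decoding replays the same list updates in the same order: each index selects a symbol from the current list, and that symbol is then moved to the head. A minimal Python sketch, with `mtf_decode` as an assumed name and the same a~z starting list as the encoder:

```python
import string

def mtf_decode(indices, alphabet: str = string.ascii_lowercase) -> str:
    """Invert the move-to-front transform over `alphabet`."""
    symbols = list(alphabet)
    out = []
    for i in indices:
        ch = symbols.pop(i)     # the index points into the current list
        out.append(ch)
        symbols.insert(0, ch)   # same update the encoder performed
    return "".join(out)

print(mtf_decode([1, 1, 13, 1, 1, 1, 0, 0]))  # bananaaa
```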

 

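As shown, the idea is to exploit the order in which the list changes during the process: encoder and decoder apply the same updates step by step.

This turns the string into numbers. If the original string corresponded to a bunch of large numbers, it now becomes a bunch of small ones, mostly 0s and 1s, so the entropy is lower!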

 

Burrows–Wheeler transform + Move-To-Front

Ori:rabcabcababaabacabcabcabcababaa$
BWT:aabbbbccacccrcbaaaaaaaaaabbbbba$
MTF:0,0,1,0,0,0,2,0,2,1,0,0,17,1,3,3,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,$
ARI:<(6+1)bytes>

 

Question: does MTF really improve the compression ratio? A stream of small numbers is generally good news for compression. With the entropy lowered, Huffman or arithmetic coding is the natural next step.

 

The process is as follows (the List column shows the list before the symbols in that row are read):

Symbols read Sequence List
a 0 (abcdefghijklmnopqrstuvwxyz)
a 0,0 (abcdefghijklmnopqrstuvwxyz)
b 0,0,1 (abcdefghijklmnopqrstuvwxyz)
bbb 0,0,1,0,0,0 (bacdefghijklmnopqrstuvwxyz)
c 0,0,1,0,0,0,2 (bacdefghijklmnopqrstuvwxyz)
c 0,0,1,0,0,0,2,0 (cbadefghijklmnopqrstuvwxyz)
a 0,0,1,0,0,0,2,0,2 (cbadefghijklmnopqrstuvwxyz)
c 0,0,1,0,0,0,2,0,2,1 (acbdefghijklmnopqrstuvwxyz)
cc 0,0,1,0,0,0,2,0,2,1,0,0 (cabdefghijklmnopqrstuvwxyz)
r 0,0,1,0,0,0,2,0,2,1,0,0,17 (cabdefghijklmnopqrstuvwxyz)
c 0,0,1,0,0,0,2,0,2,1,0,0,17,1 (rcabdefghijklmnopqstuvwxyz)
b 0,0,1,0,0,0,2,0,2,1,0,0,17,1,3 (crabdefghijklmnopqstuvwxyz)
a 0,0,1,0,0,0,2,0,2,1,0,0,17,1,3,3 (bcradefghijklmnopqstuvwxyz)
aaaaaaaaa 0,0,1,0,0,0,2,0,2,1,0,0,17,1,3,3,0,0,0,0,0,0,0,0,0 (abcrdefghijklmnopqstuvwxyz)
b 0,0,1,0,0,0,2,0,2,1,0,0,17,1,3,3,0,0,0,0,0,0,0,0,0,1 (abcrdefghijklmnopqstuvwxyz)
bbbb 0,0,1,0,0,0,2,0,2,1,0,0,17,1,3,3,0,0,0,0,0,0,0,0,0,1,0,0,0,0 (bacrdefghijklmnopqstuvwxyz)
a 0,0,1,0,0,0,2,0,2,1,0,0,17,1,3,3,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1 (bacrdefghijklmnopqstuvwxyz)

Final

0,0,1,0,0,0,2,0,2,1,0,0,17,1,3,3,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1

 

Next, apply arithmetic coding:

31 in total ($ not counted yet)

0:  21/31

1:  5/31

2:  2/31

3:  2/31

17: 1/31

This takes 8 bytes.

 

Computing the entropy:

Formula: (-21/31)*log2(21/31)+(-5/31)*log2(5/31)+(-2/31)*log2(2/31)+(-2/31)*log2(2/31)+(-1/31)*log2(1/31)

Calculator: https://web2.0calc.com/

 

Comparison:

Each symbol takes about 1.48 bits on average, that is:

1.48*31≈45.9, about 46 bits, i.e. 6 bytes

Plus the terminator: 6+1($) = 7 bytes

Finally, 15 bytes does the job (presumably the 8 bytes noted above plus these 7)!

 

RLE:aab4ccac3rcba10b5a$

18 bytes in total!
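
The figures in this comparison can be recomputed with a short script: MTF-encode the BWT string, count how often each index occurs, and turn the empirical entropy into a bit count. The sketch below is only an estimate under stated assumptions. The helper names `mtf_encode` and `entropy_bits` are hypothetical, the `$` terminator is left out as in the text, and the entropy is a lower bound for a zero-order coder that ignores the cost of storing the frequency table.

```python
import math
import string
from collections import Counter

def mtf_encode(text, alphabet=string.ascii_lowercase):
    """Move-to-front encode `text` over `alphabet`."""
    symbols = list(alphabet)
    out = []
    for ch in text:
        i = symbols.index(ch)
        out.append(i)
        symbols.insert(0, symbols.pop(i))
    return out

def entropy_bits(values):
    """Empirical zero-order entropy in bits per symbol."""
    counts = Counter(values)
    total = len(values)
    return -sum(c / total * math.log2(c / total) for c in counts.values())

bwt_text = "aabbbbccacccrcbaaaaaaaaaabbbbba"  # BWT output, "$" left out
indices = mtf_encode(bwt_text)
h = entropy_bits(indices)
payload_bits = math.ceil(h * len(indices))    # ideal coded size in bits
payload_bytes = math.ceil(payload_bits / 8)

print(len(indices), sorted(Counter(indices).items()))
print(round(h, 2), payload_bits, payload_bytes)
```

Recomputing the counts this way is a quick cross-check on a hand-worked trace, since one slip in the list state changes every later index.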

 


 

Summary:

There are many possible combinations. The above only explored the effect of BWT + MTF + Arithmetic Coding.

 


 

BWT+Run-Length --> Run-length FM-Index

 

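See also: [IR] Advanced XML Compression - XBW

Both rely on the same compression idea (run-length encoding). Here the Last column is replaced by the B and S columns, and the B column compresses better.

 

Note that '1' marks the head of a run and '0' marks the rest of it.

The Front column is replaced by the B' and S columns, and the B' column compresses better. S has to be sorted before use, which turns the S column into the C column.

Note: each element of the S and C columns stands for a block (a run), and blocks with the same character keep their relative order.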

index  Last  B  S*  B'  Front  C*
1      e     1  e   1   $      $
2      e     0  d   1   _      _
3      d     1  _   1   _      _
4      _     1  n   1   a      a
5      n     1  r   1   d      d
6      r     1  h   1   e      e
7      r     0  t   0   e      e
8      h     1  $   1   e      h
9      h     0  a   0   e      n
10     t     1  e   1   h      r
11     $     1  _   0   h      t
12     a     1  -   1   n      -
13     e     1  -   1   r      -
14     e     0  -   0   r      -
15     _     1  -   1   t      -

(- marks an empty cell: S* and C* have only 11 entries.)
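
The four run-length columns in the table can be derived from the Last column alone. The Python sketch below splits Last into runs, emits B and S from the runs in their original order, then stably sorts the runs by character to obtain C and B'. The function name `rlfm_columns` is an assumption made for illustration, and this is only a sketch of how the columns are built, not of the full index with its rank/select structures.

```python
from itertools import groupby

def rlfm_columns(last: str):
    """Return (B, S, B', C) for the run-length view of `last`."""
    runs = [(ch, len(list(g))) for ch, g in groupby(last)]
    # B: '1' at the head of each run, '0' for the rest of the run.
    b = "".join("1" + "0" * (n - 1) for _, n in runs)
    # S: one character per run, in text order.
    s = "".join(ch for ch, _ in runs)
    # Stable sort keeps runs of the same character in their original order.
    sorted_runs = sorted(runs, key=lambda run: run[0])
    b_prime = "".join("1" + "0" * (n - 1) for _, n in sorted_runs)
    c = "".join(ch for ch, _ in sorted_runs)
    return b, s, b_prime, c

b, s, b_prime, c = rlfm_columns("eed_nrrhht$aee_")
print(b)        # 101111010111101
print(s)        # ed_nrht$ae_
print(b_prime)  # 111111010101101
print(c)        # $__adeehnrt
```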

 

With this scheme:

B+B' = 15 bits + 15 bits = 30 bits, about 4 Bytes.

S: 11 Bytes

In total: 4+11+1 (for the length) = 16 Bytes.

 

The original approach:

Last column: 15 Bytes

C Table (for the Front column): row:9 * col:2 = 18 Bytes

$ 1
_ 2
a 4
d 5
e 6
h 10
n 12
r 13
t 15

In total: 15+18=33 Bytes.
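
The C table lists, for every character, the first row at which it appears in the Front column, and Front is simply the sorted Last column. A minimal Python sketch using the table's 1-based row numbers, with `c_table` as an assumed name:

```python
def c_table(last: str) -> dict[str, int]:
    """Map each character to its first (1-based) row in the sorted column."""
    front = sorted(last)
    table = {}
    for row, ch in enumerate(front, start=1):
        table.setdefault(ch, row)  # keep only the first occurrence
    return table

print(c_table("eed_nrrhht$aee_"))
```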

 

Trading time for space does pay off to some extent.
