Ref: [IR] Compression
Ref: [IR] Link Analysis
Planar Graph
From: http://www.csie.ntnu.edu.tw/~u91029/PlanarGraph.html#1oop
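Because dual graphs lack an elegant general pattern, it is customary to ignore isomorphism when discussing them.
The most distinctive dual-graph examples are the bridge and the self-loop (loop).
For example, if the original graph is a tree, its dual is a single vertex with a pile of self-loops; different trees correspond to different ways of nesting those loops.
Spanning Tree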
From: http://www.csie.ntnu.edu.tw/~u91029/SpanningTree.html
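For ways to extract a tree from a graph, see the minimum spanning tree algorithms and related material in [Optimization] Greedy method.
The following looks at strategies for compressing a graph.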
Idea:
Links that lie on the spanning tree are recorded with - and + symbols.
Links that are not on the spanning tree are added as balanced brackets.
1 | 2 | 3 | 2 | 4 | 2 | 1 | 5 | 6 | 5 | 7 | 8 | 7 | 5
- | - | + | - | + | + | - | - | + | - | - | + | +
(( | )(( | (( | ) | ))( | ) | ) | )
12 | 234 | 56 | 6 | 547 | 7 | 3 | 1 |
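Final encoding: --((+-)((+((+-)-))(+-)-)++)
In other words, the tree is represented by a DFS walk, and the remaining links are expressed with balanced brackets. (A trick for coaxing kids.)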
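To make the -/+ row concrete, here is a minimal sketch of the tree part only. It assumes '-' means stepping down a tree link and '+' means stepping back up, that the final return to the root is left implicit, and that the tree shape is the one implied by the visit order 1 2 3 2 4 2 1 5 6 5 7 8 7 5. The function name and the dictionary layout are hypothetical, and the bracket step for non-tree links is not covered.

```python
def encode_tree_moves(children, root):
    """Return the DFS move string: '-' for stepping down, '+' for stepping back up."""
    moves = []

    def dfs(node):
        for child in children.get(node, []):
            moves.append("-")  # descend along a tree link
            dfs(child)
            moves.append("+")  # return along the same link

    dfs(root)
    # Assumption: the last step back to the root is implicit, so it is dropped.
    return "".join(moves[:-1])


# Tree shape inferred from the visit order in the table (an assumption).
tree = {1: [2, 5], 2: [3, 4], 5: [6, 7], 7: [8]}
print(encode_tree_moves(tree, 1))  # --+-++--+--++
```

The printed string has the same 13 symbols as the -/+ row of the table.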
Adjacency matrix, adjacency list
Each vertex is associated with a (sorted / unsorted) array of adjacent vertices.
More space-efficient for sparse graphs.
This is just the basic adjacency list, which handles sparse data well.
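A small sketch of the sorted-array variant, assuming vertices are numbered from 0 and edges are directed pairs (both are assumptions made for illustration; the names are hypothetical):

```python
from bisect import bisect_left


def build_adjacency_list(num_vertices, edges):
    """Store, for each vertex, a sorted array of the vertices it points to."""
    adj = [[] for _ in range(num_vertices)]
    for u, v in edges:
        adj[u].append(v)
    for neighbours in adj:
        neighbours.sort()
    return adj


def has_edge(adj, u, v):
    """Binary search in u's sorted neighbour array."""
    i = bisect_left(adj[u], v)
    return i < len(adj[u]) and adj[u][i] == v


edges = [(0, 1), (0, 3), (1, 2), (3, 2)]
adj = build_adjacency_list(4, edges)
```

Only the edges that exist take up space, whereas an adjacency matrix for the same graph would need 4 x 4 cells regardless of how many edges there are.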
Link: http://www.touchgraph.com/TGGoogleBrowser.html
The main problems are:
• Graph is highly dynamic
– Nodes and edges are added/deleted often
– Content of existing nodes is also subject to change
– Pages and hyperlinks created on the fly
• Apart from primary connected component there are also smaller disconnected components
The main characteristics are:
Locality: most hyperlinks are local, i.e., they point to other URLs on the same host.
The literature reports that on average 80% of the hyperlinks are local.
Consecutivity: links within the same page are likely to be consecutive with respect to lexicographic order.
Similarity: Pages on the same host tend to have many hyperlinks pointing to the same pages.
The following can be combined with [IR] Compression. (They share much the same compression ideas.)
Connectivity Server: URL compression
This is essentially a scheme along the lines of front coding (removing prefix redundancy).
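A minimal front-coding sketch, assuming the URLs are already sorted lexicographically and each one is stored as (length of the prefix shared with the previous URL, remaining suffix). The function names and the example.com URLs are made up for illustration; the actual Connectivity Server layout may differ.

```python
def front_encode(sorted_urls):
    """Replace each URL by (shared prefix length with the previous URL, suffix)."""
    encoded = []
    prev = ""
    for url in sorted_urls:
        shared = 0
        limit = min(len(prev), len(url))
        while shared < limit and prev[shared] == url[shared]:
            shared += 1
        encoded.append((shared, url[shared:]))
        prev = url
    return encoded


def front_decode(encoded):
    """Rebuild the URLs by reusing the prefix of the previously decoded URL."""
    urls = []
    prev = ""
    for shared, suffix in encoded:
        url = prev[:shared] + suffix
        urls.append(url)
        prev = url
    return urls


urls = ["www.example.com/a", "www.example.com/a/b", "www.example.com/c"]
encoded = front_encode(urls)
```

Because pages on the same host sort next to each other, most entries end up storing only a short suffix.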
Delta Encoding of the Adjacency Lists
Compression results:
Avg. inlink size: 34 bits --> 8.9 bits
Avg. outlink size: 24 bits --> 11.03 bits
Principle:
Delta encoding is a way of storing or transmitting data in the form of differences (deltas) between sequential data rather than complete files;
more generally this is known as data differencing. Delta encoding is sometimes called delta compression,
particularly where archival histories of changes are required (e.g., in revision control software).
That is, compression comes from recording only the differences.
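Applied to an adjacency list, this could look like the sketch below. It assumes the list is sorted in increasing order and that the first entry is stored as a difference from 0; the function names are hypothetical, and the bit-level coding of the resulting small numbers is left out.

```python
def delta_encode(sorted_ids):
    """Store each id as the difference from the previous one (first one from 0)."""
    deltas = []
    prev = 0
    for x in sorted_ids:
        deltas.append(x - prev)
        prev = x
    return deltas


def delta_decode(deltas):
    """Running sum of the differences restores the original ids."""
    ids = []
    total = 0
    for d in deltas:
        total += d
        ids.append(total)
    return ids


successors = [15, 16, 17, 22, 23, 24, 315, 316, 317, 3041]
deltas = delta_encode(successors)
```

Most of the differences are tiny, so a variable-length code needs far fewer bits for them than for the full ids.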
Interlist compression with representative list
ref : relative index of the representative adjacency list;
deletes: set of URL-ids to delete from the representative list, i.e., which entries to drop.
adds: set of URL-ids to add to the representative list, i.e., the entries that take their place.
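A rough sketch of the deletes/adds idea, assuming both lists are sorted and hold unique URL-ids. The function names and the sample ids are invented for illustration, and choosing the representative list (the ref field) is left to the caller.

```python
def encode_against(reference, target):
    """Describe target as the reference list minus `deletes` plus `adds`."""
    ref_set, tgt_set = set(reference), set(target)
    deletes = sorted(ref_set - tgt_set)
    adds = sorted(tgt_set - ref_set)
    return deletes, adds


def decode_against(reference, deletes, adds):
    """Rebuild the target list from the reference list and the two edit sets."""
    drop = set(deletes)
    kept = [x for x in reference if x not in drop]
    return sorted(kept + list(adds))


reference = [2, 5, 9, 12]        # hypothetical representative list
target = [2, 5, 10, 12, 20]      # hypothetical list to compress
deletes, adds = encode_against(reference, target)
```

The more two lists overlap (the Similarity property above), the shorter the two edit sets become.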
Compression results:
Avg. inlink size: 5.66 bits
Avg. outlink size: 5.61 bits
(WebGraph Framework)
-- the process is described below
Compression results:
Avg. inlink size: 3.08 bits
Avg. outlink size: 2.89 bits
Compressing Gaps
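Note:
Check whether S1 - X is positive or negative, then map it through v(x) to get the first value in the Successors column.
The value of v(x) is, in fact: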
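As a sketch of what v(x) does, assuming the usual WebGraph convention: a non-negative x maps to 2x and a negative x maps to 2|x| - 1, so every integer becomes a non-negative one. This definition, and the rule that every later gap is stored as the difference minus 1, are assumptions rather than something stated above; the function names are hypothetical.

```python
def v(x):
    """Map any integer to a non-negative one (assumed convention: 2x or 2|x| - 1)."""
    return 2 * x if x >= 0 else 2 * abs(x) - 1


def v_inverse(n):
    """Undo v: even values were non-negative, odd values were negative."""
    return n // 2 if n % 2 == 0 else -(n + 1) // 2


def encode_gaps(node, successors):
    """First entry: v(S1 - X). Later entries: gap to the previous successor, minus 1."""
    if not successors:
        return []
    out = [v(successors[0] - node)]
    for prev, cur in zip(successors, successors[1:]):
        out.append(cur - prev - 1)
    return out


def decode_gaps(node, gaps):
    """Invert encode_gaps to recover the sorted successor list."""
    if not gaps:
        return []
    successors = [node + v_inverse(gaps[0])]
    for g in gaps[1:]:
        successors.append(successors[-1] + g + 1)
    return successors


gaps = encode_gaps(16, [15, 16, 17, 22, 23, 24, 315, 316, 317, 3041])
```

Only the first gap can be negative (a page may link to a node with a smaller id than its own), which is why only that entry goes through v(x).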
Using copy lists
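A copy scheme can be used: for example, here Node 15 (Outdegree 11) serves as the reference for a 0/1 sequence (1 means copy).
The other rows take this row as the reference and only need to store the entries that were not copied.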
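A minimal sketch of building a copy list. The successor list used for Node 15 is an assumption, chosen to be consistent with the copy list 01110011010 that appears in the walkthrough further down; the function name is hypothetical.

```python
def make_copy_list(reference, target):
    """One bit per reference entry (1 = copied into target), plus the leftovers."""
    target_set = set(target)
    bits = "".join("1" if r in target_set else "0" for r in reference)
    reference_set = set(reference)
    extras = [t for t in target if t not in reference_set]
    return bits, extras


# Assumed successors of Node 15 (Outdegree 11); not given in the text above.
node15 = [13, 15, 16, 17, 18, 19, 23, 24, 203, 315, 1034]
node16 = [15, 16, 17, 22, 23, 24, 315, 316, 317, 3041]
bits, extras = make_copy_list(node15, node16)
```

Here Node 16 only needs 11 bits plus the four entries it does not share with Node 15.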
But the 0/1 sequence seems to contain too many 0s; can we do something targeted about that?
Using copy blocks
Feature: copy and skip alternate.
A few points here are a bit convoluted, so here is a spelled-out explanation:
Encoding:
1. The last block is omitted.
2. The first copy block is 0 if the copy list starts with 0; that is, if the 0/1 sequence starts with 0, the copy blocks also start with 0.
3. The length is decremented by one for all blocks except the first one.
16, 10, 1, 01110011010
The first 0 acts as a flag; the second 0 is the 0 of the 1st block below.
1st block: 0
2nd block: 3-1=2
3rd block: 2-1=1
4th block: 2-1=1
5th block: 1-1=0
6th block: 1-1=0
7th block: 1-1=0 // The last block is omitted;
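The three encoding rules can be sketched as follows. This is one reading of them, in which the leading 0 is simply an empty first copy block; the function name is hypothetical.

```python
from itertools import groupby


def copy_list_to_blocks(bits):
    """Turn a 0/1 copy list into copy blocks following the three rules above."""
    # Lengths of the alternating runs of equal bits.
    lengths = [len(list(group)) for _, group in groupby(bits)]
    if bits.startswith("0"):
        lengths.insert(0, 0)   # rule 2: an empty first copy block
    lengths = lengths[:-1]     # rule 1: the last block is omitted
    # Rule 3: every block except the first is stored minus one.
    return [n if i == 0 else n - 1 for i, n in enumerate(lengths)]


print(copy_list_to_blocks("01110011010"))  # [0, 0, 2, 1, 1, 0, 0]
```

The output starts with the leading 0 and is followed by 0, 2, 1, 1, 0, 0, matching the blocks listed above once the omitted last one is left out.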
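In fact, at the very least, copy blocks can be turned back into copy lists.
Note: whether one of the node's own numbers falls between a copy and a skip can be determined from the increasing order of the list!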
Decoding:
copy next 2+1=3 -> 15 16 17
skip next 1+1=2 -> 15 16 17
copy next 1+1=2 -> 15 16 17 22 23 24 // because the list is increasing, 22 goes before 23, 24
skip next 0+1=1 -> 15 16 17 22 23 24
copy next 0+1=1 -> 15 16 17 22 23 24 315
skip left 0+1=1 -> 15 16 17 22 23 24 315 316 317 3041
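The decoding walk can be sketched like this. It assumes the blocks are 0, 0, 2, 1, 1, 0, 0, that the reference list is the 11-entry successor list of Node 15 (the concrete values are an assumption), and that the node's own extra entries are 22, 316, 317, 3041; the function names are hypothetical.

```python
def blocks_to_copy_list(blocks, reference_length):
    """Expand copy blocks back into the 0/1 copy list."""
    bits = []
    copying = True  # blocks alternate, starting with a copy block
    for i, b in enumerate(blocks):
        length = b if i == 0 else b + 1  # undo the "minus one" on later blocks
        bits.append(("1" if copying else "0") * length)
        copying = not copying
    # The omitted last block covers whatever is left of the reference list.
    remaining = reference_length - sum(len(part) for part in bits)
    bits.append(("1" if copying else "0") * remaining)
    return "".join(bits)


def decode_successors(reference, blocks, extras):
    """Copy the flagged reference entries, then merge in the node's own entries."""
    bits = blocks_to_copy_list(blocks, len(reference))
    copied = [r for r, bit in zip(reference, bits) if bit == "1"]
    return sorted(copied + list(extras))  # increasing order puts 22 before 23


# Assumed successors of Node 15; not given in the text above.
node15 = [13, 15, 16, 17, 18, 19, 23, 24, 203, 315, 1034]
result = decode_successors(node15, [0, 0, 2, 1, 1, 0, 0], [22, 316, 317, 3041])
```

Sorting at the end is what places each extra entry at the right spot between the copied ones.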
Addendum:
If the 0/1 sequence does not start with 0, then the first copy block is not 0.
The compression techniques are specialized for Web Graphs.
The average link size decreases with the increase of the graph.
The average link access time increases with the increase of the graph.
The seems to have the best trade-off between avg. bit size and access time.