[IR] Graph Compression

Date: 2019-11-18

Tags: graph compression


Ref: [IR] Compression

Ref: [IR] Link Analysis

Planar Graph

From: http://www.csie.ntnu.edu.tw/~u91029/PlanarGraph.html#1

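Because dual graphs lack an elegant regularity, it is customary to ignore isomorphism when discussing them.

The most distinctive dual-graph examples are the bridge and the self-loop.

For example, if the original graph is a tree, its dual is a single vertex with a pile of self-loops; different trees correspond to different ways of nesting those loops.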

Spanning Tree

From: http://www.csie.ntnu.edu.tw/~u91029/SpanningTree.html

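For ways to extract a tree from a graph, see the minimum spanning tree algorithms and related material in [Optimization] Greedy method.

The following discusses a strategy for compressing a graph.

Idea:

Links that lie on the spanning tree are recorded with - and + symbols.

Links that are not on the spanning tree are recorded with balanced brackets.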

1		2		3		2		4		2		1		5		6		5		7		8		7		5
	-		-		+		-		+		+		-		-		+		-		-		+		+
				((				)((		((				)		))(				)		)				)
				12				234		56				6		547				7		3				1

Final encoding: --((+-)((+((+-)-))(+-)-)++)

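In essence, the tree is represented by a DFS traversal, and the remaining links are expressed with balanced brackets. (A simple trick, nothing more.)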
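The table above can be reproduced with a short Python sketch. It assumes that - means "descend to a vertex seen for the first time" and + means "return to the parent", which is how the table reads; the bracket strings are copied from the table rather than derived from the non-tree links. The names encode_walk and is_balanced are illustrative only.

```python
def encode_walk(walk, brackets):
    """Interleave DFS moves with the bracket strings attached to each visit.

    walk     -- vertices in the order the DFS visits them
    brackets -- {index into walk: bracket string written at that visit}
    """
    seen = {walk[0]}
    out = [brackets.get(0, "")]
    for i in range(1, len(walk)):
        vertex = walk[i]
        # Assumption: '-' is a step down to a new vertex, '+' a step back up.
        out.append("+" if vertex in seen else "-")
        seen.add(vertex)
        out.append(brackets.get(i, ""))
    return "".join(out)


def is_balanced(code):
    """Check that every ')' closes an earlier '(' and nothing stays open."""
    depth = 0
    for ch in code:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
            if depth < 0:
                return False
    return depth == 0


# Walk and bracket strings taken from the table above (0-based positions).
walk = [1, 2, 3, 2, 4, 2, 1, 5, 6, 5, 7, 8, 7, 5]
brackets = {2: "((", 4: ")((", 5: "((", 7: ")", 8: "))(", 10: ")", 11: ")", 13: ")"}

code = encode_walk(walk, brackets)
print(code)
print(is_balanced(code))
```

Each matched bracket pair stands for one link that is not on the spanning tree, so the seven pairs here correspond to seven extra links.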

Adjacency Matrix, Adjacency List

Each vertex is associated with a (sorted or unsorted) array of adjacent vertices.
More space-efficient for sparse graphs.

This is just the basic adjacency list, which addresses the sparsity problem.
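As a rough illustration of the space argument, the sketch below builds both representations for a small made-up directed graph and counts the stored cells. The edge set and function names are hypothetical, not taken from the original post.

```python
def build_adjacency_list(num_vertices, edges, keep_sorted=True):
    """Return one array of out-neighbours per vertex."""
    adjacency = [[] for _ in range(num_vertices)]
    for source, target in edges:
        adjacency[source].append(target)
    if keep_sorted:
        for neighbours in adjacency:
            neighbours.sort()
    return adjacency


def build_adjacency_matrix(num_vertices, edges):
    """Return a dense 0/1 matrix with one cell per ordered vertex pair."""
    matrix = [[0] * num_vertices for _ in range(num_vertices)]
    for source, target in edges:
        matrix[source][target] = 1
    return matrix


# Hypothetical sparse graph: 5 vertices, 4 directed edges.
edges = [(0, 3), (0, 1), (2, 0), (4, 2)]
adjacency = build_adjacency_list(5, edges)
matrix = build_adjacency_matrix(5, edges)

list_cells = sum(len(neighbours) for neighbours in adjacency)
matrix_cells = sum(len(row) for row in matrix)
print(adjacency)
print(list_cells, matrix_cells)
```

The list stores one entry per edge, while the matrix always stores n * n cells, which is why the list wins when most pairs are not linked.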

Web Graph representation and compression

Link: http://www.touchgraph.com/TGGoogleBrowser.html

The main challenges are:

• Graph is highly dynamic
– Nodes and edges are added/deleted often
– Content of existing nodes is also subject to change
– Pages and hyperlinks created on the fly
• Apart from primary connected component there are also smaller disconnected components

The main characteristics are:

Locality: most hyperlinks are usually local, i.e., they point to other URLs on the same host.
The literature reports that on average 80% of the hyperlinks are local.

Consecutivity: links within the same page are likely to be consecutive with respect to lexicographic order.

Similarity: Pages on the same host tend to have many hyperlinks pointing to the same pages.

The following material can be combined with [IR] Compression; both rely on similar compression ideas.

Connectivity Server: URL compression

This is essentially a scheme similar to front coding, which exploits prefix redundancy.
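A minimal sketch of front coding on a sorted URL list follows. It is a generic illustration of prefix sharing, not the Connectivity Server's actual on-disk format, and the example.com URLs are made up.

```python
import os


def front_encode(sorted_urls):
    """Store each URL as (length of prefix shared with the previous URL, remaining suffix)."""
    encoded = []
    previous = ""
    for url in sorted_urls:
        shared = len(os.path.commonprefix([previous, url]))
        encoded.append((shared, url[shared:]))
        previous = url
    return encoded


def front_decode(encoded):
    """Rebuild the URLs by reusing the shared prefix of the previous URL."""
    urls = []
    previous = ""
    for shared, suffix in encoded:
        url = previous[:shared] + suffix
        urls.append(url)
        previous = url
    return urls


# Hypothetical URLs, already in lexicographic order.
urls = [
    "www.example.com/a",
    "www.example.com/a/b.html",
    "www.example.com/c.html",
]
encoded = front_encode(urls)
print(encoded)
print(front_decode(encoded) == urls)
```

Because URLs on the same host sort next to each other, the shared prefix is usually long and only a short suffix has to be stored.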

Delta Encoding of the Adjacency Lists

Compression results:

Avg. inlink size: 34 bits --> 8.9 bits
Avg. outlink size: 24 bits --> 11.03 bits

Principle:

Delta encoding is a way of storing or transmitting data in the form of differences (deltas) between sequential data rather than complete files; more generally this is known as data differencing. Delta encoding is sometimes called delta compression, particularly where archival histories of changes are required (e.g., in revision control software).

In other words, compression is achieved by recording only the differences.
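Applied to a sorted adjacency list, the idea might look like the sketch below. The list is the successor list of node 16 that appears in the decoding trace later in this post; the function names are illustrative.

```python
from itertools import accumulate


def delta_encode(sorted_ids):
    """Replace each id by its difference from the previous id (the first id is kept as is)."""
    deltas = []
    previous = 0
    for value in sorted_ids:
        deltas.append(value - previous)
        previous = value
    return deltas


def delta_decode(deltas):
    """Running sums turn the differences back into the original ids."""
    return list(accumulate(deltas))


successors = [15, 16, 17, 22, 23, 24, 315, 316, 317, 3041]
deltas = delta_encode(successors)
print(deltas)
print(delta_decode(deltas) == successors)
```

Most deltas are small numbers, so a variable-length integer code can store them in far fewer bits than the full ids.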

Interlist compression with representative list

ref: relative index of the representative adjacency list;
deletes: set of URL-ids to delete from the representative list, i.e., which entries to drop;
adds: set of URL-ids to add to the representative list, i.e., the entries that take their place.
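A small sketch of how a list could be expressed against a representative list, assuming deletes and adds are plain sets of URL-ids as the definitions above state. The ids and function names are hypothetical, and the choice of the representative (the ref field) is left out.

```python
def encode_against(representative, target):
    """Describe target as the representative list minus deletes plus adds."""
    deletes = sorted(set(representative) - set(target))
    adds = sorted(set(target) - set(representative))
    return deletes, adds


def decode_against(representative, deletes, adds):
    """Rebuild the target list from the representative list and the two sets."""
    return sorted((set(representative) - set(deletes)) | set(adds))


# Hypothetical URL-ids for two pages with similar out-links.
representative = [3, 7, 12, 20, 31]
target = [3, 7, 15, 20, 31, 40]

deletes, adds = encode_against(representative, target)
print(deletes, adds)
print(decode_against(representative, deletes, adds) == target)
```

The more two adjacency lists overlap, the smaller deletes and adds become, which is where the Similarity property pays off.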

Compression results:

Avg. inlink size: 5.66 bits
Avg. outlink size: 5.61 bits

(WebGraph Framework)

-- the process is described below

Compression results:

Avg. inlink size: 3.08 bits
Avg. outlink size: 2.89 bits

Compressing Gaps

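Note:

S₁ - x can be positive or negative, so it is mapped through v(x) to obtain the first value in the Successors column.

The parity of v(x) reveals the sign of x:

- odd: x < 0
- even: x >= 0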
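One mapping with exactly this parity property is v(x) = 2x for x >= 0 and v(x) = 2|x| - 1 for x < 0. The sketch below assumes that this is the v(x) meant here; the post itself only states the parity rule. The inverse function is added for illustration.

```python
def v(x):
    """Map a signed integer to a non-negative one: even for x >= 0, odd for x < 0.

    Assumption: v(x) = 2x if x >= 0, else 2|x| - 1.
    """
    return 2 * x if x >= 0 else 2 * abs(x) - 1


def v_inverse(code):
    """Recover the signed integer from its code using the parity."""
    if code % 2 == 0:
        return code // 2
    return -((code + 1) // 2)


# Node 16 has 15 as its first successor (see the decoding trace later in this post),
# so the first stored value would be v(15 - 16) = v(-1).
first_gap = v(15 - 16)
print(first_gap)
print([v(x) for x in (-3, -2, -1, 0, 1, 2, 3)])
```

Only the first value needs this treatment: later successors are larger than their predecessors, so their differences are never negative.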

Using copy lists

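A copy scheme can be used. For example, here the row for Node 15 (outdegree 11) serves as the reference for a 0/1 sequence, where 1 means "copy".

The other rows use this row as the reference and only need to store the entries that were not produced by a copy.

However, the 0/1 sequence seems to contain a lot of 0s. Can we do something targeted about that?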
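A sketch of how a copy list could be applied. The bit string 01110011010 and the resulting list for node 16 appear later in this post; the reference list for node 15 is only partly recoverable from them, so the entries 13, 18, 19, 203 and 1034 below are assumed values for the skipped positions. The extra successors 22, 316, 317 and 3041 are the entries of node 16's final list that are not copied.

```python
def apply_copy_list(reference, copy_bits, extras):
    """Keep reference entries whose bit is 1, then merge in the node's own extra successors."""
    if len(copy_bits) != len(reference):
        raise ValueError("copy list needs one bit per reference entry")
    copied = [succ for succ, bit in zip(reference, copy_bits) if bit == "1"]
    return sorted(copied + extras)


# Node 15, outdegree 11. Entries 13, 18, 19, 203, 1034 are ASSUMED for illustration.
reference = [13, 15, 16, 17, 18, 19, 23, 24, 203, 315, 1034]
copy_bits = "01110011010"
extras = [22, 316, 317, 3041]

successors_16 = apply_copy_list(reference, copy_bits, extras)
print(successors_16)
```

Instead of ten successors, node 16 stores eleven bits plus four extra values.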

Using copy blocks

Feature: copy and skip blocks alternate.

A few points here are confusing, so here is a step-by-step explanation:

Encoding:

1. The last block is omitted.
2. The first copy block is 0 if the copy list starts with 0; that is, if the 0/1 sequence starts with 0, the copy blocks also start with 0.
3. The length is decremented by one for all blocks except the first one.

16, 10, 1, 01110011010

The first 0 acts as a flag; the second 0 is the 0 of the 1st block listed below.

1st block: 0

2nd block: 3-1=2

3rd block: 2-1=1

4th block: 2-1=1

5th block: 1-1=0

6th block: 1-1=0

7th block: 1-1=0 // The last block is omitted;
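The three encoding rules can be sketched as a pair of conversions between a copy list and its copy blocks. This is an illustration of the rules as stated above, not code from the WebGraph framework; the function names are made up, and decoding assumes the length of the reference list is known so the omitted last block can be restored.

```python
from itertools import groupby


def copy_list_to_blocks(bits):
    """Turn a 0/1 copy list into alternating copy/skip block lengths."""
    lengths = [len(list(run)) for _, run in groupby(bits)]
    if bits.startswith("0"):
        lengths.insert(0, 0)          # rule 2: empty first copy block
    stored = lengths[:-1]             # rule 1: the last block is omitted
    # rule 3: every block except the first is stored as length - 1
    return stored[:1] + [n - 1 for n in stored[1:]]


def blocks_to_copy_list(blocks, reference_length):
    """Expand copy blocks back into the 0/1 copy list."""
    parts = []
    bit = "1"                         # the first block is always a copy block
    for index, block in enumerate(blocks):
        length = block if index == 0 else block + 1
        parts.append(bit * length)
        bit = "0" if bit == "1" else "1"
    bits = "".join(parts)
    # The omitted last block covers whatever is left of the reference list.
    return bits + bit * (reference_length - len(bits))


blocks = copy_list_to_blocks("01110011010")
print(blocks)
print(blocks_to_copy_list(blocks, 11))
```

For this copy list the stored sequence starts with the flag-like 0 for the empty first copy block, followed by the decremented lengths of the remaining blocks.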

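At the very least, copy blocks can always be expanded back into a copy list.

Note: whether a node's own extra numbers go between a copy and a skip can be decided from the fact that the list is increasing.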

Decoding:

copy next 2+1=3 -> 15 16 17

skip next 1+1=2 -> 15 16 17

copy next 1+1=2 -> 15 16 17 22 23 24     // because the list is increasing, 22 goes before 23 and 24

skip next 0+1=1 -> 15 16 17 22 23 24

copy next 0+1=1 -> 15 16 17 22 23 24 315

copy left 0+1=1 -> 15 16 17 22 23 24 315 316 317 3041
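The trace above can be mirrored by walking the reference list block by block. In this sketch the block sequence 0, 0, 2, 1, 1, 0, 0 is the one derived from the copy list 01110011010 under the encoding rules; the skipped reference entries 13, 18, 19, 203 and 1034 are assumed values, since only the copied entries can be recovered from the trace. The function name is illustrative.

```python
def decode_with_blocks(reference, blocks, extras):
    """Alternate copy and skip over the reference list, then merge in the extras."""
    copied = []
    position = 0
    copying = True                     # the first block is a copy block
    for index, block in enumerate(blocks):
        length = block if index == 0 else block + 1
        if copying:
            copied.extend(reference[position:position + length])
        position += length
        copying = not copying
    if copying:
        # The omitted last block is a copy block: take the rest of the reference.
        copied.extend(reference[position:])
    # The list is increasing, so sorting places each extra in its proper spot.
    return sorted(copied + extras)


# Node 15's list; 13, 18, 19, 203, 1034 are ASSUMED for the skipped positions.
reference = [13, 15, 16, 17, 18, 19, 23, 24, 203, 315, 1034]
blocks = [0, 0, 2, 1, 1, 0, 0]
extras = [22, 316, 317, 3041]

decoded = decode_with_blocks(reference, blocks, extras)
print(decoded)
```

Here the omitted last block is a skip block, so nothing more is copied after 315 and the remaining extras land at the end.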

Addendum:

If the 0/1 sequence does not start with 0, then the first copy block is not 0.

Conclusions

The compression techniques are specialized for Web Graphs.
The average link size decreases with the increase of the graph.
The average link access time increases with the increase of the graph.
The seems to have the best trade-off between avg. bit size and access time.
