[IR] Suffix Trees and Suffix Arrays

Date: 2019-11-18

Tags: suffix trees, suffix arrays


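Basic Concepts

Prefix Tree (Trie)

Matching string prefixes is exactly what a trie is built for. Typical uses:

1. Fast string lookup

2. Longest common prefix (LCP)

and similar algorithms.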
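As a minimal sketch of the two uses listed above, the following trie supports prefix lookup and the longest common prefix of all inserted words. The class and method names (`Trie`, `has_prefix`, `longest_common_prefix`) are illustrative choices, not taken from the original post.

```python
class TrieNode:
    def __init__(self):
        self.children = {}    # char -> TrieNode
        self.is_word = False  # True if an inserted word ends here


class Trie:
    def __init__(self):
        self.root = TrieNode()

    def insert(self, word):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.is_word = True

    def has_prefix(self, prefix):
        """Return True if some inserted word starts with `prefix`."""
        node = self.root
        for ch in prefix:
            node = node.children.get(ch)
            if node is None:
                return False
        return True

    def longest_common_prefix(self):
        """Follow the single unbranched path from the root."""
        node, out = self.root, []
        while len(node.children) == 1 and not node.is_word:
            ch, node = next(iter(node.children.items()))
            out.append(ch)
        return "".join(out)


trie = Trie()
for w in ["banana", "band", "bandit"]:
    trie.insert(w)
```

With these three words, `trie.has_prefix("ban")` is true and `trie.longest_common_prefix()` returns `"ban"`, because the path branches right after the `n`.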

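Tree Compression

Suffix Tree

Covers every substring

in a relatively memory-efficient way. For example:

Let s=abab, a suffix tree of s is a compressed trie of all suffixes of s=abab$

Naive Construction of a Suffix Tree

Insert the suffixes from longest to shortest, growing the tree step by step.

When all suffixes are inserted, apply tree compression.

Label the start positions: in the original figure, each green leaf node records the start position (the i-th index) of the suffix it represents.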
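The naive construction described above can be sketched as follows. This is an uncompressed suffix trie (the compression step is left out), nodes are plain dictionaries, positions are 0-based, and the names `build_suffix_trie` and `leaf_starts` are assumptions made for this example.

```python
def build_suffix_trie(s):
    """Naive O(n^2) suffix trie of s + '$'; each leaf stores its start position."""
    text = s + "$"
    root = {}
    for start in range(len(text)):        # longest suffix first
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["start"] = start             # leaf label; cannot clash with 1-char keys
    return root


def leaf_starts(node):
    """Collect the start positions stored in all leaves below `node`."""
    starts, stack = [], [node]
    while stack:
        cur = stack.pop()
        for key, child in cur.items():
            if key == "start":
                starts.append(child)
            else:
                stack.append(child)
    return sorted(starts)


root = build_suffix_trie("abab")
```

For `abab$` there are five suffixes, so `leaf_starts(root)` returns `[0, 1, 2, 3, 4]`, and the leaves below the branch `a`, `b` are the suffixes starting at 0 and 2.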

The Value of Suffix Trees

O(n²) --> O(n), how?

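In 1995, Esko Ukkonen published the paper "On-line construction of suffix trees", which describes how to build a suffix tree in linear time. The references below walk through the basic ideas of Ukkonen's algorithm, starting with simple strings and then extending to more complex cases.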

Ref: http://www.cnblogs.com/gaochundong/p/suffix_tree.html

Ref: https://stackoverflow.com/questions/9452701/ukkonens-suffix-tree-algorithm-in-plain-english

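Since suffix arrays are generally preferable to suffix trees in practice, this article only covers the fast construction of suffix arrays. For Ukkonen's algorithm, see the links above.

Because the suffix tree is so useful, we would like to simplify its construction and make it more efficient.

Why it is so useful is shown by the following examples.

Uses of Suffix Trees

Check whether string o occurs in string S

Approach: build a suffix tree from S, then search for o the same way you would search for a string in a trie.

Principle: if o occurs in S, then o must be a prefix of some suffix of S.

Count how many times string T occurs in string S

Approach: build a suffix tree from S+'$'; the number of leaf nodes below the node for T is the number of occurrences (the occurrences share a branch, but they start at different positions in the string).

Principle: if T occurs twice in S, then S has two suffixes with T as a prefix, so the count falls out naturally.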
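Both queries above can be sketched with a naive, uncompressed suffix trie: walk down the characters of the pattern, then collect the leaves below the node reached. The function names and the dictionary-based node layout are assumptions for illustration, and positions are 0-based.

```python
def build_suffix_trie(text):
    """Uncompressed suffix trie; each leaf stores the suffix start position."""
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["start"] = start
    return root


def occurrences(s, pattern):
    """Start positions of `pattern` in `s` (empty list if it does not occur)."""
    node = build_suffix_trie(s + "$")
    for ch in pattern:                    # trie search: follow the pattern
        if ch not in node:
            return []
        node = node[ch]
    starts, stack = [], [node]            # every leaf below is one occurrence
    while stack:
        cur = stack.pop()
        for key, child in cur.items():
            if key == "start":
                starts.append(child)
            else:
                stack.append(child)
    return sorted(starts)


hits = occurrences("abab", "ab")
```

Here `hits` is `[0, 2]`: the pattern occurs (the list is non-empty) and it occurs twice (the list has two entries).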

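Longest Common Substring (LCS)

* Within a single string (longest repeated substring)

Approach: find the deepest non-leaf node.　　// Depth: the number of characters on the path from the root; the characters on the path to the deepest non-leaf node spell out the longest repeated substring.

For s=abab, the path to the deepest non-leaf node is "ab", so "ab" is the longest repeated substring.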
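A minimal sketch of the "deepest non-leaf node" idea, assuming an uncompressed suffix trie in which a branching node (two or more outgoing characters) plays the role of an internal node of the compressed tree. All names are illustrative.

```python
def build_suffix_trie(text):
    root = {}
    for start in range(len(text)):
        node = root
        for ch in text[start:]:
            node = node.setdefault(ch, {})
        node["start"] = start             # leaf marker
    return root


def longest_repeated_substring(s):
    root = build_suffix_trie(s + "$")
    best = ""
    stack = [(root, "")]                  # (node, characters read from the root)
    while stack:
        node, path = stack.pop()
        children = [(k, v) for k, v in node.items() if k != "start"]
        # A branching node has at least two leaves below it, so its path
        # occurs at least twice in s.
        if len(children) >= 2 and len(path) > len(best):
            best = path
        for ch, child in children:
            stack.append((child, path + ch))
    return best


lrs = longest_repeated_substring("abab")
```

For `abab` this yields `"ab"`, matching the example above; for `banana` it yields `"ana"`.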

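* Between two strings

Approach: insert S1# and S2$ into one suffix tree and find the deepest non-leaf node whose leaves include both a # leaf and a $ leaf.

For S1=abab and S2=aab, the path to the deepest such non-leaf node is "ab".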
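The two-string case can be sketched in a simplified form. Instead of inspecting # and $ leaves, this version marks every trie node with a bit mask recording which string's suffixes pass through it, which is an assumption made to keep the example short; a node marked by both strings corresponds to a common substring.

```python
def longest_common_substring(s1, s2):
    root = {"mask": 0, "next": {}}
    for bit, text in ((1, s1), (2, s2)):
        for start in range(len(text)):    # insert every suffix of `text`
            node = root
            for ch in text[start:]:
                node = node["next"].setdefault(ch, {"mask": 0, "next": {}})
                node["mask"] |= bit       # this path occurs in string `bit`
    best = ""
    stack = [(root, "")]
    while stack:
        node, path = stack.pop()
        for ch, child in node["next"].items():
            if child["mask"] == 3:        # reached by suffixes of both strings
                if len(path) + 1 > len(best):
                    best = path + ch
                stack.append((child, path + ch))
    return best


lcs = longest_common_substring("abab", "aab")
```

For `abab` and `aab` the result is `"ab"`, as in the example above.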

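Longest Palindrome (maximal palindromes)

Approach: (1) build a suffix tree from S+"$" and the reversed string+"#"; (2) find the deepest non-leaf node whose leaves include both a # leaf and a $ leaf. Note that a common substring of S and its reverse is a palindrome only if its two occurrences sit at mirrored positions, so that position check is needed as well.

Let s = cbaaba$ then s^r = abaabc#.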

Suffix Arrays

Construct Suffix Array

Drawbacks

Suffix trees consume a lot of space　　// a string has many suffixes, after all
The space is O(n), but the constant is quite big
Notice that if we indeed want to traverse an edge in O(1) time then we need an array of pointers of size |Σ| in each node

That is why we use suffix arrays: we lose some of the functionality, but we save space.

Understanding Suffix Arrays

Ref: http://cse.unl.edu/~lksoh/Classes/CSCE410_810_Fall03/sup7.html

Suffix Arrays vs Inverted Indexes

Inverted indices assume that the text can be seen as a sequence of words.

Other queries such as phrases are expensive to solve.

One approach to address this problem is to use the suffix string approach. The index sees the text as one long string.

Each position in the text is considered as a text suffix (i.e., a string that goes from the text position to the end of the text).

Each suffix is thus uniquely identified by its position.

This type of index allows us to answer efficiently more complex queries.

The main drawbacks of this approach are:

(a) costly construction process,

(b) the text must be readily available at query time,

Unless complex queries are an important issue, for word-based applications, inverted files still perform better.

Suffix Arrays vs Suffix Trees

Suffix arrays are a space efficient implementation of suffix trees.

Also, not all text positions need to be indexed. Index points are selected from the text, which point to the beginning of the text positions which will be retrievable.

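The original post illustrates this with the suffixes starting at positions 11, 19, 28, 33, 40, 50 and 60, first as a list and then, roughly, as a suffix tree.

A DFS over the suffix tree that visits children in lexicographic order yields the suffix array.

So the two are equivalent representations of the suffixes of a string; only the data structure differs.

In practice, though, only the index of each suffix needs to be stored (the green part in the original figure).

Simply an array containing all the pointers to the
text suffixes listed in lexicographical order.

From this index array combined with the original string, all the necessary information can be derived.

This saves space, but some functional convenience is lost, such as the fast lookup that the tree structure offers.

With a suffix array, a given suffix has to be found by binary search.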
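The description above can be sketched as follows: the suffix array is just the list of start positions sorted by suffix, and a pattern is located with two binary searches over it. This uses a naive sort-based construction (not a linear-time one), 0-based positions, and illustrative function names.

```python
def build_suffix_array(text):
    """Naive construction: sort start positions by the suffix they point to."""
    return sorted(range(len(text)), key=lambda i: text[i:])


def find_occurrences(text, sa, pattern):
    """Binary search for the block of suffixes that start with `pattern`."""
    m = len(pattern)

    def first_index(pred):
        lo, hi = 0, len(sa)
        while lo < hi:
            mid = (lo + hi) // 2
            if pred(text[sa[mid]:sa[mid] + m]):
                hi = mid
            else:
                lo = mid + 1
        return lo

    first = first_index(lambda prefix: prefix >= pattern)
    last = first_index(lambda prefix: prefix > pattern)
    return sorted(sa[first:last])


text = "banana$"
sa = build_suffix_array(text)
```

For `banana$` the suffix array is `[6, 5, 3, 1, 0, 4, 2]`, and `find_occurrences(text, sa, "ana")` returns `[1, 3]`. Only the array and the original text are needed; each comparison reads the text through one pointer.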

But there is a problem:

If the suffix array is large, this binary search can
perform poorly because of the number of random disk accesses.

Suffix arrays are designed to allow binary searches
done by comparing the contents of each pointer.

To remedy this situation, the use of supra-indices over the suffix array has been proposed.

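Applications of Suffix Arrays

Longest Common Substring (LCS)

* Within a single string (longest repeated substring)

Comparing within a single string is convenient with a suffix tree (see above).

A suffix array trades that functionality for space savings, so what do we do now?

(1) Scan the sorted suffixes from left to right and take the longest common prefix of each adjacent pair.

(2) This is easy to parallelize.

* Between two strings

After concatenating the two strings, the same strategy applies, restricted to adjacent suffixes that come from different strings.

In fact, since the original string is kept anyway, all we need on top of it is a pointer-like array such as the suffix array.

A suffix tree takes more space, but a plain traversal is enough.

A suffix array achieves the same result with a traversal plus comparisons.

Other problems can be solved along the same lines.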
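A minimal sketch of the "traversal plus comparisons" idea for both cases, assuming a naive sort-based suffix array and that the separator characters `#` and `$` do not occur in the inputs. Function names are illustrative.

```python
def common_prefix_len(a, b):
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def longest_repeated_substring(text):
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    best = ""
    for prev, cur in zip(sa, sa[1:]):     # adjacent suffixes in sorted order
        k = common_prefix_len(text[prev:], text[cur:])
        if k > len(best):
            best = text[prev:prev + k]
    return best


def longest_common_substring(s1, s2):
    text = s1 + "#" + s2 + "$"
    boundary = len(s1)                    # positions below this belong to s1
    sa = sorted(range(len(text)), key=lambda i: text[i:])
    best = ""
    for prev, cur in zip(sa, sa[1:]):
        if (prev < boundary) == (cur < boundary):
            continue                      # both suffixes from the same string
        k = common_prefix_len(text[prev:], text[cur:])
        if k > len(best):
            best = text[prev:prev + k]
    return best


lrs = longest_repeated_substring("banana")
lcs = longest_common_substring("abab", "aab")
```

Here `lrs` is `"ana"` and `lcs` is `"ab"`, the same answers the suffix-tree approach gives.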

Supra-indices

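The idea of skip pointers is used to speed up queries.

A supra-index essentially samples some sub-strings of the text.

Note that each such (arbitrary) sub-string is really a prefix of some suffix of the text!

The supra-index therefore quickly narrows the search down to an approximate interval, and a finer binary search is then run inside that interval.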
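One possible reading of this two-level search, as a sketch: sample every `step`-th suffix array entry, keep the first `width` characters of the suffix it points to, search that small list first, and then binary search only the narrowed window of the full array. The parameters `step` and `width` and all function names are assumptions for illustration.

```python
def first_true(lo, hi, pred):
    """Smallest index in [lo, hi) where the monotone predicate holds, else hi."""
    while lo < hi:
        mid = (lo + hi) // 2
        if pred(mid):
            hi = mid
        else:
            lo = mid + 1
    return lo


def build_supra_index(text, sa, step, width):
    """Short prefixes of every `step`-th suffix, in suffix array order."""
    return [text[sa[i]:sa[i] + width] for i in range(0, len(sa), step)]


def search(text, sa, supra, step, width, pattern):
    q = pattern[:width]
    cut = len(q)
    # Coarse step: only the small supra-index is touched.
    j_lo = first_true(0, len(supra), lambda j: supra[j][:cut] >= q)
    j_hi = first_true(0, len(supra), lambda j: supra[j][:cut] > q)
    lo = max(0, (j_lo - 1) * step)
    hi = min(len(sa), j_hi * step)
    # Fine step: binary search inside the narrowed window of the suffix array.
    m = len(pattern)
    first = first_true(lo, hi, lambda i: text[sa[i]:sa[i] + m] >= pattern)
    last = first_true(first, hi, lambda i: text[sa[i]:sa[i] + m] > pattern)
    return sorted(sa[first:last])


text = "mississippi$"
sa = sorted(range(len(text)), key=lambda i: text[i:])
supra = build_supra_index(text, sa, step=3, width=2)
```

With `step=3` and `width=2`, `supra` is `["$", "is", "pi", "si"]`, and `search(text, sa, supra, 3, 2, "ssi")` returns `[2, 5]` (0-based) after looking at only the last three suffix array entries.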

Fast Construction of Suffix Arrays

Space Efficient Linear Time Construction of Suffix Arrays - Pang Ko and Srinivas Aluru

Video: https://www.youtube.com/watch?v=m2-N853rS6U

Algorithm: Difference Cover modulo 3 - O(n)

Suffix Array <--> BWT Matrix

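(1) Set of suffixes --(sort)--> suffix array --(suffix array - 1)--> BWT

The sorted suffixes form the BWT matrix (more precisely, the half of each row that the original figure shows in non-green text).

BWT matrix, first column: the suffix array, which records the index rather than the char.

BWT matrix, last column: the BWT, obtained via suffix array - 1. For example, with 1-based positions, 6-1=5 and "BANANA$"[i=5]=N.

(2) BWT --(C table)--> the ordering of the chars, i.e. an index into the suffix array

So if the suffix array can be built in O(n), the BWT follows directly from the SA and can therefore also be built in O(n).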
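The "suffix array - 1" conversion and the C table can be sketched as below. The code uses 0-based positions (the text above uses 1-based ones) and a naive sort for the suffix array; the function names are illustrative.

```python
from collections import Counter


def bwt_from_suffix_array(text, sa):
    """BWT char for each sorted suffix: the char just before its start.

    For the suffix starting at 0, text[-1] wraps around to the final '$'.
    """
    return "".join(text[i - 1] for i in sa)


def c_table(text):
    """C[ch] = number of characters in `text` that are smaller than ch."""
    counts = Counter(text)
    table, total = {}, 0
    for ch in sorted(counts):
        table[ch] = total
        total += counts[ch]
    return table


text = "BANANA$"
sa = sorted(range(len(text)), key=lambda i: text[i:])
bwt = bwt_from_suffix_array(text, sa)
```

For `BANANA$` this gives `bwt == "ANNB$AA"` and `c_table(text) == {"$": 0, "A": 1, "B": 4, "N": 5}`. Given the suffix array, the conversion is a single linear pass.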

Worked Example (O(n))

string = "MISSISSIPPI$",

Task: construct its suffix array.

Step 1, mark L and S

type	L	S	L	L	S	L	L	S	L	L	L	L/S
index	1	2	3	4	5	6	7	8	9	10	11	12
text	M	I	S	S	I	S	S	I	P	P	I	$
distance	0	0	1	2	3	1	2	3	1	2	3	4
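The type and distance rows of the table can be reproduced with a short sketch. It assumes the usual definition (position i is type S if suffix i is smaller than suffix i+1, and type L otherwise), treats the final `$` as type S, and defines distance as the gap to the nearest S-type position strictly to the left (0 if there is none). The function names are illustrative.

```python
def classify_types(text):
    """Label each position 'L' or 'S', scanning from right to left."""
    n = len(text)
    types = ["S"] * n                     # the final '$' is treated as type S
    for i in range(n - 2, -1, -1):
        if text[i] < text[i + 1]:
            types[i] = "S"
        elif text[i] > text[i + 1]:
            types[i] = "L"
        else:
            types[i] = types[i + 1]       # equal chars inherit the next type
    return types


def s_distances(types):
    """Distance to the closest S-type position strictly to the left."""
    dist, last_s = [], None
    for i, t in enumerate(types):
        dist.append(0 if last_s is None else i - last_s)
        if t == "S":
            last_s = i
    return dist


types = classify_types("MISSISSIPPI$")
dist = s_distances(types)
```

`"".join(types)` is `"LSLLSLLSLLLS"` and `dist` is `[0, 0, 1, 2, 3, 1, 2, 3, 1, 2, 3, 4]`, which matches the table row by row.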

Step 2, create bucket

Bucket (in sorted order)
char	$	I				M	P		S
index	12	2	5	8	11	1	9	10	3	4	6	7
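The bucket step amounts to grouping positions by their first character and listing the groups in character order. A minimal sketch, using 1-based positions to match the table and an illustrative function name:

```python
def bucket_by_first_char(text):
    """Map each character to the 1-based positions where it occurs."""
    buckets = {}
    for i, ch in enumerate(text, start=1):
        buckets.setdefault(ch, []).append(i)
    return dict(sorted(buckets.items()))  # buckets in character order


buckets = bucket_by_first_char("MISSISSIPPI$")
```

The result is `{"$": [12], "I": [2, 5, 8, 11], "M": [1], "P": [9, 10], "S": [3, 4, 6, 7]}`. Positions inside a bucket are not yet in their final order; the later steps settle that.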

Step 3, D lists

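D lists	0	not considered	[2] [1]
	1	[9] [3, 6]	the chars at 3 and 6 are both 'S'; their order cannot be decided yet
	2	[10] [4, 7]	the chars at 4 and 7 are both 'S'; their order cannot be decided yet
	3	[5, 8, 11]	the chars are all 'I'; their order cannot be decided yet
	4	[12]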
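The D lists can be rebuilt from the distance row: positions with the same non-zero distance go into one list, split into groups by their character in bucket order. This is a sketch of that grouping only; the function name is illustrative, and the distances are copied from the Step 1 table.

```python
def d_lists(text, distances):
    """For each distance d > 0, the 1-based positions grouped by character."""
    lists = {}
    for i, d in enumerate(distances, start=1):
        if d > 0:                          # distance 0 is not considered
            lists.setdefault(d, {}).setdefault(text[i - 1], []).append(i)
    return {
        d: [groups[ch] for ch in sorted(groups)]
        for d, groups in sorted(lists.items())
    }


# Distances taken from the Step 1 table above.
distances = [0, 0, 1, 2, 3, 1, 2, 3, 1, 2, 3, 4]
result = d_lists("MISSISSIPPI$", distances)
```

`result` is `{1: [[9], [3, 6]], 2: [[10], [4, 7]], 3: [[5, 8, 11]], 4: [[12]]}`, the same grouping as the table.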

Step 4, S-Substring

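S-type entries: 2, 5, 8, 12.

This step refines rows 3 and 4 of Step 3, mainly row 3.

2	5	8	all 'I'
3	6	9	9 is 'P' and sorts first, so 8 needs no further comparison
4	7		both 'S'
5	8		both 'I'
6	9		6 is 'S' and 9 is 'P', so 5 sorts before 2

Result: 8, 5, 2

11 is also 'I', but 11 is type L while 2, 5 and 8 are type S. By Rule 1:

　　Within one bucket (which corresponds to a single char), type L entries go to the left of type S entries.

Bucket
char	$	I				M	P	S
index	12	11	8	5	2

This is the part filled in so far; it is the starting point for the next step.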

Step 5, complete others

Bucket
	$	I				M	P		S
	12	11	8	5	2	1	10	9	7	4	6	3
Rule 2	11	10	7	4	1		9	8	7	4	5

Finally, Suffix Array: [12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]
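Step 5 can be sketched as an induced sort: once the S-type suffixes are in order, place them at the ends of their buckets, then scan the array from left to right and, for every entry j whose predecessor j-1 is type L, drop j-1 into the next free slot at the front of its bucket. As a stand-in for Step 4, the S-type suffixes are sorted naively here, so this sketch is not linear time; the types are copied from Step 1, positions are 0-based inside the code, and all names are illustrative.

```python
def induce_from_sorted_s(text, types, sorted_s):
    n = len(text)
    head, tail, pos = {}, {}, 0
    for ch in sorted(set(text)):           # bucket boundaries per character
        head[ch] = pos
        pos += text.count(ch)
        tail[ch] = pos - 1
    sa = [None] * n
    for j in reversed(sorted_s):           # S-type entries fill bucket ends
        sa[tail[text[j]]] = j
        tail[text[j]] -= 1
    for k in range(n):                     # left-to-right scan (Rule 2)
        j = sa[k]
        if j is not None and j > 0 and types[j - 1] == "L":
            ch = text[j - 1]
            sa[head[ch]] = j - 1           # L-type entries fill bucket fronts
            head[ch] += 1
    return sa


text = "MISSISSIPPI$"
types = "LSLLSLLSLLLS"                     # from Step 1
s_positions = [i for i, t in enumerate(types) if t == "S"]
sorted_s = sorted(s_positions, key=lambda i: text[i:])   # naive stand-in for Step 4
sa = induce_from_sorted_s(text, types, sorted_s)
one_based = [i + 1 for i in sa]
```

`one_based` comes out as `[12, 11, 8, 5, 2, 1, 10, 9, 7, 4, 6, 3]`, the suffix array stated above, and it agrees with sorting all suffixes directly.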

-----------------------------------------------------------------

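About Rule 1:

In the example, $ sorts before P in lexicographic order, so I$ (index 11, type L) comes before IPPI$ (index 8, type S). In general, within one bucket the type L entries always come before the type S entries.

About Rule 2:

If two 'smaller' suffixes have already been compared, and each is extended at the front by one char, and that char is the same for both, then

the order of the 'smaller' suffixes decides the order of the 'larger' suffixes.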

End.
