Lucene's underlying data structures: FST, column-stride storage for fields, delta-encoded compression of doc id arrays, and the LZ4 compression algorithm

Date: 2019-11-16


References:

http://www.slideshare.net/lucenerevolution/what-is-inaluceneagrandfinal

http://www.slideshare.net/jpountz/how-does-lucene-store-your-data

http://www.infoq.com/cn/articles/database-timestamp-02?utm_source=infoq&utm_medium=related_content_link&utm_campaign=relatedContent_articles_clk

Some important excerpts:

Let's look at how Lucene's inverted index is built.

Take a concrete example. Suppose we have the following data:

docid	Age	Gender
1	18	Female
2	20	Female
3	18	Male

Each row here is a document, and every document has a docid. The inverted index built for these documents is:

Age

18	[1,3]
20	[2]

Gender

Female	[1,2]
Male	[3]

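As you can see, the inverted index is per field: each field has its own inverted index. Values such as 18 and 20 are called terms, and [1,3] is a posting list. A posting list is simply an array of ints that stores the ids of all documents matching a given term.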
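To make the per-field structure concrete, here is a minimal sketch that builds the index above in memory. The field names, the English field values, and the function `build_inverted_index` are illustrative assumptions, not Lucene APIs.

```python
from collections import defaultdict

# The three example documents from the table above (values are English stand-ins).
docs = {
    1: {"age": "18", "gender": "female"},
    2: {"age": "20", "gender": "female"},
    3: {"age": "18", "gender": "male"},
}

def build_inverted_index(docs):
    """Return {field: {term: sorted list of docids}} (hypothetical helper)."""
    index = defaultdict(lambda: defaultdict(list))
    for docid in sorted(docs):  # visiting docids in ascending order keeps lists sorted
        for field, term in docs[docid].items():
            index[field][term].append(docid)
    return index

index = build_inverted_index(docs)
print(index["age"]["18"])         # posting list for the term 18 in field "age"
print(index["gender"]["female"])  # posting list for the term "female" in field "gender"
```

Each field gets its own term-to-posting-list map, which is what "per field" means here.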

So what are the term dictionary and the term index?

Suppose we have many terms, for example:

Carla,Sara,Elin,Ada,Patty,Kate,Selena

Kept in this order, finding a specific term is bound to be slow: the terms are unsorted, so every one of them has to be scanned. After sorting, the list becomes:

Ada,Carla,Elin,Kate,Patty,Sara,Selena

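Now we can use binary search to find the target term faster than a full scan. This sorted list is the term dictionary. With a term dictionary, the target can be found in about log N disk lookups. But random disk reads are still very expensive (one random access takes roughly 10 ms), so to read the disk as little as possible, some data has to be cached in memory. The whole term dictionary, however, is too large to fit in memory. Hence the term index. A term index is a bit like the chapter table of a dictionary. For example: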

Terms starting with A ……………. page Xxx

Terms starting with C ……………. page Xxx

Terms starting with E ……………. page Xxx
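A minimal sketch of this "chapter table" idea, assuming the sorted term dictionary is a plain Python list and the index maps a first letter to an offset. The names `build_first_letter_index` and `lookup` are hypothetical; Lucene's real term index is more elaborate.

```python
from bisect import bisect_left

# Sorted term dictionary (imagine it lives on disk).
term_dict = ["Ada", "Carla", "Elin", "Kate", "Patty", "Sara", "Selena"]

def build_first_letter_index(terms):
    """Map each first letter to the offset of its first term (the 'page number')."""
    index = {}
    for offset, term in enumerate(terms):
        index.setdefault(term[0], offset)
    return index

def lookup(term, terms, letter_index):
    """Return the offset of `term` in `terms`, or -1 if it is absent."""
    start = letter_index.get(term[0])
    if start is None:
        return -1  # no term starts with this letter: fail without touching the dictionary
    # Binary search only from the "page" the index pointed at.
    pos = bisect_left(terms, term, lo=start)
    return pos if pos < len(terms) and terms[pos] == term else -1

letter_index = build_first_letter_index(term_dict)
print(lookup("Sara", term_dict, letter_index))
print(lookup("Xena", term_dict, letter_index))
```

The small index stays in memory, and only the narrowed-down part of the dictionary has to be read.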

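If every term consisted of English letters, the term index might really be just a table of the 26 letters. In practice, terms are not necessarily English letters; a term can be an arbitrary byte array. Nor are terms spread evenly across the 26 letters: there may be no term starting with x at all, while a great many start with s. The actual term index is a trie: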

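The example is a trie containing "A", "to", "tea", "ted", "ten", "i", "in", and "inn". This tree does not contain every term; it contains some prefixes of terms. The term index lets you quickly locate an offset in the term dictionary and then scan sequentially from that position. Add some compression techniques (search for Lucene Finite State Transducers) and the term index can be a few dozen times smaller than the total size of all terms, which makes it feasible to cache the whole term index in memory. Overall, a lookup goes from the term index to the term dictionary and then to the posting list.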
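The trie from the example can be sketched with nested dictionaries. This is only an illustration of the lookup behaviour (exact match and prefix check); the `END` marker and the function names are assumptions, and Lucene's FST is a far more compact structure.

```python
END = None  # marker key for "a word ends here"; a character can never equal None

def build_trie(words):
    """Build a trie as nested dicts, one level per character."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root

def find_node(trie, prefix):
    """Walk the trie along `prefix`; return the node reached, or None."""
    node = trie
    for ch in prefix:
        node = node.get(ch)
        if node is None:
            return None
    return node

def contains(trie, word):
    node = find_node(trie, word)
    return node is not None and END in node

def has_prefix(trie, prefix):
    return find_node(trie, prefix) is not None

trie = build_trie(["A", "to", "tea", "ted", "ten", "i", "in", "inn"])
print(contains(trie, "tea"), contains(trie, "te"), has_prefix(trie, "te"))
```

A failed prefix walk means no term can match, so the lookup can stop before reading the term dictionary.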

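Now we can answer why Elasticsearch/Lucene retrieval can be faster than MySQL. MySQL has only the term dictionary layer, stored on disk in sorted order as a B-tree, so looking up a term takes several random-access disk operations. Lucene adds a term index on top of the term dictionary to speed up lookups, and the term index is cached in memory as a tree. Once the term index yields the position of the matching block in the term dictionary, the term is looked up on disk from there, which greatly reduces the number of random disk accesses.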

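Two further points are worth mentioning. The term index is kept in memory as an FST (finite state transducer), whose main property is that it is very memory-efficient. The term dictionary is stored on disk in blocks, and each block is compressed using its common prefix: if every word in a block starts with Ab, the Ab can be omitted. This lets the term dictionary use less disk space than a B-tree.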
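The shared-prefix idea inside one block can be sketched as follows. The block contents and the functions `compress_block` and `decompress_block` are made up for illustration; this is not Lucene's on-disk format.

```python
import os

def compress_block(sorted_terms):
    """Store the block's common prefix once, plus each term's remaining suffix."""
    prefix = os.path.commonprefix(sorted_terms)  # works character by character
    return prefix, [t[len(prefix):] for t in sorted_terms]

def decompress_block(prefix, suffixes):
    """Rebuild the original terms from the prefix and the suffixes."""
    return [prefix + s for s in suffixes]

block = ["Abacus", "Abbey", "Able", "Abstract"]  # hypothetical block of terms
prefix, suffixes = compress_block(block)
print(prefix, suffixes)
```

The longer the shared prefix and the more terms in the block, the more bytes are saved.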

--------------------------------------------------------

Lucene does not use a tree structure:
– sorted for range queries
– O(log(n)) search

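Instead, it relies on the following core data structures and techniques: FST, delta-encoded compressed arrays, column-stride storage, and LZ4 compression:
●Terms index: map a term prefix to a block in the dict
○ FST: automaton with weighted arcs, compact thanks to shared prefixes/suffixes
– This is the core data structure: essentially a state machine that shares prefixes and suffixes. Like a trie, it checks whether a word entered by the user can be found; if it can, the lookup jumps to the terms dictionary. The result of the lookup is the word's offset in the terms dict (essentially an array offset).
• Lookup the term in the terms index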
– In-memory FST storing terms prefixes
– Gives the offset to look at in the terms dictionary
– Can fast-fail if no terms have this prefix
●Terms dictionary: statistics + pointer in postings lists, Store terms and documents in arrays – binary search
• Jump to the given offset in the terms dictionary
– compressed based on shared prefixes, similarly to a burst trie
– called the "BlockTree terms dict"
• read sequentially until the term is found
●Postings lists: encodes matching docs in sorted order ○ + positions + offsets (the inverted doc IDs all live here)
• Jump to the given offset in the postings lists
• Encoded using modified FOR (Frame of Reference) delta
– 1. delta-encode
– 2. split into block of N=128 values
– 3. bit packing per block
– 4. if remaining docs, encode with vInt
●Stored fields
• In-memory index for a subset of the doc ids
– memory-efficient thanks to monotonic compression
– searched using binary search
• Stored fields
– stored sequentially
– compressed (LZ4) in 16+KB blocks
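The four postings-list encoding steps listed above (delta-encode, split into blocks of 128, bit-pack each block, vInt for the remainder) can be sketched as below. This is a simplified illustration under stated assumptions: a packed block is held as one Python integer plus its bit width, and all function names are hypothetical. It is not Lucene's actual codec.

```python
BLOCK_SIZE = 128

def delta_encode(doc_ids):
    """Replace each sorted doc id with its gap from the previous one."""
    prev, deltas = 0, []
    for d in doc_ids:
        deltas.append(d - prev)
        prev = d
    return deltas

def pack_block(values):
    """Bit-pack one block: every value uses the bit width of the largest one."""
    bits = max(max(values).bit_length(), 1)
    packed = 0
    for i, v in enumerate(values):
        packed |= v << (i * bits)
    return bits, packed

def unpack_block(bits, packed, count):
    mask = (1 << bits) - 1
    return [(packed >> (i * bits)) & mask for i in range(count)]

def vint_encode(value):
    """Variable-length int: 7 data bits per byte, high bit set means more bytes follow."""
    out = bytearray()
    while value >= 0x80:
        out.append((value & 0x7F) | 0x80)
        value >>= 7
    out.append(value)
    return bytes(out)

def vint_decode(data):
    values, cur, shift = [], 0, 0
    for b in data:
        cur |= (b & 0x7F) << shift
        if b & 0x80:
            shift += 7
        else:
            values.append(cur)
            cur, shift = 0, 0
    return values

def encode_postings(doc_ids):
    deltas = delta_encode(doc_ids)
    full = len(deltas) - len(deltas) % BLOCK_SIZE  # deltas covered by full blocks
    blocks = [pack_block(deltas[i:i + BLOCK_SIZE]) for i in range(0, full, BLOCK_SIZE)]
    tail = b"".join(vint_encode(d) for d in deltas[full:])
    return blocks, tail

def decode_postings(blocks, tail):
    deltas = []
    for bits, packed in blocks:
        deltas.extend(unpack_block(bits, packed, BLOCK_SIZE))
    deltas.extend(vint_decode(tail))
    doc_ids, total = [], 0
    for d in deltas:  # a running sum undoes the delta encoding
        total += d
        doc_ids.append(total)
    return doc_ids

doc_ids = list(range(3, 903, 3))         # 300 sorted doc ids
blocks, tail = encode_postings(doc_ids)  # 2 packed blocks, 44 deltas left for vInt
assert decode_postings(blocks, tail) == doc_ids
```

Because the gaps between sorted doc ids are small, each packed value needs only a few bits instead of a full 32-bit int.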

Query execution：
• 2 disk seeks per field for search
• 1 disk seek per doc for stored fields
• It is common that the terms dict / postings lists fits into the file-system cache
• "Pulse" optimization
– For unique terms (freq=1), postings are inlined in the terms dict
– Only 1 disk seek
– Will always be used for your primary keys

Inserting new data:
Insertion = write a new segment (always writing a new segment avoids the need for locks)
• Merge segments when there are too many of them
– concatenate docs, merge terms dicts and postings lists (merge sort!)
Deletion:
Deletion = turn a bit off
• Ignore deleted documents when searching and merging (reclaims space)
• Merge policies favor segments with many deletions
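The insert, delete, and merge behaviour described above can be sketched with a toy model. The `Segment` class, `merge`, and `merge_terms` are hypothetical names for illustration, and documents are reduced to plain strings.

```python
import heapq

class Segment:
    """A batch of docs written once; only the live-docs bits ever change."""
    def __init__(self, docs):
        self.docs = list(docs)               # never modified after creation
        self.live = [True] * len(self.docs)  # one bit per doc

    def delete(self, ord_):
        self.live[ord_] = False              # deletion = turn a bit off

    def live_docs(self):
        return [d for d, alive in zip(self.docs, self.live) if alive]

def merge(segments):
    """Write a new segment from the live docs of the inputs (space is reclaimed here)."""
    merged = []
    for seg in segments:
        merged.extend(seg.live_docs())
    return Segment(merged)

def merge_terms(*sorted_term_lists):
    """Merge already-sorted term lists in one pass (merge sort), dropping duplicates."""
    merged = []
    for term in heapq.merge(*sorted_term_lists):
        if not merged or merged[-1] != term:
            merged.append(term)
    return merged

s1 = Segment(["doc-a", "doc-b"])  # first insertion writes one segment
s2 = Segment(["doc-c"])           # a later insertion writes another
s1.delete(0)                      # doc-a is only marked as deleted
s3 = merge([s1, s2])              # doc-a disappears for good during the merge
print(s3.docs)
print(merge_terms(["ada", "kate"], ["carla", "kate", "sara"]))
```

Since existing segments are never rewritten, readers need no locks, and deleted documents cost space only until the next merge.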

Pros and cons:
• Updates require writing a new segment
– single-doc updates are costly, bulk updates preferred
– writes are sequential
• Segments are never modified in place
– filesystem-cache-friendly
– lock-free!
• Terms are deduplicated
– saves space for high-freq terms
• Docs are uniquely identified by an ord
– useful for cross-API communication
– Lucene can use several indexes in a single query
• Terms are uniquely identified by an ord
– important for sorting: compare longs, not strings
– important for faceting (more on this later)
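Why term ordinals help with sorting and faceting can be shown with a small sketch. The sample names and the helper `build_ordinals` are assumptions for illustration only.

```python
def build_ordinals(values):
    """Assign each distinct term its rank in sorted order."""
    terms = sorted(set(values))
    ord_of = {term: i for i, term in enumerate(terms)}
    return terms, ord_of

# One value per document, indexed by doc ord (hypothetical data).
names = ["Kate", "Ada", "Sara", "Ada", "Carla"]
terms, ord_of = build_ordinals(names)
doc_ords = [ord_of[n] for n in names]  # per-doc integer ordinals

# Sorting documents compares small integers instead of strings.
sorted_docs = sorted(range(len(names)), key=doc_ords.__getitem__)

# Faceting: count documents per term with a plain integer-indexed array.
counts = [0] * len(terms)
for o in doc_ords:
    counts[o] += 1

print(sorted_docs)
print(dict(zip(terms, counts)))
```

Because ordinals follow the sorted term order, comparing two ordinals gives the same result as comparing the two strings.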

Column-stride storage for fields:
• Per doc and per field single numeric values, stored in a column-stride fashion
• Useful for sorting and custom scoring
• Norms are numeric doc values
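A minimal sketch of the column-stride layout, assuming made-up fields `age` and `price` and a hypothetical helper `to_columns`: each field becomes one contiguous array indexed by doc id, so sorting by a field reads only that field's array.

```python
from array import array

# Row-oriented view: one record per document (hypothetical data).
rows = [
    {"age": 18, "price": 30},
    {"age": 20, "price": 10},
    {"age": 18, "price": 20},
]

def to_columns(rows, fields):
    """Store each field as one contiguous array of 64-bit ints indexed by doc id."""
    return {f: array("q", (r[f] for r in rows)) for f in fields}

columns = to_columns(rows, ["age", "price"])

# Sorting by a field touches only that field's column.
by_price = sorted(range(len(rows)), key=columns["price"].__getitem__)
print(by_price)
```

The same per-document lookup by doc id is what custom scoring needs when it reads a numeric value such as a norm.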

Some design principles:
• Save file handles
– don’t use one file per field or per doc
• Avoid disk seeks whenever possible
– disk seek on spinning disk is ~10 ms
• BUT don’t ignore the filesystem cache
– random access in small files is fine
• Light compression helps
– less I/O
– smaller indexes
– filesystem-cache-friendly

Data structures for compression techniques: FSTs, LZ4