References:
http://www.slideshare.net/lucenerevolution/what-is-inaluceneagrandfinal
http://www.slideshare.net/jpountz/how-does-lucene-store-your-data
http://www.infoq.com/cn/articles/database-timestamp-02?utm_source=infoq&utm_medium=related_content_link&utm_campaign=relatedContent_articles_clk
Some important excerpts:
Let's look at how Lucene's inverted index is constructed.
Take a concrete example. Suppose we have the following data:
docid | age | gender
1 | 18 | female
2 | 20 | female
3 | 18 | male
Each row here is a document, and every document has a docid. The inverted index built for these documents is:
Age
18 | [1,3]
20 | [2]
Gender
female | [1,2]
male | [3]
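As you can see, the inverted index is per field: each field has its own inverted index. Values such as 18 and 20 are called terms, and [1,3] is a posting list. A posting list is an array of ints that stores the ids of all documents matching a given term.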
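The per-field structure described above can be sketched in a few lines of Python. This is only an illustration of the idea, not how Lucene stores it on disk; the field names `age` and `gender` and the helper `build_inverted_index` are assumptions made for this example.

```python
from collections import defaultdict

# Sample documents from the table above; field names are translated for illustration.
docs = {
    1: {"age": 18, "gender": "female"},
    2: {"age": 20, "gender": "female"},
    3: {"age": 18, "gender": "male"},
}

def build_inverted_index(docs):
    """Return {field: {term: posting list}} with one inverted index per field."""
    index = defaultdict(lambda: defaultdict(list))
    for docid in sorted(docs):  # visiting docids in ascending order keeps posting lists sorted
        for field, term in docs[docid].items():
            index[field][term].append(docid)
    return index

index = build_inverted_index(docs)
print(index["age"][18])           # posting list for term 18 in the age field
print(index["gender"]["female"])  # posting list for term "female" in the gender field
```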
So what are the term dictionary and the term index?
Suppose we have many terms, for example:
Carla,Sara,Elin,Ada,Patty,Kate,Selena
In this order, finding a particular term is bound to be slow: the terms are unsorted, so every one of them has to be checked. After sorting, the list becomes:
Ada,Carla,Elin,Kate,Patty,Sara,Selena
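Once the terms are sorted, a lookup no longer needs a full scan. A minimal sketch using Python's standard `bisect` module; the function name `find_term` is an assumption for this example, and the list lives in memory here, whereas the real term dictionary lives on disk.

```python
import bisect

terms = ["Carla", "Sara", "Elin", "Ada", "Patty", "Kate", "Selena"]
term_dictionary = sorted(terms)

def find_term(term_dictionary, term):
    """Binary search: return the position of term in the sorted list, or -1 if absent."""
    i = bisect.bisect_left(term_dictionary, term)
    if i < len(term_dictionary) and term_dictionary[i] == term:
        return i
    return -1

print(find_term(term_dictionary, "Kate"))  # 3
print(find_term(term_dictionary, "Bob"))   # -1
```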
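With the terms sorted, binary search finds the target term faster than a full scan. This sorted list is the term dictionary. With a term dictionary, the target can be found in about log N disk lookups. Random disk reads are still very expensive, though (one random access takes roughly 10 ms). To read the disk as little as possible, some data has to be cached in memory. But the term dictionary as a whole is too large to fit in memory, which is why the term index exists. The term index is a bit like the chapter table of a dictionary. For example:
Terms starting with A ……………. page Xxx
Terms starting with C ……………. page Xxx
Terms starting with E ……………. page Xxx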
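If every term consisted of English letters, the term index could really be a table of the 26 letters. In practice, terms are not necessarily English: a term can be an arbitrary byte array. The 26 letters also do not carry equal numbers of terms. There may be no term starting with x at all, while terms starting with s are especially numerous. The actual term index is a trie: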
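The example is a trie containing "A", "to", "tea", "ted", "ten", "i", "in", and "inn". The tree does not contain every term; it contains some prefixes of the terms. The term index quickly locates an offset in the term dictionary, and the search then proceeds sequentially from that position. With some additional compression (search for Lucene Finite State Transducers), the term index can be a few dozen times smaller than all the terms combined, which makes it possible to cache the whole term index in memory. Overall, a lookup flows from the term index to the term dictionary and then to the posting list.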
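The two-level lookup can be sketched as follows. This is a deliberate simplification: instead of a trie or FST over prefixes, the in-memory index here is just a sorted list holding the first term of each block, and `BLOCK_SIZE`, `build`, and `lookup` are names invented for this example.

```python
import bisect

BLOCK_SIZE = 3  # hypothetical value, kept small for illustration

def build(terms):
    dictionary = sorted(terms)                        # stands in for the on-disk term dictionary
    offsets = list(range(0, len(dictionary), BLOCK_SIZE))
    index_keys = [dictionary[o] for o in offsets]     # stands in for the in-memory term index
    return dictionary, index_keys, offsets

def lookup(dictionary, index_keys, offsets, term):
    """Use the small index to pick a block, then scan that block sequentially."""
    b = bisect.bisect_right(index_keys, term) - 1
    if b < 0:
        return -1  # term sorts before every indexed key: fail without touching the dictionary
    start = offsets[b]
    for pos in range(start, min(start + BLOCK_SIZE, len(dictionary))):
        if dictionary[pos] == term:
            return pos
    return -1

dictionary, index_keys, offsets = build(
    ["Carla", "Sara", "Elin", "Ada", "Patty", "Kate", "Selena"]
)
print(lookup(dictionary, index_keys, offsets, "Patty"))  # 4
```

Only `index_keys` and `offsets` need to stay in memory; each lookup then touches a single block of the dictionary.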
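Now we can answer the question "why can Elasticsearch/Lucene retrieval be faster than MySQL?" MySQL has only the term dictionary layer, stored on disk in sorted B-tree form. Looking up a term takes several random-access disk operations. Lucene adds a term index on top of the term dictionary to speed up lookups, and the term index is cached in memory as a tree. Once the term index gives the block position in the term dictionary, Lucene goes to disk to find the term, which greatly reduces the number of random disk accesses.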
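Two more points are worth mentioning. The term index is kept in memory as an FST (finite state transducer), which is very memory-efficient. The term dictionary is stored on disk in blocks, and each block is compressed using its common prefix: if all words in a block start with Ab, the Ab can be omitted. This lets the term dictionary use less disk space than a B-tree.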
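The shared-prefix idea for a single block can be illustrated like this. It is a sketch of the principle only, not Lucene's BlockTree format; `compress_block` and `decompress_block` are hypothetical helpers, and the sample words are made up.

```python
import os

def compress_block(terms):
    """Split a block of terms into (shared prefix, list of suffixes)."""
    prefix = os.path.commonprefix(terms)  # longest character-wise common prefix
    return prefix, [t[len(prefix):] for t in terms]

def decompress_block(prefix, suffixes):
    """Rebuild the original terms from the shared prefix and the suffixes."""
    return [prefix + s for s in suffixes]

block = ["Abandon", "Abbey", "Able", "Abstract"]
prefix, suffixes = compress_block(block)
print(prefix)    # Ab
print(suffixes)  # ['andon', 'bey', 'le', 'stract']
```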
--------------------------------------------------------
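Lucene does not use a tree structure
– sorted for range queries
– O(log(n)) search
Instead it relies on the following core data structures: FST, delta-encoded compressed arrays, column storage, and the LZ4 compression algorithm: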
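●Terms index: map a term prefix to a block in the dict ○ FST: automaton with weighted arcs, compact thanks to shared prefixes/suffixes. This is the core data structure: essentially a state machine that shares prefixes and suffixes. Like a trie, it checks whether a word entered by the user can be found. If it can, the lookup jumps into the terms dictionary, and the result is the word's offset in the terms dict (essentially an array offset).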
Lookup the term in the terms index
– In-memory FST storing terms prefixes
– Gives the offset to look at in the terms dictionary
– Can fast-fail if no terms have this prefix
●Terms dictionary: statistics + pointer in postings lists, Store terms and documents in arrays – binary search
• Jump to the given offset in the terms dictionary
– compressed based on shared prefixes, similarly to a burst trie
– called the "BlockTree terms dict"
• read sequentially until the term is found
●Postings lists: encodes matching docs in sorted order ○ + positions + offsets (the doc IDs of the inverted index all live here)
• Jump to the given offset in the postings lists
• Encoded using modified FOR (Frame of Reference) delta
– 1. delta-encode
– 2. split into block of N=128 values
– 3. bit packing per block
– 4. if remaining docs, encode with vInt
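The four steps above can be sketched as follows. This shows the general technique (delta encoding, fixed-width bit packing per block, variable-length ints for the tail) and is not Lucene's actual byte layout; `vint`, `pack_block`, `unpack_block`, and `encode_postings` are names invented for this example.

```python
BLOCK = 128  # block size from step 2

def vint(n):
    """Encode a non-negative int in 7-bit groups; the high bit marks 'more bytes follow'."""
    out = bytearray()
    while n >= 0x80:
        out.append((n & 0x7F) | 0x80)
        n >>= 7
    out.append(n)
    return bytes(out)

def pack_block(values):
    """Bit-pack a block: every value is stored with the width of the largest one."""
    bits = max(max(values).bit_length(), 1)
    packed = 0
    for i, v in enumerate(values):
        packed |= v << (i * bits)
    nbytes = (bits * len(values) + 7) // 8
    return bits, packed.to_bytes(nbytes, "little")

def unpack_block(bits, data, count=BLOCK):
    """Inverse of pack_block, used here to check the round trip."""
    packed = int.from_bytes(data, "little")
    mask = (1 << bits) - 1
    return [(packed >> (i * bits)) & mask for i in range(count)]

def encode_postings(doc_ids):
    """doc_ids must be a sorted list of non-negative ints."""
    deltas = [b - a for a, b in zip([0] + doc_ids[:-1], doc_ids)]              # step 1
    full = len(deltas) - len(deltas) % BLOCK
    blocks = [pack_block(deltas[i:i + BLOCK]) for i in range(0, full, BLOCK)]  # steps 2-3
    tail = b"".join(vint(d) for d in deltas[full:])                            # step 4
    return blocks, tail

blocks, tail = encode_postings(list(range(0, 390, 3)))  # 130 doc ids: 0, 3, 6, ...
bits, data = blocks[0]
print(bits, len(data), len(tail))  # 2 32 2
```

In this example 130 doc ids fit in 34 bytes of payload, because after delta encoding every value is at most 3 and needs only 2 bits.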
●Stored fields
• In-memory index for a subset of the doc ids
– memory-efficient thanks to monotonic compression
– searched using binary search
• Stored fields
– stored sequentially
– compressed (LZ4) in 16+KB blocks
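A rough sketch of the stored-fields layout described above: documents are written sequentially into compressed blocks, and a small in-memory index of first doc ids is searched with binary search. Two substitutions are assumed here and should be read as such: `zlib` from the standard library stands in for LZ4, and blocks are cut by document count (`DOCS_PER_BLOCK`, a made-up constant) rather than by byte size. Doc ids are 0-based positions in this example.

```python
import bisect
import json
import zlib

DOCS_PER_BLOCK = 2  # hypothetical; the slides describe blocks of 16+ KB, not a doc count

def write_blocks(docs):
    """docs: list of dicts, docid = list position. Returns (first_ids, blocks)."""
    first_ids, blocks = [], []
    for start in range(0, len(docs), DOCS_PER_BLOCK):
        chunk = docs[start:start + DOCS_PER_BLOCK]
        first_ids.append(start)  # in-memory index entry for this block
        blocks.append(zlib.compress(json.dumps(chunk).encode("utf-8")))
    return first_ids, blocks

def read_doc(first_ids, blocks, docid):
    """Binary search the index, then decompress exactly one block."""
    b = bisect.bisect_right(first_ids, docid) - 1
    chunk = json.loads(zlib.decompress(blocks[b]))
    return chunk[docid - first_ids[b]]

stored = [
    {"age": 18, "gender": "female"},
    {"age": 20, "gender": "female"},
    {"age": 18, "gender": "male"},
]
first_ids, blocks = write_blocks(stored)
print(read_doc(first_ids, blocks, 2))
```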
Query execution:
• 2 disk seeks per field for search
• 1 disk seek per doc for stored fields
• It is common that the terms dict / postings lists fits into the file-system cache
• "Pulse" optimization
– For unique terms (freq=1), postings are inlined in the terms dict
– Only 1 disk seek
– Will always be used for your primary keys
Inserting new data:
Insertion = write a new segment (always writing new segments avoids the need for locks)
• Merge segments when there are too many of them
– concatenate docs, merge terms dicts and postings lists (merge sort!)
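Merging two segments can be pictured as a merge of sorted lists. In this sketch, which assumes each segment is simply a dict from term to a sorted posting list, the second segment's docs are concatenated after the first, so its doc ids are shifted by the first segment's size. `merge_segments` is a hypothetical helper, not a Lucene API.

```python
import heapq

def merge_segments(seg_a, size_a, seg_b):
    """Merge two {term: sorted docids} maps; seg_b docids are shifted by size_a."""
    merged = {}
    for term in sorted(set(seg_a) | set(seg_b)):      # merged terms dict stays sorted
        a = seg_a.get(term, [])
        b = [d + size_a for d in seg_b.get(term, [])]  # renumber docs of the second segment
        merged[term] = list(heapq.merge(a, b))         # merge step of merge sort
    return merged

seg_a = {"female": [0, 1], "male": [2]}  # segment with 3 docs
seg_b = {"male": [0], "other": [1]}      # segment with 2 docs
merged = merge_segments(seg_a, 3, seg_b)
print(merged)
```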
Deletion:
Deletion = turn a bit off
• Ignore deleted documents when searching and merging (reclaims space)
• Merge policies favor segments with many deletions
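Deletion as "turning a bit off" can be sketched with a bitset of live documents kept next to an otherwise immutable segment. The class `Segment` and its methods are invented for this example; a Python int serves as the bitset.

```python
class Segment:
    def __init__(self, postings, doc_count):
        self.postings = postings          # {term: sorted docids}, never modified
        self.live = (1 << doc_count) - 1  # bit i set means doc i is still live

    def delete(self, docid):
        self.live &= ~(1 << docid)        # deletion = turn a bit off

    def search(self, term):
        """Return matching docids, skipping deleted documents."""
        return [d for d in self.postings.get(term, []) if (self.live >> d) & 1]

seg = Segment({"female": [0, 1], "male": [2]}, doc_count=3)
seg.delete(1)
print(seg.search("female"))  # [0]
```

The posting lists themselves are untouched; the space is only reclaimed when a later merge drops the deleted documents.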
Pros and cons:
• Updates require writing a new segment
– single-doc updates are costly, bulk updates preferred
– writes are sequential
• Segments are never modified in place
– filesystem-cache-friendly
– lock-free!
• Terms are deduplicated
– saves space for high-freq terms
• Docs are uniquely identified by an ord
– useful for cross-API communication
– Lucene can use several indexes in a single query
• Terms are uniquely identified by an ord
– important for sorting: compare longs, not strings
– important for faceting (more on this later)
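Why term ordinals help sorting and faceting can be shown with a toy example. The field values below are made up, and doc ids are 0-based list positions.

```python
terms = sorted({"Kate", "Ada", "Sara"})       # deduplicated, sorted term dictionary
ord_of = {t: i for i, t in enumerate(terms)}  # term -> ordinal

doc_terms = ["Sara", "Ada", "Kate", "Ada"]    # value of one field for docs 0..3
doc_ords = [ord_of[t] for t in doc_terms]     # per-doc ordinals (small ints)

# Sorting compares integers; strings are only resolved when results are displayed.
order = sorted(range(len(doc_ords)), key=doc_ords.__getitem__)

# Faceting counts docs per ordinal with plain array increments.
counts = [0] * len(terms)
for o in doc_ords:
    counts[o] += 1

print(order)                     # [1, 3, 2, 0]
print(dict(zip(terms, counts)))  # {'Ada': 2, 'Kate': 1, 'Sara': 1}
```

Because the dictionary is sorted, comparing two ordinals gives the same result as comparing the two strings.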
Column storage for fields:
Per doc and per field single numeric values, stored in a column-stride fashion
• Useful for sorting and custom scoring
• Norms are numeric doc values
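A column-stride layout can be pictured as one array per field, indexed by doc id. The sketch below assumes a hypothetical numeric field `age` and 0-based doc ids; `top_docs` is an invented helper showing how sorting reads only the one column it needs.

```python
from array import array

# One column per field: position i holds the value for docid i.
age = array("q", [18, 20, 18])  # hypothetical numeric field stored as 64-bit ints

def top_docs(matching, column, n):
    """Sort matching docids by their column value, descending, and keep the first n."""
    return sorted(matching, key=lambda d: column[d], reverse=True)[:n]

print(top_docs([0, 1, 2], age, 2))  # [1, 0]
```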
Some design principles:
• Save file handles
– don’t use one file per field or per doc
• Avoid disk seeks whenever possible
– disk seek on spinning disk is ~10 ms
• BUT don’t ignore the filesystem cache
– random access in small files is fine
• Light compression helps
– less I/O
– smaller indexes
– filesystem-cache-friendly
Data structures used for compression techniques: FSTs, LZ4