ES 6.3.2; the index is named user_v1, with 5 primary shards and one replica per shard. Each shard is roughly 11 GB:
GET _cat/shards/user_v1
In total there are about 340 million documents, and the primary shards add up to roughly 57 GB.
Segment information:
curl -X GET "221.228.105.140:9200/_cat/segments/user_v1?v" >> user_v1_segment
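For reference, the header row of the _cat/segments?v output in this ES version looks roughly like the line below; this is why column 7, extracted later with awk, is the docs.count column:

index shard prirep ip segment generation docs.count docs.deleted size size.memory committed searchable version compound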
The user_v1 index has 404 segments in total:
cat user_v1_segment | wc -l
404
Massage the data a bit and plot a histogram with Python to see the distribution:
sed -i '1d' user_v1_segment   # delete the first line of the file (the header)
awk -F ' ' '{print $7}' user_v1_segment >> docs_count   # pick the column of interest (the docs.count column)
import matplotlib.pyplot as plt

# read the docs.count column extracted above
with open('docs_count') as f:
    doc_nums = [int(line) for line in f.read().splitlines()]

# histogram of documents per segment
plt.hist(doc_nums, bins=40, facecolor='blue', edgecolor='black')
plt.show()
This gives a rough idea of how many documents each segment contains. The x-axis is the number of documents, the y-axis is the number of segments. Most segments contain only a small number of documents (below roughly $0.5\times10^7$).
Change refresh_interval to 30s (the default is 1s), which reduces the number of new segments to some extent. Then run a force merge first to bring the 404 segments down to 200:
POST /user_v1/_forcemerge?only_expunge_deletes=false&max_num_segments=200&flush=true
A check afterwards, however, still showed 312 segments, which is probably related to the merge settings. If you are interested, look into what the two parameters only_expunge_deletes and max_num_segments actually do during a force merge.
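As an aside, the refresh_interval change mentioned above can be applied with a dynamic settings update; a minimal sketch for the user_v1 index:

PUT /user_v1/_settings
{
  "index": {
    "refresh_interval": "30s"
  }
}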
Profiling the slow queries surfaced two problems:
1. The Collector time is too long; some shards take as long as 7.9s. For Profile analysis, see: profile-api
2. With the HanLP analysis plugin, the terms produced by the analyzer include a "whitespace term", and matching this single term takes as long as 800ms!
Let's look at the cause:
POST /_analyze
{
  "analyzer": "hanlp_standard",
  "text": "人生 如夢"
}
The analysis result contains a whitespace token:
{ "tokens": [ { "token": "人生", "start_offset": 0, "end_offset": 2, "type": "n", "position": 0 }, { "token": " ", "start_offset": 0, "end_offset": 1, "type": "w", "position": 1 }, { "token": "如", "start_offset": 0, "end_offset": 1, "type": "v", "position": 2 }, { "token": "夢", "start_offset": 0, "end_offset": 1, "type": "n", "position": 3 } ] }
So when an actual document goes through the analyzer, does the whitespace get stored as a term?
So first define an index with term_vector enabled. See: store term-vector
PUT user
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "profile": {
      "properties": {
        "nick": {
          "type": "text",
          "analyzer": "hanlp_standard",
          "term_vector": "yes",
          "fields": {
            "raw": { "type": "keyword" }
          }
        }
      }
    }
  }
}
Then PUT a document into it:
PUT user/profile/1
{
  "nick": "人生 如夢"
}
Check the term vectors (see: docs-termvectors):
GET /user/profile/1/_termvectors
{
  "fields" : ["nick"],
  "offsets" : true,
  "payloads" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}
The stored terms indeed include a whitespace term.
{ "_index": "user", "_type": "profile", "_id": "1", "_version": 1, "found": true, "took": 2, "term_vectors": { "nick": { "field_statistics": { "sum_doc_freq": 4, "doc_count": 1, "sum_ttf": 4 }, "terms": { " ": { "doc_freq": 1, "ttf": 1, "term_freq": 1 }, "人生": { "doc_freq": 1, "ttf": 1, "term_freq": 1 }, "如": { "doc_freq": 1, "ttf": 1, "term_freq": 1 }, "夢": { "doc_freq": 1, "ttf": 1, "term_freq": 1 } } } } }
Then run a query with profiling enabled:
GET user/profile/_search?human=true
{
  "profile": true,
  "query": {
    "match": {
      "nick": "人生 如夢"
    }
  }
}
The profile actually contains a TermQuery against the whitespace term! (note the space after "nick:")
"type": "TermQuery", "description": "nick: ", "time": "58.2micros", "time_in_nanos": 58244,
The full profile result is as follows:
"profile": { "shards": [ { "id": "[7MyDkEDrRj2RPHCPoaWveQ][user][0]", "searches": [ { "query": [ { "type": "BooleanQuery", "description": "nick:人生 nick: nick:如 nick:夢", "time": "642.9micros", "time_in_nanos": 642931, "breakdown": { "score": 13370, "build_scorer_count": 2, "match_count": 0, "create_weight": 390646, "next_doc": 18462, "match": 0, "create_weight_count": 1, "next_doc_count": 2, "score_count": 1, "build_scorer": 220447, "advance": 0, "advance_count": 0 }, "children": [ { "type": "TermQuery", "description": "nick:人生", "time": "206.6micros", "time_in_nanos": 206624, "breakdown": { "score": 942, "build_scorer_count": 3, "match_count": 0, "create_weight": 167545, "next_doc": 1493, "match": 0, "create_weight_count": 1, "next_doc_count": 2, "score_count": 1, "build_scorer": 36637, "advance": 0, "advance_count": 0 } }, { "type": "TermQuery", "description": "nick: ", "time": "58.2micros", "time_in_nanos": 58244, "breakdown": { "score": 918, "build_scorer_count": 3, "match_count": 0, "create_weight": 46130, "next_doc": 964, "match": 0, "create_weight_count": 1, "next_doc_count": 2, "score_count": 1, "build_scorer": 10225, "advance": 0, "advance_count": 0 } }, { "type": "TermQuery", "description": "nick:如", "time": "51.3micros", "time_in_nanos": 51334, "breakdown": { "score": 888, "build_scorer_count": 3, "match_count": 0, "create_weight": 43779, "next_doc": 1103, "match": 0, "create_weight_count": 1, "next_doc_count": 2, "score_count": 1, "build_scorer": 5557, "advance": 0, "advance_count": 0 } }, { "type": "TermQuery", "description": "nick:夢", "time": "59.1micros", "time_in_nanos": 59108, "breakdown": { "score": 3473, "build_scorer_count": 3, "match_count": 0, "create_weight": 49739, "next_doc": 900, "match": 0, "create_weight_count": 1, "next_doc_count": 2, "score_count": 1, "build_scorer": 4989, "advance": 0, "advance_count": 0 } } ] } ], "rewrite_time": 182090, "collector": [ { "name": "CancellableCollector", "reason": "search_cancelled", "time": "25.9micros", "time_in_nanos": 25906, "children": [ { "name": "SimpleTopScoreDocCollector", "reason": "search_top_hits", "time": "19micros", "time_in_nanos": 19075 } ] } ] } ], "aggregations": [] } ] }
In the actual production environment, the query on the whitespace term took 480ms, while the query on a normal word ("微信") took only 18ms. Below is the profile result on shard [user_v1][3]:
"profile": { "shards": [ { "id": "[8eN-6lsLTJ6as39QJhK5MQ][user_v1][3]", "searches": [ { "query": [ { "type": "BooleanQuery", "description": "nick:微信 nick: nick:黃色", "time": "888.6ms", "time_in_nanos": 888636963, "breakdown": { "score": 513864260, "build_scorer_count": 50, "match_count": 0, "create_weight": 93345, "next_doc": 364649642, "match": 0, "create_weight_count": 1, "next_doc_count": 5063173, "score_count": 4670398, "build_scorer": 296094, "advance": 0, "advance_count": 0 }, "children": [ { "type": "TermQuery", "description": "nick:微信", "time": "18.4ms", "time_in_nanos": 18480019, "breakdown": { "score": 656810, "build_scorer_count": 62, "match_count": 0, "create_weight": 23633, "next_doc": 17712339, "match": 0, "create_weight_count": 1, "next_doc_count": 7085, "score_count": 5705, "build_scorer": 74384, "advance": 0, "advance_count": 0 } }, { "type": "TermQuery", "description": "nick: ", "time": "480.5ms", "time_in_nanos": 480508016, "breakdown": { "score": 278358058, "build_scorer_count": 72, "match_count": 0, "create_weight": 6041, "next_doc": 192388910, "match": 0, "create_weight_count": 1, "next_doc_count": 5056541, "score_count": 4665006, "build_scorer": 33387, "advance": 0, "advance_count": 0 } }, { "type": "TermQuery", "description": "nick:黃色", "time": "3.8ms", "time_in_nanos": 3872679, "breakdown": { "score": 136812, "build_scorer_count": 50, "match_count": 0, "create_weight": 5423, "next_doc": 3700537, "match": 0, "create_weight_count": 1, "next_doc_count": 923, "score_count": 755, "build_scorer": 28178, "advance": 0, "advance_count": 0 } } ] } ], "rewrite_time": 583986593, "collector": [ { "name": "CancellableCollector", "reason": "search_cancelled", "time": "730.3ms", "time_in_nanos": 730399762, "children": [ { "name": "SimpleTopScoreDocCollector", "reason": "search_top_hits", "time": "533.2ms", "time_in_nanos": 533238387 } ] } ] } ], "aggregations": [] },
Since I use HanLP for segmentation through the elasticsearch-analysis-hanlp plugin, and ik_max_word segmentation does not show this problem, it is most likely a bug in the plugin, so I opened an issue on GitHub; follow it there if you are interested. It looks like I will have to study the source code of the whole ElasticSearch analyze flow and of the plugin loading mechanism :(
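Until the plugin is fixed, one possible workaround (my own sketch, not something taken from the plugin's documentation) is to wrap the HanLP tokenizer in a custom analyzer and drop whitespace-only tokens with a stop filter. The tokenizer name hanlp_standard below is an assumption and should be checked against the plugin version in use:

PUT user_v2
{
  "settings": {
    "analysis": {
      "filter": {
        "drop_blank": {
          "type": "stop",
          "stopwords": [" "]
        }
      },
      "analyzer": {
        "hanlp_no_blank": {
          "type": "custom",
          // assumption: the plugin registers a tokenizer named "hanlp_standard"
          "tokenizer": "hanlp_standard",
          "filter": ["drop_blank"]
        }
      }
    }
  }
}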
The above is a query performance problem caused by a single whitespace term. The Profile analysis also showed that the Collector time on SSD is roughly 10 times better than on a mechanical disk: the Collector time of shard [user_v1][0] was as long as 7.6 seconds, and the machine hosting that shard uses a mechanical hard disk, while shard [user_v1][3] above sits on an SSD and its Collector time was only 730.3ms. Below is the profile analysis for shard [user_v1][0]:
{ "id": "[wx0dqdubRkiqJJ-juAqH4A][user_v1][0]", "searches": [ { "query": [ { "type": "BooleanQuery", "description": "nick:微信 nick: nick:黃色", "time": "726.1ms", "time_in_nanos": 726190295, "breakdown": { "score": 339421458, "build_scorer_count": 48, "match_count": 0, "create_weight": 65012, "next_doc": 376526603, "match": 0, "create_weight_count": 1, "next_doc_count": 4935754, "score_count": 4665766, "build_scorer": 575653, "advance": 0, "advance_count": 0 }, "children": [ { "type": "TermQuery", "description": "nick:微信", "time": "63.2ms", "time_in_nanos": 63220487, "breakdown": { "score": 649184, "build_scorer_count": 61, "match_count": 0, "create_weight": 32572, "next_doc": 62398621, "match": 0, "create_weight_count": 1, "next_doc_count": 6759, "score_count": 5857, "build_scorer": 127432, "advance": 0, "advance_count": 0 } }, { "type": "TermQuery", "description": "nick: ", "time": "1m", "time_in_nanos": 60373841264, "breakdown": { "score": 60184752245, "build_scorer_count": 69, "match_count": 0, "create_weight": 5888, "next_doc": 179443959, "match": 0, "create_weight_count": 1, "next_doc_count": 4929373, "score_count": 4660228, "build_scorer": 49501, "advance": 0, "advance_count": 0 } }, { "type": "TermQuery", "description": "nick:黃色", "time": "528.1ms", "time_in_nanos": 528107489, "breakdown": { "score": 141744, "build_scorer_count": 43, "match_count": 0, "create_weight": 4717, "next_doc": 527942227, "match": 0, "create_weight_count": 1, "next_doc_count": 967, "score_count": 780, "build_scorer": 17010, "advance": 0, "advance_count": 0 } } ] } ], "rewrite_time": 993826311, "collector": [ { "name": "CancellableCollector", "reason": "search_cancelled", "time": "7.8s", "time_in_nanos": 7811511525, "children": [ { "name": "SimpleTopScoreDocCollector", "reason": "search_top_hits", "time": "7.6s", "time_in_nanos": 7616467158 } ] } ] } ], "aggregations": [] },
Query performance depends not only on the number of segments and the Collector time, but also on the index mapping and on how the query is written (match, filter, term, ...); the Profile API can be used to analyze query performance problems. There are also benchmarking tools such as esrally.
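For example, a minimal esrally run against an existing cluster could look like the command below (the track name and target host are placeholders, and the exact flags depend on the esrally version):

esrally --track=geonames --target-hosts=127.0.0.1:9200 --pipeline=benchmark-only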
For Chinese in particular, also pay attention to which tokens the query string is analyzed into, and therefore which terms are actually being queried. This can be verified with term vectors, but term vectors are usually not enabled in production. In short, the Chinese word segmentation algorithm affects what a search will match.
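Short of enabling term vectors, a lightweight way to check this is to run the query string through the field's own analyzer with the _analyze API; a sketch against the nick field of user_v1:

GET user_v1/_analyze
{
  "field": "nick",
  "text": "微信 黃色"
}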
As for ranking, you can first use the explain API to analyze the score contributed by each term, then consider ES's Function Score feature to boost on specific fields (field_value_factor), or even use machine learning models to optimize ranking (learning to rank).
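As an illustration of field_value_factor, a sketch of a Function Score query is shown below; note that the popularity field is hypothetical and not part of the user_v1 mapping described in this post:

GET user_v1/_search
{
  "query": {
    "function_score": {
      "query": { "match": { "nick": "微信" } },
      "field_value_factor": {
        "field": "popularity",   // hypothetical numeric field, for illustration only
        "modifier": "log1p",
        "factor": 1.2,
        "missing": 1
      }
    }
  }
}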
Some thoughts on improving ElasticSearch query performance: