An ElasticSearch Search Optimization

1. Environment

ES 6.3.2. The index is named user_v1, with 5 primary shards and one replica per shard. Each shard is roughly 11 GB (checked with GET _cat/shards/user_v1).

There are about 340 million documents in total, and the primary shards add up to 57 GB.
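
For reference, a cat-shards request along these lines lists the per-shard document count and store size (v and h are standard cat API parameters; the column list here is just one convenient choice):

GET _cat/shards/user_v1?v&h=index,shard,prirep,state,docs,store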

Segment information: curl -X GET "221.228.105.140:9200/_cat/segments/user_v1?v" >> user_v1_segment

The user_v1 index has 404 segments in total:

cat user_v1_segment | wc -l

404

After a bit of data massaging, plot a histogram with Python to get a feel for the distribution:

sed -i '1d' user_v1_segment   # delete the header line of the cat output

awk -F ' ' '{print $7}' user_v1_segment >> doc_count.txt   # extract the column of interest (the docs.count column)

import matplotlib.pyplot as plt

# read the docs.count column extracted above
with open('doc_count.txt') as f:
    doc_nums = [int(line) for line in f.read().splitlines() if line.strip()]

# histogram: x-axis = docs.count per segment, y-axis = number of segments
plt.hist(doc_nums, bins=40, facecolor='blue', edgecolor='black')
plt.xlabel('docs.count per segment')
plt.ylabel('number of segments')
plt.show()

This gives a rough view of how many documents each segment contains. The x-axis is the number of documents, the y-axis the number of segments. Most segments hold only a small number of documents (under $0.5 \times 10^7$).

Change refresh_interval to 30s (the default is 1s), which reduces the number of new segments to some extent. Then force merge the 404 segments down to 200:

POST /user_v1/_forcemerge?only_expunge_deletes=false&max_num_segments=200&flush=true
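
The refresh_interval change mentioned above is a dynamic index setting and can be applied roughly like this:

PUT /user_v1/_settings
{
  "index": {
    "refresh_interval": "30s"
  }
}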

However, a check afterwards still showed 312 segments, which is probably related to the merge configuration. If you are interested, look into the meaning of these two parameters during a force merge (a quick way to inspect the current merge settings is sketched after the list):

  • merge.policy.max_merge_at_once_explicit
  • merge.scheduler.max_merge_count
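
To see the merge settings actually in effect on the index (defaults included), the settings API can be queried as below; include_defaults is a standard parameter, and the merge-related entries appear under the defaults unless explicitly overridden:

GET /user_v1/_settings?include_defaults=true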

Run a profile analysis. Two findings:

1. The Collector time is too long; some shards take as long as 7.9s. For background on profile analysis, see: profile-api

2. With the HanLP analysis plugin, the terms produced by the analyzer surprisingly include a "whitespace term", and matching that term takes as long as 800ms!

Let's look at the cause:

POST /_analyze
{
  "analyzer": "hanlp_standard",
  "text": "人生 如夢"
}

The tokenization result includes the space:

{
  "tokens": [
    {
      "token": "人生",
      "start_offset": 0,
      "end_offset": 2,
      "type": "n",
      "position": 0
    },
    {
      "token": " ",
      "start_offset": 0,
      "end_offset": 1,
      "type": "w",
      "position": 1
    },
    {
      "token": "如",
      "start_offset": 0,
      "end_offset": 1,
      "type": "v",
      "position": 2
    },
    {
      "token": "夢",
      "start_offset": 0,
      "end_offset": 1,
      "type": "n",
      "position": 3
    }
  ]
}

So after a real document goes through the analyzer, is the space actually stored in the index?

To check, first define an index with term_vector enabled. See: store term-vector

PUT user
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "profile": {
      "properties": {
        "nick": {
          "type": "text",
          "analyzer": "hanlp_standard",
          "term_vector": "yes", 
          "fields": {
            "raw": {
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}

Then PUT a document into it:

PUT user/profile/1
{
  "nick":"人生 如夢"
}

Check the term vectors: docs-termvectors

GET /user/profile/1/_termvectors
{
  "fields" : ["nick"],
  "offsets" : true,
  "payloads" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}

The stored terms indeed include a whitespace term.

{
  "_index": "user",
  "_type": "profile",
  "_id": "1",
  "_version": 1,
  "found": true,
  "took": 2,
  "term_vectors": {
    "nick": {
      "field_statistics": {
        "sum_doc_freq": 4,
        "doc_count": 1,
        "sum_ttf": 4
      },
      "terms": {
        " ": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1
        },
        "人生": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1
        },
        "如": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1
        },
        "夢": {
          "doc_freq": 1,
          "ttf": 1,
          "term_freq": 1
        }
      }
    }
  }
}

Then run the query again with profile enabled:

GET user/profile/_search?human=true
{
  "profile":true,
  "query": {
    "match": {
      "nick": "人生 如夢"
    }
  }
}

The profile output actually contains a query against the whitespace term! (Note the space after "nick:")

"type": "TermQuery",
            "description": "nick: ",
            "time": "58.2micros",
            "time_in_nanos": 58244,

The full profile result is as follows:

"profile": {
    "shards": [
      {
        "id": "[7MyDkEDrRj2RPHCPoaWveQ][user][0]",
        "searches": [
          {
            "query": [
              {
                "type": "BooleanQuery",
                "description": "nick:人生 nick:  nick:如 nick:夢",
                "time": "642.9micros",
                "time_in_nanos": 642931,
                "breakdown": {
                  "score": 13370,
                  "build_scorer_count": 2,
                  "match_count": 0,
                  "create_weight": 390646,
                  "next_doc": 18462,
                  "match": 0,
                  "create_weight_count": 1,
                  "next_doc_count": 2,
                  "score_count": 1,
                  "build_scorer": 220447,
                  "advance": 0,
                  "advance_count": 0
                },
                "children": [
                  {
                    "type": "TermQuery",
                    "description": "nick:人生",
                    "time": "206.6micros",
                    "time_in_nanos": 206624,
                    "breakdown": {
                      "score": 942,
                      "build_scorer_count": 3,
                      "match_count": 0,
                      "create_weight": 167545,
                      "next_doc": 1493,
                      "match": 0,
                      "create_weight_count": 1,
                      "next_doc_count": 2,
                      "score_count": 1,
                      "build_scorer": 36637,
                      "advance": 0,
                      "advance_count": 0
                    }
                  },
                  {
                    "type": "TermQuery",
                    "description": "nick: ",
                    "time": "58.2micros",
                    "time_in_nanos": 58244,
                    "breakdown": {
                      "score": 918,
                      "build_scorer_count": 3,
                      "match_count": 0,
                      "create_weight": 46130,
                      "next_doc": 964,
                      "match": 0,
                      "create_weight_count": 1,
                      "next_doc_count": 2,
                      "score_count": 1,
                      "build_scorer": 10225,
                      "advance": 0,
                      "advance_count": 0
                    }
                  },
                  {
                    "type": "TermQuery",
                    "description": "nick:如",
                    "time": "51.3micros",
                    "time_in_nanos": 51334,
                    "breakdown": {
                      "score": 888,
                      "build_scorer_count": 3,
                      "match_count": 0,
                      "create_weight": 43779,
                      "next_doc": 1103,
                      "match": 0,
                      "create_weight_count": 1,
                      "next_doc_count": 2,
                      "score_count": 1,
                      "build_scorer": 5557,
                      "advance": 0,
                      "advance_count": 0
                    }
                  },
                  {
                    "type": "TermQuery",
                    "description": "nick:夢",
                    "time": "59.1micros",
                    "time_in_nanos": 59108,
                    "breakdown": {
                      "score": 3473,
                      "build_scorer_count": 3,
                      "match_count": 0,
                      "create_weight": 49739,
                      "next_doc": 900,
                      "match": 0,
                      "create_weight_count": 1,
                      "next_doc_count": 2,
                      "score_count": 1,
                      "build_scorer": 4989,
                      "advance": 0,
                      "advance_count": 0
                    }
                  }
                ]
              }
            ],
            "rewrite_time": 182090,
            "collector": [
              {
                "name": "CancellableCollector",
                "reason": "search_cancelled",
                "time": "25.9micros",
                "time_in_nanos": 25906,
                "children": [
                  {
                    "name": "SimpleTopScoreDocCollector",
                    "reason": "search_top_hits",
                    "time": "19micros",
                    "time_in_nanos": 19075
                  }
                ]
              }
            ]
          }
        ],
        "aggregations": []
      }
    ]
  }

In the real production environment, the query on the whitespace term takes 480ms, while the query on a normal word ("微信") takes only 18ms. Below is the profile result on shard [user_v1][3]:

"profile": {
    "shards": [
      {
        "id": "[8eN-6lsLTJ6as39QJhK5MQ][user_v1][3]",
        "searches": [
          {
            "query": [
              {
                "type": "BooleanQuery",
                "description": "nick:微信 nick:  nick:黃色",
                "time": "888.6ms",
                "time_in_nanos": 888636963,
                "breakdown": {
                  "score": 513864260,
                  "build_scorer_count": 50,
                  "match_count": 0,
                  "create_weight": 93345,
                  "next_doc": 364649642,
                  "match": 0,
                  "create_weight_count": 1,
                  "next_doc_count": 5063173,
                  "score_count": 4670398,
                  "build_scorer": 296094,
                  "advance": 0,
                  "advance_count": 0
                },
                "children": [
                  {
                    "type": "TermQuery",
                    "description": "nick:微信",
                    "time": "18.4ms",
                    "time_in_nanos": 18480019,
                    "breakdown": {
                      "score": 656810,
                      "build_scorer_count": 62,
                      "match_count": 0,
                      "create_weight": 23633,
                      "next_doc": 17712339,
                      "match": 0,
                      "create_weight_count": 1,
                      "next_doc_count": 7085,
                      "score_count": 5705,
                      "build_scorer": 74384,
                      "advance": 0,
                      "advance_count": 0
                    }
                  },
                  {
                    "type": "TermQuery",
                    "description": "nick: ",
                    "time": "480.5ms",
                    "time_in_nanos": 480508016,
                    "breakdown": {
                      "score": 278358058,
                      "build_scorer_count": 72,
                      "match_count": 0,
                      "create_weight": 6041,
                      "next_doc": 192388910,
                      "match": 0,
                      "create_weight_count": 1,
                      "next_doc_count": 5056541,
                      "score_count": 4665006,
                      "build_scorer": 33387,
                      "advance": 0,
                      "advance_count": 0
                    }
                  },
                  {
                    "type": "TermQuery",
                    "description": "nick:黃色",
                    "time": "3.8ms",
                    "time_in_nanos": 3872679,
                    "breakdown": {
                      "score": 136812,
                      "build_scorer_count": 50,
                      "match_count": 0,
                      "create_weight": 5423,
                      "next_doc": 3700537,
                      "match": 0,
                      "create_weight_count": 1,
                      "next_doc_count": 923,
                      "score_count": 755,
                      "build_scorer": 28178,
                      "advance": 0,
                      "advance_count": 0
                    }
                  }
                ]
              }
            ],
            "rewrite_time": 583986593,
            "collector": [
              {
                "name": "CancellableCollector",
                "reason": "search_cancelled",
                "time": "730.3ms",
                "time_in_nanos": 730399762,
                "children": [
                  {
                    "name": "SimpleTopScoreDocCollector",
                    "reason": "search_top_hits",
                    "time": "533.2ms",
                    "time_in_nanos": 533238387
                  }
                ]
              }
            ]
          }
        ],
        "aggregations": []
      },

I am tokenizing with HanLP via the elasticsearch-analysis-hanlp plugin; the ik_max_word analyzer does not show this problem, so this is most likely a bug in the plugin. I filed an issue on GitHub, which you can follow if interested. It looks like I will have to study the source code of ElasticSearch's analyze flow and its plugin loading. :(
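
For comparison, the same text can be run through ik_max_word (assuming the IK analysis plugin is installed on the cluster); as noted above, it does not emit a blank token:

POST /_analyze
{
  "analyzer": "ik_max_word",
  "text": "人生 如夢"
}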

That is the query performance problem caused by a whitespace term. The profile analysis also showed that the Collector time on SSD is roughly 10x faster than on a spinning disk.

The Collector time of shard [user_v1][0] is as long as 7.6 seconds, and that shard sits on a machine with a spinning disk, whereas shard [user_v1][3] above sits on an SSD and its Collector time is only 730.3ms, roughly a 10x difference. Below is the profile result for shard [user_v1][0]:

{
        "id": "[wx0dqdubRkiqJJ-juAqH4A][user_v1][0]",
        "searches": [
          {
            "query": [
              {
                "type": "BooleanQuery",
                "description": "nick:微信 nick:  nick:黃色",
                "time": "726.1ms",
                "time_in_nanos": 726190295,
                "breakdown": {
                  "score": 339421458,
                  "build_scorer_count": 48,
                  "match_count": 0,
                  "create_weight": 65012,
                  "next_doc": 376526603,
                  "match": 0,
                  "create_weight_count": 1,
                  "next_doc_count": 4935754,
                  "score_count": 4665766,
                  "build_scorer": 575653,
                  "advance": 0,
                  "advance_count": 0
                },
                "children": [
                  {
                    "type": "TermQuery",
                    "description": "nick:微信",
                    "time": "63.2ms",
                    "time_in_nanos": 63220487,
                    "breakdown": {
                      "score": 649184,
                      "build_scorer_count": 61,
                      "match_count": 0,
                      "create_weight": 32572,
                      "next_doc": 62398621,
                      "match": 0,
                      "create_weight_count": 1,
                      "next_doc_count": 6759,
                      "score_count": 5857,
                      "build_scorer": 127432,
                      "advance": 0,
                      "advance_count": 0
                    }
                  },
                  {
                    "type": "TermQuery",
                    "description": "nick: ",
                    "time": "1m",
                    "time_in_nanos": 60373841264,
                    "breakdown": {
                      "score": 60184752245,
                      "build_scorer_count": 69,
                      "match_count": 0,
                      "create_weight": 5888,
                      "next_doc": 179443959,
                      "match": 0,
                      "create_weight_count": 1,
                      "next_doc_count": 4929373,
                      "score_count": 4660228,
                      "build_scorer": 49501,
                      "advance": 0,
                      "advance_count": 0
                    }
                  },
                  {
                    "type": "TermQuery",
                    "description": "nick:黃色",
                    "time": "528.1ms",
                    "time_in_nanos": 528107489,
                    "breakdown": {
                      "score": 141744,
                      "build_scorer_count": 43,
                      "match_count": 0,
                      "create_weight": 4717,
                      "next_doc": 527942227,
                      "match": 0,
                      "create_weight_count": 1,
                      "next_doc_count": 967,
                      "score_count": 780,
                      "build_scorer": 17010,
                      "advance": 0,
                      "advance_count": 0
                    }
                  }
                ]
              }
            ],
            "rewrite_time": 993826311,
            "collector": [
              {
                "name": "CancellableCollector",
                "reason": "search_cancelled",
                "time": "7.8s",
                "time_in_nanos": 7811511525,
                "children": [
                  {
                    "name": "SimpleTopScoreDocCollector",
                    "reason": "search_top_hits",
                    "time": "7.6s",
                    "time_in_nanos": 7616467158
                  }
                ]
              }
            ]
          }
        ],
        "aggregations": []
      },
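
To confirm which node (and therefore which kind of disk) each shard of user_v1 is allocated on, a cat-shards request with the node column is enough:

GET _cat/shards/user_v1?v&h=index,shard,prirep,node,ip,store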

Conclusion

Query performance depends not only on the number of segments and the Collector time, but also on the index mapping and the query type (match, filter, term, ...). The Profile API can be used to analyze query performance problems, and there are also benchmarking tools such as esrally.

For Chinese text in particular, pay attention to which tokens the query string is analyzed into and therefore which terms are actually queried. This can be tested with term vectors, although term vectors are usually not enabled in production, so the Chinese word segmentation algorithm has a real impact on what a search matches.
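
One workaround for that production limitation: the _termvectors API can also compute term vectors on the fly for an artificial document (one not stored in the index), so a spot check like the one earlier does not strictly require term_vector in the mapping. A sketch against the test index defined above:

GET /user/profile/_termvectors
{
  "doc": {
    "nick": "人生 如夢"
  },
  "fields": ["nick"]
}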

As for search ranking, you can first use the explain API to analyze the score contribution of each term, then consider ES's Function Score feature to adjust scoring based on specific fields (field_value_factor), or even use machine learning models to optimize ranking (learning to rank).
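
A minimal sketch of the field_value_factor idea (the popularity field here is hypothetical, used only for illustration, and is not part of the mappings discussed above):

GET /user_v1/_search
{
  "query": {
    "function_score": {
      "query": { "match": { "nick": "人生" } },
      "field_value_factor": {
        "field": "popularity",
        "modifier": "log1p",
        "missing": 1
      },
      "boost_mode": "multiply"
    }
  }
}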

Some thoughts on improving ElasticSearch query efficiency:

  1. The filesystem cache should be large enough (heap vs. off-heap memory), and data placement should be sensible (hot/cold separation)
  2. Index design should be reasonable (multi-fields, Analyzer, number of shards), and the segment count kept under control (refresh_interval setting)
  3. Use the appropriate query syntax (term, match, filter), and tune search parameters (terminate_after for early termination, timeout for a response deadline); see the sketch after this list
  4. Profile analysis
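
A sketch of the search-parameter tuning mentioned in item 3; terminate_after and timeout are standard search body parameters, and the values here are arbitrary:

GET /user_v1/_search
{
  "timeout": "500ms",
  "terminate_after": 100000,
  "query": {
    "match": { "nick": "人生" }
  }
}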
