ES 6.3.2; the index is named user_v1, with 5 primary shards and one replica per shard. Each shard is roughly 11 GB:
GET _cat/shards/user_v1
In total there are about 340 million documents, and the primary shards add up to roughly 57 GB.
Segment information:
curl -X GET "221.228.105.140:9200/_cat/segments/user_v1?v" >> user_v1_segment
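For reference, the header row of the _cat/segments?v output in this ES version looks roughly like the line below; this is why column 7, extracted later with awk, is the docs.count column:

index shard prirep ip segment generation docs.count docs.deleted size size.memory committed searchable version compound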
The user_v1 index has 404 segments in total:
cat user_v1_segment | wc -l
404
Massage the data a bit and plot a histogram with Python to see the distribution:
sed -i '1d' user_v1_segment   # delete the first line of the file (the header)
awk -F ' ' '{print $7}' user_v1_segment >> docs_count   # pick the column of interest (the docs.count column)
import matplotlib.pyplot as plt

# read the docs.count column extracted above
with open('docs_count') as f:
    doc_nums = [int(line) for line in f.read().splitlines()]

# histogram of documents per segment
plt.hist(doc_nums, bins=40, facecolor='blue', edgecolor='black')
plt.show()
This gives a rough idea of how many documents each segment contains. The x-axis is the number of documents, the y-axis is the number of segments. Most segments contain only a small number of documents (below roughly $0.5\times10^7$).
Change refresh_interval to 30s (the default is 1s), which reduces the number of new segments to some extent. Then run a force merge first to bring the 404 segments down to 200:
POST /user_v1/_forcemerge?only_expunge_deletes=false&max_num_segments=200&flush=true
A check afterwards, however, still showed 312 segments, which is probably related to the merge settings. If you are interested, look into what the two parameters only_expunge_deletes and max_num_segments actually do during a force merge.
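As an aside, the refresh_interval change mentioned above can be applied with a dynamic settings update; a minimal sketch for the user_v1 index:

PUT /user_v1/_settings
{
  "index": {
    "refresh_interval": "30s"
  }
}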
Profiling the slow queries surfaced two problems:
1. The Collector time is too long; some shards take as long as 7.9s. For Profile analysis, see: profile-api
2. With the HanLP analysis plugin, the terms produced by the analyzer include a "whitespace term", and matching this single term takes as long as 800ms!
Let's look at the cause:
POST /_analyze
{
  "analyzer": "hanlp_standard",
  "text": "人生 如夢"
}
The analysis result contains a whitespace token:
{ "tokens": [ { "token": "人生", "start_offset": 0, "end_offset": 2, "type": "n", "position": 0 }, { "token": " ", "start_offset": 0, "end_offset": 1, "type": "w", "position": 1 }, { "token": "如", "start_offset": 0, "end_offset": 1, "type": "v", "position": 2 }, { "token": "夢", "start_offset": 0, "end_offset": 1, "type": "n", "position": 3 } ] }
So when an actual document goes through the analyzer, does the whitespace get stored as a term?
So first define an index with term_vector enabled. See: store term-vector
PUT user
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0
  },
  "mappings": {
    "profile": {
      "properties": {
        "nick": {
          "type": "text",
          "analyzer": "hanlp_standard",
          "term_vector": "yes",
          "fields": {
            "raw": { "type": "keyword" }
          }
        }
      }
    }
  }
}
Then PUT a document into it:
PUT user/profile/1
{
  "nick": "人生 如夢"
}
Check the term vectors (see: docs-termvectors):
GET /user/profile/1/_termvectors
{
  "fields" : ["nick"],
  "offsets" : true,
  "payloads" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}
The stored terms indeed include a whitespace term.
{ "_index": "user", "_type": "profile", "_id": "1", "_version": 1, "found": true, "took": 2, "term_vectors": { "nick": { "field_statistics": { "sum_doc_freq": 4, "doc_count": 1, "sum_ttf": 4 }, "terms": { " ": { "doc_freq": 1, "ttf": 1, "term_freq": 1 }, "人生": { "doc_freq": 1, "ttf": 1, "term_freq": 1 }, "如": { "doc_freq": 1, "ttf": 1, "term_freq": 1 }, "夢": { "doc_freq": 1, "ttf": 1, "term_freq": 1 } } } } }
Then run a query with profiling enabled:
GET user/profile/_search?human=true
{
  "profile": true,
  "query": {
    "match": {
      "nick": "人生 如夢"
    }
  }
}
The profile actually contains a TermQuery against the whitespace term! (note the space after "nick:")
"type": "TermQuery", "description": "nick: ", "time": "58.2micros", "time_in_nanos": 58244,
The full profile result is as follows:
"profile": { "shards": [ { "id": "[7MyDkEDrRj2RPHCPoaWveQ][user][0]", "searches": [ { "query": [ { "type": "BooleanQuery", "description": "nick:人生 nick: nick:如 nick:夢", "time": "642.9micros", "time_in_nanos": 642931, "breakdown": { "score": 13370, "build_scorer_count": 2, "match_count": 0, "create_weight": 390646, "next_doc": 18462, "match": 0, "create_weight_count": 1, "next_doc_count": 2, "score_count": 1, "build_scorer": 220447, "advance": 0, "advance_count": 0 }, "children": [ { "type": "TermQuery", "description": "nick:人生", "time": "206.6micros", "time_in_nanos": 206624, "breakdown": { "score": 942, "build_scorer_count": 3, "match_count": 0, "create_weight": 167545, "next_doc": 1493, "match": 0, "create_weight_count": 1, "next_doc_count": 2, "score_count": 1, "build_scorer": 36637, "advance": 0, "advance_count": 0 } }, { "type": "TermQuery", "description": "nick: ", "time": "58.2micros", "time_in_nanos": 58244, "breakdown": { "score": 918, "build_scorer_count": 3, "match_count": 0, "create_weight": 46130, "next_doc": 964, "match": 0, "create_weight_count": 1, "next_doc_count": 2, "score_count": 1, "build_scorer": 10225, "advance": 0, "advance_count": 0 } }, { "type": "TermQuery", "description": "nick:如", "time": "51.3micros", "time_in_nanos": 51334, "breakdown": { "score": 888, "build_scorer_count": 3, "match_count": 0, "create_weight": 43779, "next_doc": 1103, "match": 0, "create_weight_count": 1, "next_doc_count": 2, "score_count": 1, "build_scorer": 5557, "advance": 0, "advance_count": 0 } }, { "type": "TermQuery", "description": "nick:夢", "time": "59.1micros", "time_in_nanos": 59108, "breakdown": { "score": 3473, "build_scorer_count": 3, "match_count": 0, "create_weight": 49739, "next_doc": 900, "match": 0, "create_weight_count": 1, "next_doc_count": 2, "score_count": 1, "build_scorer": 4989, "advance": 0, "advance_count": 0 } } ] } ], "rewrite_time": 182090, "collector": [ { "name": "CancellableCollector", "reason": "search_cancelled", "time": "25.9micros", "time_in_nanos": 25906, "children": [ { "name": "SimpleTopScoreDocCollector", "reason": "search_top_hits", "time": "19micros", "time_in_nanos": 19075 } ] } ] } ], "aggregations": [] } ] }
In the actual production environment, the query on the whitespace term took 480ms, while the query on a normal word ("微信") took only 18ms. Below is the profile result on shard [user_v1][3]:
"profile": { "shards": [ { "id": "[8eN-6lsLTJ6as39QJhK5MQ][user_v1][3]", "searches": [ { "query": [ { "type": "BooleanQuery", "description": "nick:微信 nick: nick:黃色", "time": "888.6ms", "time_in_nanos": 888636963, "breakdown": { "score": 513864260, "build_scorer_count": 50, "match_count": 0, "create_weight": 93345, "next_doc": 364649642, "match": 0, "create_weight_count": 1, "next_doc_count": 5063173, "score_count": 4670398, "build_scorer": 296094, "advance": 0, "advance_count": 0 }, "children": [ { "type": "TermQuery", "description": "nick:微信", "time": "18.4ms", "time_in_nanos": 18480019, "breakdown": { "score": 656810, "build_scorer_count": 62, "match_count": 0, "create_weight": 23633, "next_doc": 17712339, "match": 0, "create_weight_count": 1, "next_doc_count": 7085, "score_count": 5705, "build_scorer": 74384, "advance": 0, "advance_count": 0 } }, { "type": "TermQuery", "description": "nick: ", "time": "480.5ms", "time_in_nanos": 480508016, "breakdown": { "score": 278358058, "build_scorer_count": 72, "match_count": 0, "create_weight": 6041, "next_doc": 192388910, "match": 0, "create_weight_count": 1, "next_doc_count": 5056541, "score_count": 4665006, "build_scorer": 33387, "advance": 0, "advance_count": 0 } }, { "type": "TermQuery", "description": "nick:黃色", "time": "3.8ms", "time_in_nanos": 3872679, "breakdown": { "score": 136812, "build_scorer_count": 50, "match_count": 0, "create_weight": 5423, "next_doc": 3700537, "match": 0, "create_weight_count": 1, "next_doc_count": 923, "score_count": 755, "build_scorer": 28178, "advance": 0, "advance_count": 0 } } ] } ], "rewrite_time": 583986593, "collector": [ { "name": "CancellableCollector", "reason": "search_cancelled", "time": "730.3ms", "time_in_nanos": 730399762, "children": [ { "name": "SimpleTopScoreDocCollector", "reason": "search_top_hits", "time": "533.2ms", "time_in_nanos": 533238387 } ] } ] } ], "aggregations": [] },
Since I use HanLP for segmentation through the elasticsearch-analysis-hanlp plugin, and ik_max_word segmentation does not show this problem, it is most likely a bug in the plugin, so I opened an issue on GitHub; follow it there if you are interested. It looks like I will have to study the source code of the whole ElasticSearch analyze flow and of the plugin loading mechanism :(
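Until the plugin is fixed, one possible workaround (my own sketch, not something taken from the plugin's documentation) is to wrap the HanLP tokenizer in a custom analyzer and drop whitespace-only tokens with a stop filter. The tokenizer name hanlp_standard below is an assumption and should be checked against the plugin version in use:

PUT user_v2
{
  "settings": {
    "analysis": {
      "filter": {
        "drop_blank": {
          "type": "stop",
          "stopwords": [" "]
        }
      },
      "analyzer": {
        "hanlp_no_blank": {
          "type": "custom",
          // assumption: the plugin registers a tokenizer named "hanlp_standard"
          "tokenizer": "hanlp_standard",
          "filter": ["drop_blank"]
        }
      }
    }
  }
}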
The above is a query performance problem caused by a single whitespace term. The Profile analysis also showed that the Collector time on SSD is roughly 10 times better than on a mechanical disk: the Collector time of shard [user_v1][0] was as long as 7.6 seconds, and the machine hosting that shard uses a mechanical hard disk, while shard [user_v1][3] above sits on an SSD and its Collector time was only 730.3ms. Below is the profile analysis for shard [user_v1][0]:
{ "id": "[wx0dqdubRkiqJJ-juAqH4A][user_v1][0]", "searches": [ { "query": [ { "type": "BooleanQuery", "description": "nick:微信 nick: nick:黃色", "time": "726.1ms", "time_in_nanos": 726190295, "breakdown": { "score": 339421458, "build_scorer_count": 48, "match_count": 0, "create_weight": 65012, "next_doc": 376526603, "match": 0, "create_weight_count": 1, "next_doc_count": 4935754, "score_count": 4665766, "build_scorer": 575653, "advance": 0, "advance_count": 0 }, "children": [ { "type": "TermQuery", "description": "nick:微信", "time": "63.2ms", "time_in_nanos": 63220487, "breakdown": { "score": 649184, "build_scorer_count": 61, "match_count": 0, "create_weight": 32572, "next_doc": 62398621, "match": 0, "create_weight_count": 1, "next_doc_count": 6759, "score_count": 5857, "build_scorer": 127432, "advance": 0, "advance_count": 0 } }, { "type": "TermQuery", "description": "nick: ", "time": "1m", "time_in_nanos": 60373841264, "breakdown": { "score": 60184752245, "build_scorer_count": 69, "match_count": 0, "create_weight": 5888, "next_doc": 179443959, "match": 0, "create_weight_count": 1, "next_doc_count": 4929373, "score_count": 4660228, "build_scorer": 49501, "advance": 0, "advance_count": 0 } }, { "type": "TermQuery", "description": "nick:黃色", "time": "528.1ms", "time_in_nanos": 528107489, "breakdown": { "score": 141744, "build_scorer_count": 43, "match_count": 0, "create_weight": 4717, "next_doc": 527942227, "match": 0, "create_weight_count": 1, "next_doc_count": 967, "score_count": 780, "build_scorer": 17010, "advance": 0, "advance_count": 0 } } ] } ], "rewrite_time": 993826311, "collector": [ { "name": "CancellableCollector", "reason": "search_cancelled", "time": "7.8s", "time_in_nanos": 7811511525, "children": [ { "name": "SimpleTopScoreDocCollector", "reason": "search_top_hits", "time": "7.6s", "time_in_nanos": 7616467158 } ] } ] } ], "aggregations": [] },
Query performance depends not only on the number of segments and the Collector time, but also on the index mapping and on how the query is written (match, filter, term, ...); the Profile API can be used to analyze query performance problems. There are also benchmarking tools such as esrally.
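For example, a minimal esrally run against an existing cluster could look like the command below (the track name and target host are placeholders, and the exact flags depend on the esrally version):

esrally --track=geonames --target-hosts=127.0.0.1:9200 --pipeline=benchmark-only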
For Chinese in particular, also pay attention to which tokens the query string is analyzed into, and therefore which terms are actually being queried. This can be verified with term vectors, but term vectors are usually not enabled in production. In short, the Chinese word segmentation algorithm affects what a search will match.
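Short of enabling term vectors, a lightweight way to check this is to run the query string through the field's own analyzer with the _analyze API; a sketch against the nick field of user_v1:

GET user_v1/_analyze
{
  "field": "nick",
  "text": "微信 黃色"
}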
As for ranking, you can first use the explain API to analyze the score contributed by each term, then consider ES's Function Score feature to boost on specific fields (field_value_factor), or even use machine learning models to optimize ranking (learning to rank).
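As an illustration of field_value_factor, a sketch of a Function Score query is shown below; note that the popularity field is hypothetical and not part of the user_v1 mapping described in this post:

GET user_v1/_search
{
  "query": {
    "function_score": {
      "query": { "match": { "nick": "微信" } },
      "field_value_factor": {
        "field": "popularity",   // hypothetical numeric field, for illustration only
        "modifier": "log1p",
        "factor": 1.2,
        "missing": 1
      }
    }
  }
}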
Some thoughts on improving ElasticSearch query performance: