(3) Getting to Know ElasticSearch: Searching, Filtering, and Sorting

Time: 2019-11-06

Published: September 20, 2019

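Recap of the previous section

1. Covered how to create, read, update, and delete (CRUD) an index.
2. Re-explained type: it acts only as a piece of metadata.
3. Covered how to CRUD documents.

Preface to this section

1. ElasticSearch's main function is search, so this section focuses on searching, including how to run full-text searches with keywords.
2. Besides searching, it also covers related topics such as pagination, sorting, and aggregation analysis.
3. It also fills in some additional search-related background.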

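Searching documents

Test data: insert the following data first so you can practice searching.
[Looking back at my earlier posts, I noticed that I used the wrong format when preparing data later on, so the documents with ids 1, 2, and 3 have different fields from the later documents. You can test using only the data below.]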

PUT /douban/book/5
{
    "book_id":5,
    "book_name":"A Boy's Own Story",
    "book_author":"Edmund White",
    "book_pages":217,
    "book_express":"Vintage",
    "publish_date":"1994-02-01",
    "book_summary":"""

  An instant classic upon its original publication, A Boy's Own Story is the first of Edmund White's highly acclaimed trilogy of autobiographical novels that brilliantly evoke a young man's coming of age and document American gay life through the last forty years.
  
  The nameless narrator in this deeply affecting work reminisces about growing up in the 1950s with emotionally aloof, divorced parents, an unrelenting sister, and the schoolmates who taunt him. He finds consolation in literature and his fantastic imagination. Eager to cultivate intimate, enduring friendships, he becomes aware of his yearning to be loved by men, and struggles with the guilt and shame of accepting who he is. Written with lyrical delicacy and extraordinary power, A Boy's Own Story is a triumph."""
}


PUT /douban/book/6
{
    "book_id":6,
    "book_name":"The Lost Language of Cranes",
    "book_author":"David Leavitt",
    "book_pages":352,
    "book_express":"Bloomsbury Publishing PLC",
    "publish_date":"2005-05-02",
    "book_summary":"""David Leavitt's extraordinary first novel, now reissued in paperback, is a seminal work about family, sexual identity, home, and loss. Set in the 1980s against the backdrop of a swiftly gentrifying Manhattan, The Lost Language of Cranes tells the story of twenty-five-year-old Philip, who realizes he must come out to his parents after falling in love for the first time with a man. Philip's parents are facing their own crisis: pressure from developers and the loss of their longtime home. But the real threat to this family is Philip's father's own struggle with his latent homosexuality, realized only in his Sunday afternoon visits to gay porn theaters. Philip's admission to his parents and his father's hidden life provoke changes that forever alter the landscape of their worlds."""
}

PUT /douban/book/7
{
    "book_id":7,
    "book_name":"Immortality",
    "book_author":"Milan Kundera",
    "book_pages":400,
    "book_express":"Faber and Faber",
    "publish_date":"2000-08-21",
    "book_summary":"""Milan Kundera's sixth novel springs from a casual gesture of a woman to her swimming instructor, a gesture that creates a character in the mind of a writer named Kundera. Like Flaubert's Emma or Tolstoy's Anna, Kundera's Agnes becomes an object of fascination, of indefinable longing. From that character springs a novel, a gesture of the imagination that both embodies and articulates Milan Kundera's supreme mastery of the novel and its purpose: to explore thoroughly the great themes of existence."""
}

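There are two main ways to search: URL search and request-body search. The former puts the search conditions in the URL; the latter puts them in the request body.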

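URL parameter search

Syntax: GET /index/type/_search?parameters

Parameters:

q: queries a given field. For example, q=book_name:book searches for documents whose book_name contains book.
sort: sorts by a given field. For example, sort=cost:desc sorts by the cost field in descending (desc) order.
Others: fields, timeout, analyzer [these parameters are covered under request-body search]
With no parameters, the request is a "match everything" search.
Multiple parameters are joined with &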
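
As a rough illustration of how these parameters combine, the following Python sketch assembles a search URL with the standard library. The host http://localhost:9200 and the helper name build_search_url are assumptions made for this example; they do not come from the original post.

```python
from urllib.parse import urlencode


def build_search_url(index, doc_type, **params):
    """Build a URL-parameter search against /index/type/_search.

    The base address is an assumed local node; adjust it for your cluster.
    """
    base = f"http://localhost:9200/{index}/{doc_type}/_search"
    if not params:
        return base  # no parameters means "match everything"
    # urlencode joins the parameters with a single "&" and escapes ":" as %3A
    return base + "?" + urlencode(params)


url = build_search_url("douban", "book", q="book_summary:a", sort="book_pages:desc")
print(url)
```

The escaping is one reason this style gets awkward quickly: every extra condition has to be encoded into the query string by hand or by a helper like this one.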

Examples:

GET /douban/book/_search?q=book_summary:character
GET /douban/book/_search?q=book_author:Milan
GET /douban/book/_search?q=book_summary:a
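GET /douban/book/_search?q=book_summary:a&sort=book_pages:desc
GET /douban/book/_search?q=book_summary:a+AND+book_author:Milan
[Note: do not sort on text-type fields for now, as this interferes with searching; sort on integer fields only. This is covered in more detail later.]

Interpreting the query results:
[Because the data is too long, I provided a screenshot of a different search result instead.]

Note: putting search conditions in the URL is relatively uncommon, because concatenating query parameters into a URL is cumbersome.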

Request-body search

Syntax and examples:

// Match everything (no request body)
GET /index/type/_search
GET /douban/book/_search

// Match everything (explicit match_all)
GET /index/type/_search
{
  "query": {
    "match_all": {}
  }
}
GET /douban/book/_search
{
  "query": {
    "match_all": {}
  }
}

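// Query a specific field (full-text search; if the search value contains several words, documents matching only one of the words are also returned):
GET /index/type/_search
{
  "query": {
    "match": {
      "field_name": "search value"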
    }
  }
}
GET /douban/book/_search
{
  "query": {
    "match": {
      "book_name": "A The"
    }
  }
}


// Search several fields with the same search value:
GET /index/type/_search
{
  "query": {
    "multi_match": {
      "query": "search value",
      "fields": [
        "field_1","field_2"]
    }
  }
}
GET /douban/book/_search
{
  "query": {
    "multi_match": {
      "query": "A",
      "fields": [
        "book_name","book_summary"]
    }
  }
}

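// Phrase query: [the words of the search value must all appear, adjacent and in the same order; documents matching only some of the words are not returned]
GET /index/type/_search
{
  "query": {
    "match_phrase": {
      "field": "search value"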
    }
  }
}
GET /douban/book/_search
{
  "query": {
    "match_phrase": {
      "book_summary": "a character"
    }
  }
}

// Field filtering: the results show only the specified fields
GET /index/type/_search
{
  "query": {
    "query condition"
  },
  "_source": [
    "displayed_field_1",
    "displayed_field_2"
    ]
}
GET /douban/book/_search
{
  "query": {
    "match": {
      "book_name": "Story"
    }
  },
  "_source": [
    "book_name",
    "book_id"
    ]
}

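// Highlighting: [matches of the search keyword are highlighted; the highlighted fragments appear under highlight in each hit, with the keyword wrapped in <em> tags]
// To highlight several fields, the search must also cover those fields
GET /index/type/_search
{
  "query": {
    "query condition"
  },
  "highlight": {
    "fields": {
      "highlighted_field_1": {}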
    }
  }
}
GET /douban/book/_search
{
  "query": {
    "match": {
      "book_summary": "Story"
    }
  },
  "highlight": {
    "fields": {
      "book_summary":{}
    }
  }
}
GET /douban/book/_search
{
  "query": {
    "multi_match": {
      "query": "Story",
      "fields": [
        "book_name","book_summary"]
    }

  },
  "highlight": {
    "fields": {
      "book_summary":{},
      "book_name":{}
    }
  }
}

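The code above demonstrates match-all search, full-text search on a single field, full-text search on several fields with one search value, phrase search, field filtering, and highlighting.

Using different search values on several fields requires combining conditions, so it is covered separately.
Background: SQL combines conditions with and, or, and not. ElasticSearch works a little differently; each construct is explained below:

bool: marks the enclosed clauses as a combination of conditions; it wraps the other clauses.
should: may contain several conditions; a result must satisfy at least one of them (this holds when the bool query has no must or filter clause; otherwise should clauses are optional and only affect the score).
must: may contain several conditions, all of which must hold.
must_not: may contain several conditions, none of which may hold.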
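
To make the three clause types concrete, here is a toy Python model of their matching rules, applied to the book names from the test data. It is only a sketch: the helper names are invented for this example, the whole-word test is a crude stand-in for a match clause, and scoring is ignored entirely.

```python
BOOKS = [
    {"book_id": 5, "book_name": "A Boy's Own Story"},
    {"book_id": 6, "book_name": "The Lost Language of Cranes"},
    {"book_id": 7, "book_name": "Immortality"},
]


def contains(doc, field, word):
    """Toy stand-in for a match clause: case-insensitive whole-word test."""
    return word.lower() in doc[field].lower().split()


def bool_filter(docs, must=(), should=(), must_not=()):
    """Return the ids accepted by a simplified bool query.

    Each clause is a (field, word) pair.
    """
    hits = []
    for doc in docs:
        # must: every clause has to match
        if not all(contains(doc, f, w) for f, w in must):
            continue
        # must_not: no clause may match
        if any(contains(doc, f, w) for f, w in must_not):
            continue
        # should: with no must clause, at least one should clause has to match
        if should and not must and not any(contains(doc, f, w) for f, w in should):
            continue
        hits.append(doc["book_id"])
    return hits


print(bool_filter(BOOKS, must_not=[("book_name", "Story")],
                  should=[("book_name", "Adventures"), ("book_name", "Immortality")]))
```

The printed call mirrors the must_not plus should combination: book 5 is excluded by must_not, and book 6 is dropped because it matches neither should clause.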

Examples:

// The book name must contain Story
GET /douban/book/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match":{
            "book_name":"Story"
          }
        }
      ]
    }
  }
}

// The book name must not contain Story
GET /douban/book/_search
{
  "query": {
    "bool": {
      "must_not": [
        {
          "match":{
            "book_name":"Story"
          }
        }
      ]
    }
  }
}

// The book name must not contain Story, and must contain Adventures or Immortality
GET /douban/book/_search
{
  "query": {
    "bool": {
      "must_not": [
        {
          "match":{
            "book_name":"Story"
          }
        }
      ],
      "should": [
        {
          "match": {
            "book_name": "Adventures"
          }
        },
        {
          "match": {
            "book_name": "Immortality"
          }
        }
      ]
    }
  }
}


// should, must, and must_not can each hold several conditions
GET /douban/book/_search
{
  "query": {
    "bool": {
      "must_not": [
        {
          "match":{
            "book_name":"Story"
          }
        },
        {
          "match": {
            "book_name": "Adventures"
          }
        }
      ]
    }
  }
}

// With a single condition, the [] can be omitted:
GET /douban/book/_search
{
  "query": {
    "bool": {
      "must_not": {
          "match":{
            "book_name":"Story"
          }
      }
    }
  }
}

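// Conditions can also be nested by adding another bool level, but pay attention to the logic. For example:
// Find books where (the name contains Story) or (the name contains The and the author name contains David); the second condition may or may not hold.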
GET /douban/book/_search
{
  "query": {
    "bool": {
      "should": [
        {
          "match": {
            "book_name": "Story"
          }
        },
        {
          "bool": {
            "must": [
              {
                "match": {
                  "book_name": "The"
                }
              },
              {
                "match": {
                  "book_author": "David"
                }
              }
            ]
          }
        }
      ]
    }
  }
}

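Note:

The above covered URL parameter search and request-body search: match-all search, full-text search on a single field, full-text search on several fields with one search value, phrase search, field filtering, and highlighting, as well as multi-condition search built on bool, should, must, and must_not. This is already enough to implement basic search features. Some topics are harder to grasp and are left for later chapters, such as specifying an analyzer for a search, setting how many conditions must match, and scroll queries.

Section summary:

The above covered URL parameter search and request-body search. URL parameter search writes the conditions in the URL, using ? to attach parameters, with q specifying the field and value to search. Request-body search writes the conditions in the request body: query is the outermost wrapper; match_all matches every document; match searches a field with a given search value; match_phrase searches for a contiguous phrase; _source filters fields (at the same level as query, with field names inside []); highlight enables highlighting (at the same level as query, in the form {fields:{field_1:{},field_2:{}}}); and bool, should, must, and must_not combine multiple conditions.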

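Filtering documents (filter)

Filtering looks a bit like conditional search, but conditional search takes relevance scores and analysis (tokenization) into account, whereas filtering does not; filtering has no effect on relevance. Filtering is generally used on structured data, that is, usually not on analyzed text, and most often on numeric and date fields.

When searching, if you do not want a condition to affect relevance, put it in a filter; if you do want it to affect relevance, put it in a conditional search.
Because filtering ignores relevance, the standalone filter queries shown in this section give every hit the same constant score, 1 by default.

Document filtering mainly uses five kinds of query: range, term, terms, exists, and missing. range compares field values by size; term mainly tests string and numeric values for equality; terms is the plural of term and accepts several values to test for equality; exists and missing test whether a document has or lacks a given field (missing existed in older versions such as 2.x and has since been removed; must_not combined with exists gives the same result).

Syntax and examples: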

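// range: gte is greater than or equal to, lte is less than or equal to, gt is greater than, lt is less than (there is no eq)
GET /index/type/_search
{
  "query": {
    "range": {
      "field_name": {
        "gte": comparison_value
        [,"lte": comparison_value]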
      }
    }
  }
}
GET /douban/book/_search
{
  "query": {
    "range": {
      "book_pages": {
        "gte": 352,
        "lt":400
      }
    }
  }
}


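// term matches string and numeric values exactly (which makes up for range having no eq), but it cannot be used directly on analyzed fields.
// [This is not as simple as it looks and is covered later. Matching directly against an analyzed field fails,
// because what is indexed for that field is the separate terms, not the complete original value, so the example below uses a non-analyzed numeric field.]
GET /index/type/_search
{
  "query": {
    "term": {
      "field": "search value"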
    }
  }
}
GET /douban/book/_search
{
  "query": {
    "term": {
      "book_pages": 352
    }
  }
}


// terms
GET /index/type/_search
{
  "query": {
    "terms": {
      "field": ["search value 1","search value 2"]
    }
  }
}
GET /douban/book/_search
{
  "query": {
    "terms": {
      "book_pages": [
        352,
        400
      ]
    }
  }
}

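The problem with term:

First, note that when you search you are not working directly against the original document data but against the inverted index. What does this mean? When you run a full-text search, your search value is not compared with the data files; it is matched against the inverted index. In other words, there is a dedicated search layer between us and the data files.

match is a full-text search, so there is not much to say: it finds the matching documents in the index through the indexed terms (match_all simply returns every document). match_phrase, which matches a sequence of words, is a little different. How does it work? It also consults the index, looking for records that contain every word of the search value with the words in the same positional order in the document. So phrase matching is also done through the inverted index.

The inverted index holds entries for every field: for analyzed fields it stores the individual indexed terms; for non-analyzed fields it stores the whole value. [An analyzed field can be given a keyword sub-field to keep the complete value; this is covered later.]

term search mainly targets non-analyzed data, so it cannot be used directly on analyzed fields unless the keyword sub-field is used.
See the official documentation on term.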
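
The difference can be sketched with a toy inverted index in Python. This is a simplified model, not ElasticSearch's actual implementation: the analyze function (lowercase, split on whitespace) and all helper names are assumptions made for the example.

```python
from collections import defaultdict

DOCS = {
    5: "A Boy's Own Story",
    6: "The Lost Language of Cranes",
    7: "Immortality",
}


def analyze(text):
    """Very rough stand-in for an analyzer: lowercase, split on whitespace."""
    return text.lower().split()


def build_index(docs, analyzed=True):
    """Map each indexed term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, value in docs.items():
        # An analyzed field stores separate terms; a keyword-style field
        # stores the whole value as a single term.
        terms = analyze(value) if analyzed else [value]
        for term in terms:
            index[term].add(doc_id)
    return index


def term_query(index, value):
    """term: look the value up as-is, without analyzing it."""
    return index.get(value, set())


def match_query(index, text):
    """match: analyze the search text, then union the postings of each term."""
    hits = set()
    for term in analyze(text):
        hits |= index.get(term, set())
    return hits


text_index = build_index(DOCS, analyzed=True)
keyword_index = build_index(DOCS, analyzed=False)

print(term_query(text_index, "A Boy's Own Story"))     # no such single term
print(term_query(keyword_index, "A Boy's Own Story"))  # whole value is a term
print(match_query(text_index, "A The"))                # either word matches
```

In this model the full book name finds nothing in the analyzed index, because only the separate words were stored, while the keyword-style index finds it immediately.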

filter and bool

Filters can also be combined with other conditions. For example:

GET /douban/book/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match":{
            "book_name":"Story"
          }
        },
        {
          "range": {
            "book_pages": {
              "lte":300
              }
          }
        }
      ]
    }
  }
}

GET /douban/book/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match":{
            "book_name":"Story"
          }
        },
        {
          "range": {
            "book_pages": {
              "lte":300
              }
          }
        },
        {
          "term": {
            "publish_date": "1994-02-01"
          }
        }
      ]
    }
  }
}

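When conditional search and filtering are combined like this, pay attention to the score. In the examples above the range and term conditions sit inside must, so they are scored like any other clause, and each contributes a constant score, 1 by default: if a document scores 0.2 with match alone, adding the range condition raises it to about 1.2. A condition placed in the bool query's filter clause is not scored at all and leaves _score unchanged.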
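
If the page limit should only restrict the hits without changing _score, it can go in the bool query's filter clause. The Python sketch below builds such a request body as a plain dictionary; the function name story_query is made up for this example, and the printed JSON is meant to be sent as the body of GET /douban/book/_search.

```python
import json


def story_query(max_pages):
    """Full-text match on book_name, with a page-count limit as a pure filter."""
    return {
        "query": {
            "bool": {
                # Scored clause: contributes to _score.
                "must": [{"match": {"book_name": "Story"}}],
                # Filter clause: restricts the hits, contributes nothing to _score.
                "filter": [{"range": {"book_pages": {"lte": max_pages}}}],
            }
        }
    }


body = story_query(300)
print(json.dumps(body, indent=2))
```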

constant_score

Filtering can also be written like this:

GET /douban/book/_search
{
    "query": {
        "constant_score": {
            "filter": {
                "range": {
                    "book_pages": {
                        "gte": 352,
                        "lt": 400
                    }
                }
            }
        }
    }
}

// boost sets the score that the filter assigns to each hit
GET /douban/book/_search
{
    "query": {
        "constant_score": {
            "filter": {
                "range": {
                    "book_pages": {
                        "gte": 352,
                        "lt": 400
                    }
                }
            },
            "boost": 1.2
        }
    }
}

cache

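Because filters do not care about relevance, elasticsearch temporarily caches their results in case they are needed again.
See the official documentation on filter caching.

Section summary:

This section introduced filtering, a kind of search that does not affect relevance. Filtering is usually applied to structured data, that is, data that is not analyzed: range filters by numeric range, term tests whether a string or numeric value is equal, and terms is the plural of term. Filters can also be combined with other conditions through bool. The relevance score a filter provides is a constant, 1 by default.

Aggregation analysis on documents

Preparing the data

First prepare a batch of test data: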

PUT /people/test/1
{
  "name":"lilei1",
  "age":18,
  "gender":1
}
PUT /people/test/2
{
  "name":"lilei2",
  "age":17,
  "gender":0
}

PUT /people/test/3
{
  "name":"lilei4",
  "age":21,
  "gender":1
}

PUT /people/test/4
{
  "name":"lilei4",
  "age":15,
  "gender":0
}
PUT /people/test/5
{
  "name":"lilei1 2",
  "age":15,
  "gender":0
}

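Just as SQL offers functions such as SUM(), MAX(), and AVG(), ElasticSearch provides functions for aggregation analysis.

Common aggregations in ElasticSearch include terms (grouping), avg (average), range (grouping by interval), max (maximum), min (minimum), cardinality (number of distinct values), and value_count (number of values without deduplication, which tells you how many values took part in the aggregation).

Syntax and examples:

Syntax: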

GET /index/type/_search
{
  "aggs": {
    "custom_aggregation_name": {
      "aggregation_function": {
        aggregation parameters
      }
    }
  }
}

Examples:

// Group by gender
GET /people/test/_search
{
  "aggs": {
    "group_by_gender": {
      "terms": {
        "field": "gender",
        "size": 10
      }
    }
  }
}
// Average age
GET /people/test/_search
{
  "aggs": {
    "avg_of_age": {
      "avg": {
        "field": "age"
      }
    }
  }
}
// Maximum age:
GET /people/test/_search
{
  "aggs": {
    "max_of_age": {
      "max": {
        "field": "age"
      }
    }
  }
}
// Put ages 15 to 17 in one group and ages 18 to 25 in another (from is inclusive, to is exclusive)
GET /people/test/_search
{
  "aggs": {
    "range_by_age": {
      "range": {
        "field": "age",
        "ranges": [
          {
            "from": 15,
            "to": 18
          },
          {
            "from": 18,
            "to": 26
          }
          }
        ]
      }
    }
  }
}
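
One detail of the range aggregation is easy to miss: each bucket includes its from value and excludes its to value. The following Python sketch imitates that rule on the ages from the test data; the function name and the bucket labels are inventions for this example and do not reproduce the keys ElasticSearch returns.

```python
AGES = [18, 17, 21, 15, 15]  # the ages from the test data above


def range_buckets(values, ranges):
    """Count values per bucket; `from` is inclusive and `to` is exclusive."""
    return {
        f"{lo}-{hi}": sum(1 for v in values if lo <= v < hi)
        for lo, hi in ranges
    }


# With to=17 the person aged 17 falls outside both buckets.
narrow = range_buckets(AGES, [(15, 17), (18, 25)])
# Raising each upper bound by one covers ages 15..17 and 18..25 as intended.
wide = range_buckets(AGES, [(15, 18), (18, 26)])
print(narrow, wide)
```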
// Count the distinct ages. For example, for ages [1,2,3,3,4,5] the result is 5, because 3 is counted only once
GET /people/test/_search
{
  "aggs": {
    "get_diff_age_count": {
      "cardinality": {
        "field": "age"
      }
    }
  }
}

Interpreting the returned results:

Other syntax:

Query first, then aggregate:

GET /people/test/_search
{
  "query": {
    "match": {
      "name": "lilei1"
    }
  }, 
  "aggs": {
    "avg_of_age": {
      "avg": {
        "field": "age"
      }
    }
  }
}

Filter first, then aggregate:

// First select ages greater than 15, then compute the average
GET /people/test/_search
{
  "query": {
    "range": {
      "age": {
        "gt":15
      }
    }
  }, 
  "aggs": {
    "avg_of_age": {
      "avg": {
        "field": "age"
      }
    }
  }
}

Nested aggregations:

// Group by gender first, then compute the average age of each group
GET /people/test/_search
{
    "aggs": {
        "group_by_gender": {
            "terms": {
                "field": "gender"
            },
            "aggs": {
                "avg_of_age": {
                    "avg": {
                        "field": "age"
                    }
                }
            }
        }
    }
}

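Aggregation + sorting:

// Group by gender first, then sort the groups by average age in descending order; avg_of_age in order is the custom name of the sub-aggregation below
GET /people/test/_search
{
    "aggs": {
        "group_by_gender": {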
            "terms": {
                "field": "gender",
                "order": {
                  "avg_of_age": "desc"
                }
            },
            "aggs": {
                "avg_of_age": {
                    "avg": {
                        "field": "age"
                    }
                }
            }
        }
    }
}
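
To see what the grouped and sorted result should look like, the same computation can be reproduced in plain Python on the five test documents. This is only an illustration of the logic (group, average, sort); the function name is hypothetical and the bucket dictionaries merely imitate the general shape of an aggregation response.

```python
from collections import defaultdict

PEOPLE = [
    {"name": "lilei1", "age": 18, "gender": 1},
    {"name": "lilei2", "age": 17, "gender": 0},
    {"name": "lilei4", "age": 21, "gender": 1},
    {"name": "lilei4", "age": 15, "gender": 0},
    {"name": "lilei1 2", "age": 15, "gender": 0},
]


def avg_age_by_gender(people):
    """Group by gender, average the ages, and sort groups by that average."""
    groups = defaultdict(list)
    for person in people:
        groups[person["gender"]].append(person["age"])
    buckets = [
        {"key": gender, "doc_count": len(ages), "avg_of_age": sum(ages) / len(ages)}
        for gender, ages in groups.items()
    ]
    # Equivalent of "order": {"avg_of_age": "desc"}
    buckets.sort(key=lambda bucket: bucket["avg_of_age"], reverse=True)
    return buckets


buckets = avg_age_by_gender(PEOPLE)
for bucket in buckets:
    print(bucket)
```

Gender 1 comes first because its two members average 19.5, ahead of the three members of gender 0.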

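Note:

Only some basic aggregations were covered above. Aggregation analysis is an important topic and will be revisited later.

Section summary:

This section covered how to aggregate data in ElasticSearch. aggs sits at the same level as query; each aggregation needs a custom name as its outer wrapper; avg computes the average, max the maximum, range groups by interval, and terms groups by value. Aggregations can be combined with conditional search and filtering, and they can also be nested.

Paginating and sorting documents

[Uses the data prepared in the previous section]

Pagination

// Starting from the first record, fetch two records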
GET /people/test/_search
{
  "from": 0,
  "size": 2
}

// You can query first, then paginate
GET /people/test/_search
{
  "query": {
    "match": {
      "name": "lilei1"
    }
  }, 
  "from": 0,
  "size": 1
}
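
Applications usually think in page numbers rather than offsets. A small helper, sketched here in Python under the assumption of 1-based page numbers (the function name is hypothetical), converts a page into the from and size values used above.

```python
def page_to_from_size(page, per_page):
    """Translate a 1-based page number into the from/size pair."""
    if page < 1 or per_page < 1:
        raise ValueError("page and per_page must be positive")
    return {"from": (page - 1) * per_page, "size": per_page}


print(page_to_from_size(1, 2))   # first page of two records
print(page_to_from_size(3, 10))  # third page of ten records
```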

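Sorting

[Note that in the results below _score is null: because you are sorting by the age field rather than by relevance, relevance is not meaningful here and is not computed.]

Sorting: [sort sits at the same level as query and is an array that can hold several sort parameters, each in the form {"FIELD":{"order":"desc/asc"}}]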
GET /people/test/_search
{
  "query": {
    "match_all": {}
  }, 
  "sort": [
    {
      "age": {
        "order": "desc"
      }
    }
  ]
}

deep paging

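Pagination and sorting share a common problem:

First, remember that the data in an index is scattered across several shards. How do you determine the order of the data on one shard relative to the data on another? For example, shard A might hold the values 1 and 3 while shard B holds 2 and 4, so the relative order of B's data and A's data is not known in advance. How, then, are these scattered values sorted? (Even when sorting by relevance, the relative relevance of scattered data is hard to determine.)
Pagination faces the same difficulty, because pagination also relies on sorting; by default results are sorted by relevance. Since the relative order of the scattered values is unknown, all candidate data has to be fetched and sorted before the page can be cut out, which means fetching far more data than the page itself contains.

Suppose there are two primary shards (call them A and B) and we want page 1001 with 10 records per page. In theory we only need records 10001 to 10010.

But at this point we do not know how the _score values in A and B compare. The smallest _score in A might be greater than the largest _score in B, or the other way round. (So we cannot simply take records 10001 to 10010 from each of A and B and compare them; the earlier records must be compared too, in case even the smallest _score on one shard is greater than the _score values on the other.) To guarantee a correct result, records 1 to 10010 have to be fetched from both A and B, sorted together, and only then can records 10001 to 10010 be taken.

So, as you can see, just to get ten records we actually have to fetch 10010 records from every shard. This is deep paging.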
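
The cost can be demonstrated with a small Python simulation. It is a deliberately simplified model: each shard is just a list of random scores, the function names are invented for this example, and real shards return document ids with sort values rather than bare numbers.

```python
import heapq
import random


def shard_top(scores, n):
    """What one shard hands back: its n best scores, best first."""
    return heapq.nlargest(n, scores)


def search_page(shards, from_, size):
    """Merge per-shard candidates and cut out the requested page."""
    need = from_ + size  # every shard must supply this many candidates
    candidates = []
    for scores in shards:
        candidates.extend(shard_top(scores, need))
    merged = sorted(candidates, reverse=True)
    return merged[from_:from_ + size], len(candidates)


random.seed(0)
shard_a = [random.random() for _ in range(20000)]
shard_b = [random.random() for _ in range(20000)]

page, fetched = search_page([shard_a, shard_b], from_=10000, size=10)
print(len(page), "records returned,", fetched, "candidates collected")
```

Ten records come back, but the two shards together had to supply 20020 candidates, and that number grows with both the page depth and the shard count.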