Elasticsearch - Search in Depth

Date: 2019-11-16

Tags: elasticsearch, search in depth


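Structured search

Structured search means querying data that has an inherent structure.

In a structured query the answer is always yes or no: a document either belongs to the result set or it does not.
Structured queries do not care about document relevance or scoring; they simply include or exclude documents.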

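Finding exact values

When looking for exact values, we use filters.

term query with numbers

We start with the most commonly used query, the term query, which can handle numbers, Booleans, dates, and text.

As a starting example, we create and index a few documents that represent products. Each document has the fields price and productID: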

POST /my_store/products/_bulk
{ "index": { "_id": 1 }}
{ "price" : 10, "productID" : "XHDK-A-1293-#fJ3" }
{ "index": { "_id": 2 }}
{ "price" : 20, "productID" : "KDKE-B-9947-#kL5" }
{ "index": { "_id": 3 }}
{ "price" : 30, "productID" : "JODL-X-1937-#pV7" }
{ "index": { "_id": 4 }}
{ "price" : 30, "productID" : "QQPX-R-3956-#aD8" }

SQL: find all products with a given price

SELECT document FROM products WHERE price = 20

DSL:

GET /my_store/products/_search
{
 "query": {"term": {
   "price": {
     "value": "20"
   }
 }}
}

The same query in shorthand form:

GET /my_store/products/_search
{
 "query": {
   "term": {
   "price": "20"
 }}
}
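
To make the yes/no behaviour concrete, here is a minimal Python sketch that mimics an exact-value lookup over the four sample products in memory. It does not call Elasticsearch; DOCS and term_filter are hypothetical names introduced only for this illustration.

```python
# In-memory copy of the four sample products (not a real Elasticsearch index).
DOCS = {
    1: {"price": 10, "productID": "XHDK-A-1293-#fJ3"},
    2: {"price": 20, "productID": "KDKE-B-9947-#kL5"},
    3: {"price": 30, "productID": "JODL-X-1937-#pV7"},
    4: {"price": 30, "productID": "QQPX-R-3956-#aD8"},
}


def term_filter(docs, field, value):
    """Return the sorted ids of documents whose field equals value exactly."""
    return sorted(doc_id for doc_id, doc in docs.items() if doc.get(field) == value)


print(term_filter(DOCS, "price", 20))  # [2]
print(term_filter(DOCS, "price", 30))  # [3, 4]
```

A document is either in the returned list or not; no ranking is involved.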

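Usually, when looking for an exact value, we do not want the query to be scored; we only want documents to be included or excluded. So we wrap the term query in a constant_score query, which runs it in non-scoring mode and gives every matching document the same score. That score equals the boost value (1.2 in the example below; the default is 1.0).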

GET /my_store/products/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "price": "20"
        }
      },
      "boost": 1.2
    }
  }  
}
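
The effect of constant_score can be sketched in a few lines of Python. This is only an illustration of the idea that every matching document receives a score equal to boost (1.0 unless specified); the function name and the prices dictionary are assumptions made for the example, not part of any Elasticsearch client.

```python
def constant_score(matching_ids, boost=1.0):
    """Give every matching document the same score, equal to boost."""
    return {doc_id: boost for doc_id in matching_ids}


# Hypothetical in-memory prices keyed by document id.
prices = {1: 10, 2: 20, 3: 30, 4: 30}

# The "filter" step: keep only documents whose price is exactly 30.
matches = [doc_id for doc_id, price in prices.items() if price == 30]

print(constant_score(matches, boost=1.2))  # {3: 1.2, 4: 1.2}
print(constant_score(matches))             # {3: 1.0, 4: 1.0}
```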

term query with text

SQL:

SELECT product FROM products WHERE productID = "XHDK-A-1293-#fJ3"

DSL:

GET /my_store/products/_search
{
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "productID": "XHDK-A-1293-#fJ3"
        }
      },
      "boost": 1.2
    }
  }
}

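But there is a small problem: we do not get the result we expect. Why? The problem is not the term query; it is the way the data was indexed.
If we use the analyze API, we can see that the UPC has been split into several smaller tokens.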

GET /my_store/_analyze
{
  "field": "productID",
  "text": "XHDK-A-1293-#fJ3"
}

{
  "tokens" : [ {
    "token" :        "xhdk",
    "start_offset" : 0,
    "end_offset" :   4,
    "type" :         "<ALPHANUM>",
    "position" :     1
  }, {
    "token" :        "a",
    "start_offset" : 5,
    "end_offset" :   6,
    "type" :         "<ALPHANUM>",
    "position" :     2
  }, {
    "token" :        "1293",
    "start_offset" : 7,
    "end_offset" :   11,
    "type" :         "<NUM>",
    "position" :     3
  }, {
    "token" :        "fj3",
    "start_offset" : 13,
    "end_offset" :   16,
    "type" :         "<ALPHANUM>",
    "position" :     4
  } ]
}

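A few points are worth noting here:

Elasticsearch represents this UPC with four separate tokens instead of a single token.
All letters have been lowercased.
The hyphens and the hash sign ( # ) have been lost.

So when we use a term query to look for the exact value XHDK-A-1293-#fJ3, no document is found, because that value is not in our inverted index. As the analysis output above shows, the index contains four tokens instead.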
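
The mismatch can be reproduced with a small Python sketch. The analyze function below is only a rough stand-in for the real analyzer (it splits on non-alphanumeric characters and lowercases, which happens to give the same tokens for this UPC); the function names and the two-document sample are assumptions for illustration.

```python
import re
from collections import defaultdict


def analyze(text):
    """Rough stand-in for analysis: split on non-alphanumerics, then lowercase."""
    return [token.lower() for token in re.findall(r"[A-Za-z0-9]+", text)]


def build_inverted_index(docs):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, value in docs.items():
        for token in analyze(value):
            index[token].add(doc_id)
    return index


product_ids = {1: "XHDK-A-1293-#fJ3", 2: "KDKE-B-9947-#kL5"}
index = build_inverted_index(product_ids)

print(analyze("XHDK-A-1293-#fJ3"))           # ['xhdk', 'a', '1293', 'fj3']
print(index.get("XHDK-A-1293-#fJ3", set()))  # set()
print(index.get("xhdk", set()))              # {1}
```

The exact-value lookup returns nothing because the full string was never stored as a term, while a lookup for one of the lowercased tokens succeeds.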

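To avoid this problem, we need to tell Elasticsearch that the field holds exact values and must not be analyzed. Since Elasticsearch 5.0 this is done by mapping the field as type keyword, which replaces the older string field with index set to not_analyzed. The mapping of an existing field cannot be changed, so we delete the index, recreate it with the correct mapping, index the documents again with the bulk request shown earlier, and then rerun the query: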

DELETE /my_store 

PUT /my_store
{
  "mappings": {
    "products":{
      "properties": {
        "productID":{
          "type": "keyword"
        }
      }
    }
  }
}

GET /my_store/products/_search
{
    "query" : {
        "constant_score" : {
            "filter" : {
                "term" : {
                    "productID" : "XHDK-A-1293-#fJ3"
                }
            }
        }
    }
}
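
As a sketch of why the unanalyzed mapping fixes the lookup: when the whole value is stored as a single term, an exact match finds it, and the match is case-sensitive. The Python below is an in-memory illustration only; build_keyword_index and the sample dictionary are hypothetical names, not Elasticsearch APIs.

```python
from collections import defaultdict


def build_keyword_index(docs):
    """Index each value unchanged as one term, as an unanalyzed field would."""
    index = defaultdict(set)
    for doc_id, value in docs.items():
        index[value].add(doc_id)
    return index


product_ids = {
    1: "XHDK-A-1293-#fJ3",
    2: "KDKE-B-9947-#kL5",
    3: "JODL-X-1937-#pV7",
    4: "QQPX-R-3956-#aD8",
}
keyword_index = build_keyword_index(product_ids)

print(sorted(keyword_index.get("XHDK-A-1293-#fJ3", set())))  # [1]
print(sorted(keyword_index.get("xhdk-a-1293-#fj3", set())))  # []
```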

Combining filters

SQL:

SELECT product
FROM   products
WHERE  (price = 20 OR productID = "XHDK-A-1293-#fJ3")
  AND  (price != 30)

DSL:

must
All of these clauses must match. Equivalent to AND.
must_not
None of these clauses may match. Equivalent to NOT.
should
At least one of these clauses must match. Equivalent to OR.

GET /my_store/products/_search
{
   "query" : {
            "bool" : {
              "should" : [
                 { "term" : {"price" : 20}}, 
                 { "term" : {"productID" : "XHDK-A-1293-#fJ3"}} 
              ],
              "must_not" : {
                 "term" : {"price" : 30} 
              }
           }
   }
}
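
The way the three clause types combine can be sketched in Python over the four sample products. This follows the simplified semantics listed above (should requires at least one match whenever it is present); real Elasticsearch has further rules, such as minimum_should_match, that this sketch ignores. All names here (DOCS, term, bool_filter) are hypothetical.

```python
# In-memory copy of the four sample products (not a real Elasticsearch index).
DOCS = {
    1: {"price": 10, "productID": "XHDK-A-1293-#fJ3"},
    2: {"price": 20, "productID": "KDKE-B-9947-#kL5"},
    3: {"price": 30, "productID": "JODL-X-1937-#pV7"},
    4: {"price": 30, "productID": "QQPX-R-3956-#aD8"},
}


def term(field, value):
    """Build a predicate that tests a document field for an exact value."""
    return lambda doc: doc.get(field) == value


def bool_filter(docs, must=(), should=(), must_not=()):
    """Keep documents that pass all must, no must_not, and any should clause."""
    hits = []
    for doc_id, doc in docs.items():
        if not all(clause(doc) for clause in must):
            continue
        if any(clause(doc) for clause in must_not):
            continue
        if should and not any(clause(doc) for clause in should):
            continue
        hits.append(doc_id)
    return sorted(hits)


result = bool_filter(
    DOCS,
    should=[term("price", 20), term("productID", "XHDK-A-1293-#fJ3")],
    must_not=[term("price", 30)],
)
print(result)  # [1, 2]
```

Document 1 matches through its productID and document 2 through its price; documents 3 and 4 are excluded because their price is 30.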