Search Engines (Elasticsearch Search in Detail)

Date: 2019-11-11


After completing this topic, you should be able to:

Understand the rules and usage of the ES search API.
Use each of the query types.

Search API

Search API endpoints

GET /twitter/_search?q=user:kimchy

GET /twitter/tweet,user/_search?q=user:kimchy

GET /kimchy,elasticsearch/_search?q=tag:wow

GET /_all/_search?q=tag:wow

GET /_search?q=tag:wow

The search endpoint can span multiple indices and multiple mapping types. Search parameters can be passed as URI request parameters or in the request body.

URI Search

URI search specifies the query parameters through the URI, which lets us run a quick query.

GET /twitter/_search?q=user:kimchy

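For the available parameters, see: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-uri-request.html

The parameters allowed in the URI are:
q
The query string (maps to the query_string query; see the query string query for more details).
df
The default field to use when no field prefix is defined in the query.
analyzer
The name of the analyzer to use when analyzing the query string.
analyze_wildcard
Whether wildcard and prefix queries should be analyzed. Defaults to false.
batched_reduce_size
The number of shard results that should be reduced at once on the coordinating node. Use this value as a protection mechanism to reduce the memory overhead per search request when the potential number of shards in the request can be large.
default_operator
The default operator to use, either AND or OR. Defaults to OR.
lenient
If set to true, format-based failures (such as providing text to a numeric field) are ignored. Defaults to false.
explain
For each hit, include an explanation of how its score was computed.
_source
Set to false to disable retrieval of the _source field. You can also retrieve part of the document with _source_include and _source_exclude (see the request body documentation for more details).
stored_fields
A comma-separated list of the stored fields to return for each hit. Specifying no value causes no fields to be returned.
sort
The sorting to perform. Can be fieldName, fieldName:asc, or fieldName:desc. fieldName can be an actual field in the document, or the special name _score to sort by score. There can be several sort parameters (order matters).
track_scores
When sorting, set to true to still track scores and return them as part of each hit.
timeout
A search timeout that bounds the search request to the specified time value; when it expires, the hits accumulated up to that point are returned. Defaults to no timeout.
terminate_after
The maximum number of documents to collect for each shard; once it is reached, query execution terminates early. If set, the response carries a boolean field terminated_early indicating whether query execution actually terminated early. Defaults to no terminate_after.
from
The offset of the first hit to return. Defaults to 0.
size
The number of hits to return. Defaults to 10.
search_type
The type of search operation to perform. Can be dfs_query_then_fetch or query_then_fetch. Defaults to query_then_fetch. See the search type documentation for more details on the different types of search.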

Understanding the query result

{
    "took": 1,               // time taken, in milliseconds
    "timed_out": false,      // whether the search timed out
    "_shards":{              // how many shards were searched
        "total" : 1,
        "successful" : 1,
        "skipped" : 0,
        "failed" : 0
    },
    "hits":{                 // the search results
        "total" : 1,         // total number of hits
        "max_score": 1.3862944,  // highest score
        "hits" : [                // array of result documents on this page
            {
                "_index" : "twitter",  // one document
                "_type" : "_doc",
                "_id" : "0",
                "_score": 1.3862944,
                "_source" : {
                    "user" : "kimchy",
                    "message": "trying out Elasticsearch",
                    "date" : "2009-11-15T14:12:12",
                    "likes" : 0
                }
            }
        ]
    }
}

Special uses of query parameters

If we only want to know how many documents match a query, we can use the parameters like this:

GET /bank/_search?q=city:b*&size=0

If we only want to know whether any document matches a query, we can use the parameters like this:

GET /bank/_search?q=city:b*&size=0&terminate_after=1

Compare the results of the two queries. What is the difference?

Request body Search

Request body search defines the query in the request body as JSON. The request method can be GET or POST.

GET /twitter/_search
{
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}

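Available parameters:

timeout: the request timeout; the response is returned within this time even if the search has not finished.
from: the offset of the first row of the page. Defaults to 0.
size: the page size.
request_cache: whether to cache the results of requests where size is 0 (aggregations and suggestions). When unset, the index-level setting applies.
terminate_after: the maximum number of documents to collect for each shard. If set, the response carries a boolean field terminated_early indicating whether query execution actually terminated early. Defaults to no terminate_after.
search_type: how the search is executed, either dfs_query_then_fetch or query_then_fetch. Defaults to query_then_fetch.
batched_reduce_size: the number of shard results that should be reduced at once on the coordinating node. Use this value as a protection mechanism to reduce the memory overhead per search request when the potential number of shards in the request can be large.

Defining the query with the query element

The query element defines the query using the Query DSL.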

GET /_search
{
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}

Specifying what to return

Source filtering selects which parts of the _source field are returned

GET /_search
{
    "_source": false,
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}

GET /_search
{
    "_source": [ "obj1.*", "obj2.*" ],
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}

GET /_search
{
    "_source": "obj.*",
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}

GET /_search
{
    "_source": {
        "includes": [ "obj1.*", "obj2.*" ],
        "excludes": [ "*.description" ]
    },
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}

stored_fields specifies which stored fields to return

GET /_search
{
    "stored_fields" : ["user", "postDate"],
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}

docvalue_fields returns the doc value representation of a field

GET /_search
{
    "query" : {
        "match_all": {}
    },
    "docvalue_fields" : ["test1", "test2"]
}

version returns the version of each document

GET /_search
{
    "version": true,
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}

explain returns an explanation of each document's score

GET /_search
{
    "explain": true,
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}

Script fields evaluate a script over the fields of each hit and return the result

GET /bank/_search
{
  "query": {
    "match_all": {}
  },
  "script_fields": {
    "test1": {
      "script": {
        "lang": "painless",
        "source": "doc['balance'].value * 2"   // doc refers to the current document
      }
    },
    "test2": {
      "script": {
        "lang": "painless",
        "source": "doc['age'].value * params.factor",
        "params": {
          "factor": 2
        }
      }
    }
  }
}

GET /bank/_search
{
  "query": {
    "match_all": {}
  },
  "script_fields": {
    "ffx": {
      "script": {
        "lang": "painless",
        "source": "doc['age'].value * doc['balance'].value"
      }
      },
      "balance*2": {
        "script": {
          "lang": "painless",
          // params['_source'] reads the value from the _source field.
          // The official docs recommend doc[...] instead, because it is faster than loading _source.
          "source": "params['_source'].balance*2"
        }
      }
  }
}

Filtering

min_score sets a minimum score; hits scoring below it are dropped.

GET /_search
{
    "min_score": 0.5,
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}

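post_filter is a post filter: it is applied to the hits after the query has matched documents and the aggregations have been computed.

For example: in a single search, find the shirts whose brand is gucci and whose color is red, while also getting a facet count of gucci shirts for every color.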

PUT /shirts
{
    "mappings": {
        "_doc": {
            "properties": {
                "brand": { "type": "keyword"},
                "color": { "type": "keyword"},
                "model": { "type": "keyword"}
            }
        }
    }
}

PUT /shirts/_doc/1?refresh
{
    "brand": "gucci",
    "color": "red",
    "model": "slim"
}
PUT /shirts/_doc/2?refresh
{
    "brand": "gucci",
    "color": "green",
    "model": "seec"
}

GET /shirts/_search
{
  "query": {
    "bool": {
      "filter": {
        "term": { "brand": "gucci" } 
      }
    }
  },
  "aggs": {
    "colors": {
      "terms": { "field": "color" } 
    }
  },
  "post_filter": { 
    "term": { "color": "red" }
  }
}
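
The order of operations in the request above can be mimicked in a few lines of Python. This is only an in-memory sketch of the semantics, not how Elasticsearch is implemented; the function name search_with_post_filter and the lambda predicates are made up for illustration.

```python
from collections import Counter


def search_with_post_filter(docs, query, agg_field, post_filter):
    """Mimic the order: query -> aggregations -> post_filter on the hits."""
    matched = [d for d in docs if query(d)]
    # Aggregations see every document matched by the main query.
    buckets = Counter(d[agg_field] for d in matched)
    # post_filter narrows only the returned hits, not the buckets.
    hits = [d for d in matched if post_filter(d)]
    return hits, dict(buckets)


shirts = [
    {"brand": "gucci", "color": "red", "model": "slim"},
    {"brand": "gucci", "color": "green", "model": "seec"},
]
hits, colors = search_with_post_filter(
    shirts,
    query=lambda d: d["brand"] == "gucci",
    agg_field="color",
    post_filter=lambda d: d["color"] == "red",
)
```

Only the red shirt comes back as a hit, while the color buckets still count both gucci shirts. Moving the color condition into the main query would shrink the buckets as well.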

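sort

Results can be sorted on one or more fields. The special name _score sorts by score and _doc sorts by index order. The default is to sort by relevance score, highest first.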

GET /bank/_search
{
  "query": {
    "match_all": {}
  },
  // order can be asc or desc. The default is asc, except for _score, which defaults to desc.
  "sort": [
    {
      "age": {
        "order": "desc"
      }
    },
    {
      "balance": {
        "order": "asc"
      }
    },
    "_score"
  ]
}

"hits": {
    "total": 1000,
    "max_score": null,
    "hits": [
      {
        "_index": "bank",
        "_type": "_doc",
        "_id": "549",
        "_score": 1,
        "_source": {
          "account_number": 549,
          "balance": 1932, "age": 40, "state": "OR"
        },
        // every hit carries the values it was sorted by
        "sort": [
          40,
          1932,
          1
        ]
      }

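Sorting on multi-valued fields

Fields whose value is an array or otherwise multi-valued can also be sorted on. The mode parameter picks which value derived from the multiple values is used as the sort key: min, max, sum, avg, or median.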

PUT /my_index/_doc/1?refresh
{
   "product": "chocolate",
   "price": [20, 4]
}

POST /_search
{
   "query" : {
      "term" : { "product" : "chocolate" }
   },
   "sort" : [
      {"price" : {"order" : "asc", "mode" : "avg"}}
   ]
}
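
To make the mode parameter concrete, here is a small Python sketch of how a multi-valued field could be reduced to a single sort key. The function sort_value is a hypothetical helper written for this note; it illustrates the arithmetic only.

```python
from statistics import median


def sort_value(values, mode):
    """Reduce a multi-valued field to the single value used for sorting."""
    reducers = {
        "min": min,
        "max": max,
        "sum": sum,
        "avg": lambda v: sum(v) / len(v),
        "median": median,
    }
    return reducers[mode](values)


price = [20, 4]
avg_key = sort_value(price, "avg")
```

For the chocolate document above, mode avg sorts on 12.0, mode min on 4, and mode max on 20.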

Missing values: how to treat documents that lack the sort field

GET /_search
{
    "sort" : [
        { "price" : {"missing" : "_last"} }
    ],
    "query" : {
        "term" : { "product" : "chocolate" }
    }
}

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-request-sort.html#geo-sorting

Geo distance sorting

GET /_search
{
    "sort" : [
        {
            "_geo_distance" : {
                "pin.location" : [-70, 40],
                "order" : "asc",
                "unit" : "km",
                "mode" : "min",
                "distance_type" : "arc"
            }
        }
    ],
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}

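_geo_distance: the keyword for sorting by distance
pin.location: a field of type geo_point
distance_type: how the distance is computed, arc (on a sphere) or plane (on a flat surface)
unit: the distance unit, for example km or m. Defaults to m.

Script-based sorting: sort by a value computed by a script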

GET /_search
{
    "query" : {
        "term" : { "user" : "kimchy" }
    },
    "sort" : {
        "_script" : {
            "type" : "number",
            "script" : {
                "lang": "painless",
                "source": "doc['field_name'].value * params.factor",
                "params" : {
                    "factor" : 1.1
                }
            },
            "order" : "asc"
        }
    }
}

Collapsing

Use collapse to collapse the hits on the value of a field

GET /bank/_search
{
    "query": {
        "match_all": {}
    },
    "collapse" : {
        "field" : "age" 
    },
    "sort": ["balance"] 
}

GET /bank/_search
{
    "query": {
        "match_all": {}
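    },
    "collapse" : {
        "field" : "age",
        // inner_hits expands each collapsed group
        "inner_hits": {
            "name": "details",                // a name of our choosing
            "size": 5,                        // how many documents to return per group
            "sort": [{ "balance": "asc" }]    // sort order within the group
        },
        "max_concurrent_group_searches": 4    // concurrency of the per-group searches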
    },
    "sort": ["balance"] 
}
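
What collapsing does to the top-level hits can be sketched in Python as follows. This is a simplified in-memory model under the assumption that hits are plain dicts; the helper name collapse and the sample accounts are invented for the illustration.

```python
def collapse(hits, field, sort_key):
    """Sort the hits, then keep only the first hit for each distinct value of `field`."""
    seen = set()
    collapsed = []
    for hit in sorted(hits, key=lambda h: h[sort_key]):
        if hit[field] not in seen:
            seen.add(hit[field])
            collapsed.append(hit)
    return collapsed


accounts = [
    {"account_number": 1, "age": 30, "balance": 500},
    {"account_number": 2, "age": 30, "balance": 100},
    {"account_number": 3, "age": 40, "balance": 300},
]
top_per_age = collapse(accounts, field="age", sort_key="balance")
```

Each age appears once, represented by its lowest-balance account. The inner_hits option is what brings back further members of each group.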

Returning several top-N views of each group in inner_hits

GET /twitter/_search
{
    "query": {
        "match": {
            "message": "elasticsearch"
        }
    },
    "collapse" : {
        "field" : "user", 
        "inner_hits": [
            {
                "name": "most_liked",  
                "size": 3,
                "sort": ["likes"]
            },
            {
                "name": "most_recent", 
                "size": 3,
                "sort": [{ "date": "asc" }]
            }
        ]
    },
    "sort": ["likes"]
}

Pagination

from and size

GET /_search
{
    "from" : 0, "size" : 10,
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}

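Note: the heap memory and time consumed by a search request are proportional to from + size, so the deeper the page, the higher the cost. To keep pagination from causing an OOM or seriously hurting performance, ES requires that from + size not exceed the index setting index.max_result_window, which defaults to 10,000.

What if we need deep pagination that is not bound by index.max_result_window?

search_after retrieves the documents that come after a given document, which makes it usable for deep pagination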

GET twitter/_search
{
    // first request: fetch the first page
    "size": 10,
    "query": {
        "match" : {
            "title" : "elasticsearch"
        }
    },
    "sort": [
        {"date": "asc"},
        {"_id": "desc"}
    ]
}

GET twitter/_search
{
    // requests for the following pages
    "size": 10,
    "query": {
        "match" : {
            "title" : "elasticsearch"
        }
    },
    "search_after": [1463538857, "654323"],
    "sort": [
        {"date": "asc"},
        {"_id": "desc"}
    ]
}

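Note: search_after requires the query to specify a sort, and the combined sort values must be unique for every document (ideally the sort includes the _id field). The value passed to search_after is exactly this set of sort values, taken from the last hit of the previous page. When search_after is used, from must be 0 or -1.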
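
The page-to-page bookkeeping can be sketched in Python. The helper next_page_body below is hypothetical; it only builds the request body for the following page from the last hit of the current page and leaves sending the request to whatever client is in use.

```python
import copy


def next_page_body(body, last_hit):
    """Return a copy of `body` that continues after `last_hit` using its sort values."""
    if "sort" not in body:
        raise ValueError("search_after requires an explicit sort")
    following = copy.deepcopy(body)
    following.pop("from", None)  # from must stay at its default with search_after
    following["search_after"] = last_hit["sort"]
    return following


first_page = {
    "size": 10,
    "query": {"match": {"title": "elasticsearch"}},
    "sort": [{"date": "asc"}, {"_id": "desc"}],
}
# The last hit of the first page, reduced to the parts that matter here.
last_hit = {"_id": "654323", "sort": [1463538857, "654323"]}
second_page = next_page_body(first_page, last_hit)
```

The query and sort stay identical from page to page; only search_after changes.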

Highlighting

PUT /hl_test/_doc/1
{
  "title": "lucene solr and elasticsearch",
  "content": "lucene solr and elasticsearch for search"
}

GET /hl_test/_search
{
  "query": {
    "match": {
      "title": "lucene"
    }
  },
  "highlight": {
    "fields": {
      "title": {},
      "content": {}
    }
  }
}

GET /hl_test/_search
{
  "query": {
    "match": {
      "title": "lucene"
    }
  },
  "highlight": {
    "require_field_match": false,
    "fields": {
      "title": {},
      "content": {}
    }
  }
}

The highlighted fragments are returned under the highlight node of each hit:

"highlight": {
  "title": [
    "<em>lucene</em> solr and elasticsearch"
  ]
}

GET /hl_test/_search
{
  "query": {
    "match": {
      "title": "lucene"
    }
  },
  "highlight": {
    "require_field_match": false,
    "fields": {
      // custom highlight tags
      "title": {
        "pre_tags":["<strong>"],
        "post_tags": ["</strong>"]
      },
      "content": {}
    }
  }
}

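Profile: for debugging and optimization

When a query runs slowly, we want to know why and where the time goes. Adding profile to the request returns the detailed execution steps and their timings.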

GET /twitter/_search
{
  "profile": true,
  "query" : {
    "match" : { "message" : "some number" }
  }
}

For an explanation of the output, see: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-profile.html

Count API: count the matching documents

PUT /twitter/_doc/1?refresh
{
    "user": "kimchy"
}

GET /twitter/_doc/_count?q=user:kimchy

GET /twitter/_doc/_count
{
    "query" : {
        "term" : { "user" : "kimchy" }
    }
}

{
    "count" : 1,
    "_shards" : {
        "total" : 5,
        "successful" : 5,
        "skipped" : 0,
        "failed" : 0
    }
}

Validate API

Checks whether a query is valid and shows the low-level query it is turned into.

GET twitter/_validate/query?q=user:foo

GET twitter/_doc/_validate/query
{
  // validate the query
  "query": {
    "query_string": {
      "query": "post_date:foo",
      "lenient": false
    }
  }
}

GET twitter/_doc/_validate/query?explain=true
{
  // get an explanation of the query
  "query": {
    "query_string": {
      "query": "post_date:foo",
      "lenient": false
    }
  }
}

GET twitter/_doc/_validate/query?rewrite=true
{
  // rewrite gives a more detailed explanation than explain
  "query": {
    "more_like_this": {
      "like": {
        "_id": "2"
      },
      "boost_terms": 1
    }
  }
}

GET twitter/_doc/_validate/query?rewrite=true&all_shards=true
{
  // get the query explanation from all shards
  "query": {
    "match": {
      "user": {
        "query": "kimchy",
        "fuzziness": "auto"
      }
    }
  }
}

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-validate.html

Explain API

Returns the score explanation of a query for a specific document, and whether that document matches the query.

GET /twitter/_doc/0/_explain
{
      "query" : {
        "match" : { "message" : "elasticsearch" }
      }
}

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-explain.html

Search Shards API

Shows which indices, shards, and nodes a search request would be executed against.

GET /twitter/_search_shards

To find out which shards and nodes a search with given routing values would run on:

GET /twitter/_search_shards?routing=foo,baz

Search Template

POST _scripts/<templatename>
{
    "script": {
        "lang": "mustache",
        "source": {
            "query": {
                "match": {
                    "title": "{{query_string}}"
                }
            }
        }
    }
}
The request above registers a template.

GET _search/template
{
    "id": "<templateName>", 
    "params": {
        "query_string": "search for these words"
    }
}
The request above runs a search using the template.
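
The substitution a search template performs can be approximated in Python. This sketch handles only plain {{name}} placeholders and is not a mustache implementation; the render function is a hypothetical helper written for this note.

```python
import re


def render(template, params):
    """Replace {{name}} placeholders in every string of a nested template."""
    if isinstance(template, dict):
        return {key: render(value, params) for key, value in template.items()}
    if isinstance(template, list):
        return [render(value, params) for value in template]
    if isinstance(template, str):
        return re.sub(r"\{\{(\w+)\}\}", lambda m: str(params[m.group(1)]), template)
    return template


source = {"query": {"match": {"title": "{{query_string}}"}}}
rendered = render(source, {"query_string": "search for these words"})
```

The rendered body is the query that is actually executed, here a match query on title for the supplied words.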

For details, see the official documentation:

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-template.html

Query DSL

Official documentation:

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html

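What is a DSL?

Domain Specific Language.

Elasticsearch provides a full query DSL, based on JSON, for defining queries.

A query is built from two kinds of clauses:

1. Leaf query clauses
Leaf query clauses look for a particular value in a particular field, for example the match, term, or range queries. They can be used on their own.
2. Compound query clauses
Compound query clauses combine multiple leaf or compound queries into a single query in a logical fashion.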

Query and filter context

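The behavior of a query clause depends on whether it is used in query context or in filter context.

Query context

A clause used in query context answers the question "How well does this document match this query?". Besides deciding whether the document matches, it also computes a relevance score that expresses how well the document matches. Query context is expressed by the query element.

Filter context

Filter context is expressed by the filter element, or by must_not inside bool. A clause used in filter context answers the question "Does this document match this query?" and takes no part in relevance scoring.

Frequently used filters are cached automatically by ES to speed up queries.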

GET /_search
{
  // query context
  "query": {
    "bool": { 
      "must": [
        { "match": { "title":   "Search"        }}, 
        { "match": { "content": "Elasticsearch" }}  
      ],
      // filter context
      "filter": [
        { "term":  { "status": "published" }}, 
        { "range": { "publish_date": { "gte": "2015-01-01" }}} 
      ]
    }
  }
}
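Tip: use query clauses in query context for conditions that should affect the score of matching documents, and put all other query clauses in filter context.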


Match all query

match_all matches all documents; match_none, its inverse, matches none.

GET /_search
{
    "query": {
        "match_all": {}
    }
}

GET /_search
{
    "query": {
        "match_none": {}
    }
}

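Full text queries

Full text queries search analyzed text fields. The query text is analyzed with the analyzer of the queried field, and the resulting terms form the query. They cover phrase, fuzzy, prefix, and proximity search scenarios.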

https://www.elastic.co/guide/en/elasticsearch/reference/current/full-text-queries.html

match query

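The standard query for full text search; it also supports fuzzy matching and phrase queries on a field. match queries accept text, numerics, and dates, analyze them, and construct a boolean query. The operator parameter sets how the boolean clauses are combined (or, and; the default is or), and minimum_should_match sets how many of the optional should (or) clauses must match. The analyzer parameter selects a specific analyzer for the query text.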

GET /_search
{
    "query": {
        "match" : {
            "message" : "this is a test"
        }
    }
}

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html

match query examples

PUT /ftq/_doc/1
{
  "title": "lucene solr and elasticsearch",
  "content": "lucene solr and elasticsearch for search"
}

PUT /ftq/_doc/2
{
  "title": "java spring boot",
  "content": "lucene is writerd by java"
}

GET ftq/_search
{
  "query": {
    "match": {
      "title": "lucene java"
    }
  }
}

GET ftq/_doc/_validate/query?rewrite=true
{
  // inspect the query that is actually executed
  "query": {
    "match": {
      "title": "lucene java"
    }
  }
}

GET ftq/_search
{
  "query": {
    "match": {
      "title": {
        "query": "lucene java",
        "operator": "and"
      }
    }
  }
}
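
The way a match query turns into a boolean query of terms can be sketched in Python. The function match_to_bool is hypothetical, and splitting on whitespace with lowercasing is only a stand-in assumption for the field's real analyzer.

```python
def match_to_bool(field, text, operator="or"):
    """Build the bool query that a match query on `text` roughly corresponds to."""
    # Assumption: whitespace splitting plus lowercasing stands in for the analyzer.
    terms = text.lower().split()
    clauses = [{"term": {field: term}} for term in terms]
    occur = "must" if operator == "and" else "should"
    return {"bool": {occur: clauses}}


either = match_to_bool("title", "lucene java")
both = match_to_bool("title", "lucene java", operator="and")
```

With the default or operator a document containing either term matches, so both sample documents are returned; with and, a title must contain both terms, and neither sample title does.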

GET ftq/_search
{
  "query": {
    "match": {
      "title": {
        "query": "ucen elatic",
        "fuzziness": 2                // fuzzy matching with a maximum edit distance of 2
      }
    }
  }
}
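
The edit distance that fuzziness refers to can be computed with the classic dynamic-programming algorithm. The sketch below implements plain Levenshtein distance (insert, delete, substitute); treating a transposition as one edit, as Elasticsearch does by default, is deliberately left out, so this is an approximation for illustration.

```python
def edit_distance(a, b):
    """Levenshtein distance between two strings, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(
                prev[j] + 1,               # delete ca
                cur[j - 1] + 1,            # insert cb
                prev[j - 1] + (ca != cb),  # substitute, free when equal
            ))
        prev = cur
    return prev[-1]


distance_to_lucene = edit_distance("ucen", "lucene")
```

The term ucen is two insertions away from lucene, so it matches with fuzziness 2 but would not with fuzziness 1.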

GET ftq/_search
{
  "query": {
    "match": {
      "content": {
        "query": "ucen elatic java",
        "fuzziness": 2,
        "minimum_should_match": 2     // at least two of the terms must match
      }
    }
  }
}

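max_expansions sets the maximum number of terms a fuzzy match may expand to. It defaults to 50. For example, if 100 terms in the inverted index fuzzily match ucen, only the first 50 are used.

match phrase query

The match_phrase query runs a phrase search on a field. It accepts an analyzer and a slop factor, which is the number of positions the terms may be moved apart.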

GET ftq/_search
{
  "query": {
    "match_phrase": {
      "title": "lucene solr"
    }
  }
}

GET ftq/_search
{
  "query": {
    "match_phrase": {
      "title": "lucene elasticsearch"
    }
  }
}

GET ftq/_search
{
  "query": {
    "match_phrase": {
      "title": {
        "query": "lucene elasticsearch",
        "slop": 2
      }
    }
  }
}

match phrase prefix query

match_phrase_prefix builds on match_phrase and additionally treats the last term of the phrase as a prefix.

GET /_search
{
    "query": {
        "match_phrase_prefix" : {
            "message" : "quick brown f"
        }
    }
}

GET /_search
{
    "query": {
        "match_phrase_prefix" : {
            "message" : {
                "query" : "quick brown f",
                "max_expansions" : 10
            }
        }
    }
}
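max_expansions caps the number of terms the prefix may expand to.

Multi match query

To run a text search over several fields, use multi_match. It builds on match and supports querying multiple fields with the same text.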

GET ftq/_search
{
  "query": {
    "multi_match" : {
      "query":    "lucene java", 
      "fields": [ "title", "content" ] 
    }
  }
}

GET ftq/_search?explain=true
{
  "query": {
    "multi_match" : {
      "query":    "lucene elastic", 
      "fields": [ "title^5", "content" ]   // boost the relevance score of the title field
    }
  }
}

GET ftq/_search
{
  "query": {
    "multi_match" : {
      "query":    "lucene java", 
      "fields": [ "title", "cont*" ] 
    }
  }
}

Common terms query

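common: the common terms query

Question 1: What are stop words? What is the purpose of removing stop words at index time?

Question 2: If stop words are removed at index time, which terms will the following two queries search for?
the brown fox
not happy

Question 3: Does removing stop words at index time affect search precision? What is the effect of not removing them? How can the two concerns be reconciled, so that searches stay precise without giving up performance?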

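A short introduction to the tf-idf relevance model

tf: term frequency, how often a term occurs in one document.

If "World Cup" occurs 3 times in document A, we could define its term frequency in A as 3. But which is more about the World Cup: a 3,000-word article that mentions it 3 times, or a 150-word article that mentions it 3 times? A raw count is not an accurate frequency, so we express it as a proportion instead: the number of occurrences of the term divided by the total number of words in the document.

Question: does a larger tf always mean the term is more relevant?

Note: tf does not have to be computed this way; other definitions are possible.

df: document frequency, the number of documents that contain the term. The larger the df, the more common the term. Which terms tend to be high-frequency terms?

Question 1: does a larger df mean the term is more important or less important within the document collection?

Question 2: if term t has a high tf and is also important within the collection, does that mean the document is more relevant to the term? Example: only 3 documents in the whole collection contain "World Cup", and document A mentions "World Cup" several times.

Question 3: how can the importance of term t within the collection be expressed as a number? Would df do?

Would the total number of documents divided by df do?

idf: inverse document frequency, which expresses the importance of a term within the document collection. It starts from the total number of documents divided by df: the smaller the df, the more important the term. Because this ratio can become very large, we take its natural logarithm to map it into a smaller range.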

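Note: the +1 added to df in the denominator avoids division by zero (the case where term t does not occur in the collection at all).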
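
Written out, the two quantities described above take the following form. The symbols are assumptions introduced here: n_{t,d} is the number of occurrences of term t in document d, N is the total number of documents, and df(t) is the number of documents containing t. Other tf and idf definitions exist; this is one formulation consistent with the text.

```latex
\mathrm{tf}(t, d) = \frac{n_{t,d}}{\sum_{k} n_{k,d}}
\qquad
\mathrm{idf}(t) = \ln \frac{N}{\mathrm{df}(t) + 1}
```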

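The tf-idf relevance model: the relevance of a term to a document is the product of the term's tf in that document and its idf in the collection.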
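
A minimal Python sketch of this model follows, assuming documents are already tokenized into lists of words and using the proportional tf and the +1 idf described above. The function names and the tiny corpus are made up for the example; Lucene's actual scoring formulas differ in detail.

```python
import math


def tf(term, doc_tokens):
    """Share of the document's tokens that are `term`."""
    return doc_tokens.count(term) / len(doc_tokens)


def idf(term, corpus):
    """ln(N / (df + 1)), where df is the number of documents containing `term`."""
    df = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / (df + 1))


def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)


corpus = [
    ["world", "cup", "final"],
    ["cup", "of", "tea"],
    ["search", "engine"],
    ["lucene", "search"],
]
score = tf_idf("world", corpus[0], corpus)
```

Here "world" occurs in one of four documents, so its idf is ln(4/2), and it makes up a third of the first document, giving a score of ln(2)/3.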

Common terms query

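The common terms query separates common (high-frequency) terms from the rest. Through cutoff_frequency we set a document-frequency threshold that splits the terms of the search text into high-frequency and low-frequency terms; low-frequency terms are more important than high-frequency ones. The query first searches the low-frequency terms and scores all matching documents. It then searches the high-frequency terms, which match a great many documents, but scores only the documents that also matched the low-frequency terms (this keeps the search precise while greatly improving performance) and adds that score to the low-frequency score to get the document score. The search actually executed is: must contain the low-frequency terms, and may contain the high-frequency terms.

Think about it: with this approach, what happens if the user enters only high-frequency terms, such as "to be or not to be"? What would you want to happen?

Optimization: if all terms are high-frequency, run an and query over them.

Further optimization: let the user choose and/or for the high-frequency terms and and/or for the low-frequency terms, or specify the minimum number of terms that must match.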
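
The strategy just described can be sketched in Python. The helpers split_terms and to_bool_query, and the document frequencies, are invented for the illustration; the bool structure is an approximation of what the common terms query does, not its actual rewrite.

```python
def split_terms(terms, doc_freq, total_docs, cutoff_frequency):
    """Split query terms into (low_freq, high_freq) lists."""
    # A cutoff of 1 or more is an absolute document count; below 1 it is a fraction.
    if cutoff_frequency >= 1:
        threshold = cutoff_frequency
    else:
        threshold = cutoff_frequency * total_docs
    low = [t for t in terms if doc_freq.get(t, 0) <= threshold]
    high = [t for t in terms if doc_freq.get(t, 0) > threshold]
    return low, high


def to_bool_query(field, low, high):
    """Low-frequency terms decide what matches; high-frequency terms only add score."""
    def clauses(terms):
        return [{"term": {field: t}} for t in terms]

    if not low:
        # Only high-frequency terms: fall back to requiring all of them.
        return {"bool": {"must": clauses(high)}}
    return {"bool": {"must": [{"bool": {"should": clauses(low)}}],
                     "should": clauses(high)}}


doc_freq = {"this": 9000, "is": 9500, "bonsai": 3, "cool": 7}  # made-up frequencies
low, high = split_terms("this is bonsai cool".split(), doc_freq, 10000, 0.001)
query = to_bool_query("message", low, high)
```

With 10,000 documents and a cutoff of 0.001, terms found in more than 10 documents count as high-frequency, so bonsai and cool drive the matching while this and is only contribute to the score.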

Common terms query

GET /_search
{
    "query": {
        "common": {
            "message": {
                "query": "this is bonsai cool",
                "cutoff_frequency": 0.001
            }
        }
    }
}
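cutoff_frequency: a value of 1 or more is an absolute number of documents; a value between 0 and 1 is a fraction of all documents.
Here, terms with a document frequency above 0.1% are treated as high-frequency terms.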

GET /_search
{
    "query": {
        "common": {
            "body": {
                "query": "nelly the elephant as a cartoon",
                "cutoff_frequency": 0.001,
                "low_freq_operator": "and"
            }
        }
    }
}

Available parameters: minimum_should_match (high_freq, low_freq), low_freq_operator (default "or"), high_freq_operator (default "or"), boost, and analyzer.

GET /_search
{
    "query": {
        "common": {
            "body": {
                "query": "nelly the elephant as a cartoon",
                "cutoff_frequency": 0.001,
                "minimum_should_match": 2
            }
        }
    }
}

GET /_search
{
    "query": {
        "common": {
            "body": {
                "query": "nelly the elephant not as a cartoon",
                "cutoff_frequency": 0.001,
                "minimum_should_match": {
                    "low_freq" : 2,
                    "high_freq" : 3
                }
            }
        }
    }
}

GET /_search
{
    "query": {
        "common": {
            "body": {
                "query": "how not to be",
                "cutoff_frequency": 0.001,
                "minimum_should_match": {
                    "low_freq" : 2,
                    "high_freq" : 3
                }
            }
        }
    }
}
The query above is roughly equivalent to the following query:

GET /_search
{
    "query": {
        "bool": {
            "should": [
            { "term": { "body": "how"}},
            { "term": { "body": "not"}},
            { "term": { "body": "to"}},
            { "term": { "body": "be"}}
            ],
            "minimum_should_match": "3<50%"
        }
    }
}

Query string query

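The query_string query lets us write a query string directly in Lucene query syntax. When ES receives the request, a query parser parses the string and builds the corresponding query. Using it requires knowing the Lucene query syntax.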

GET /_search
{
    "query": {
        "query_string" : {
            "default_field" : "content",
            "query" : "this AND that OR thus"
        }
    }
}

GET /_search
{
    "query": {
        "query_string" : {
            "fields" : ["content", "name.*^5"],
            "query" : "this AND that OR thus"
        }
    }
}

For the parameters that can accompany query, such as default_field and fields, and for the syntax of the query string, see:

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html

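Query syntax rules (query parser syntax):

Term:

A single term: computer
A phrase: "Lenovo laptop"

Field:

field name:
Example: name:"Lenovo laptop" AND type:computer
If name is the default field, this can be written as: "Lenovo laptop" AND type:computer
If the query string is: type:computer calculator phone
Note: only the first term is a value of type; the other two are searched in the default field.

Term modifiers:

Wildcards:

? matches a single character
* matches zero or more characters
Examples: te?t test* te*t
Note: in classic Lucene syntax a wildcard cannot be the first character of a term. The query_string query in Elasticsearch allows it by default (allow_leading_wildcard), but leading wildcards are expensive.

Fuzzy search: append ~ to a term

Example: roam~
Fuzzy search allows an edit distance of at most 2.
Example: roam~1

Regular expressions: /xxxx/

Example: /[mb]oat/

Proximity search: append ~ and a slop value to a phrase

Example: "jakarta apache"~10

Range search:

mod_date:[20020101 TO 20030101] includes the bounds
title:{Aida TO Carmen} excludes the bounds

Term boosting: makes a term count more toward relevance. Append ^ and a number to set the boost factor; the default boost factor is 1.

Example: to search for articles containing jakarta apache, with jakarta being more relevant:
jakarta^4 apache
Phrases can be boosted too: "jakarta apache"^4 "Apache Lucene"

Boolean operators: Lucene supports AND, "+", OR, NOT, and "-".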

Simple Query string query

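The simple_query_string query, like query_string, takes a query string, but it differs in two ways: it uses a smaller, simpler syntax, and when the query string contains errors it ignores the invalid parts instead of raising an error. That makes it better suited to input typed by end users.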

GET /_search
{
  "query": {
    "simple_query_string" : {
        "query": "\"fried eggs\" +(eggplant | potato) -frittata",
        "fields": ["title^5", "body"],
        "default_operator": "and"
    }
  }
}

For the syntax, see: https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-simple-query-string-query.html

Term level queries

https://www.elastic.co/guide/en/elasticsearch/reference/current/term-level-queries.html

Term query

The term query finds documents in which the given field contains a given term.

POST _search
{
  "query": {
    "term" : { "user" : "Kimchy" } 
  }
}

GET _search
{
  "query": {
    "bool": {
      "should": [
        {
          "term": {
            "status": {
              "value": "urgent",
              "boost": 2.0                  // boost this clause
            }
          }
        },
        {
          "term": {
            "status": "normal" 
          }
        }
      ]
    }
  }
}

The terms query finds documents in which the given field contains any of several terms.

GET /_search
{
    "query": {
        "terms" : { "user" : ["kimchy", "elasticsearch"]}
    }
}

The terms query can also fetch its terms from a field of another document (a terms lookup), comparable to in (select term from other) in SQL.

Terms query lookup example

PUT /users/_doc/2
{
    "followers" : ["1", "3"]
}

PUT /tweets/_doc/1
{
    "user" : "1"
}

GET /tweets/_search
{
    "query" : {
        "terms" : {
            "user" : {
                "index" : "users",
                "type" : "_doc",
                "id" : "2",
                "path" : "followers"
            }
        }
    }
}

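Parameters of the terms lookup, as used above: index (the index to fetch the terms from), type (the mapping type of the source document), id (the id of the source document), and path (the field that holds the terms).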

range query

GET _search
{
    "query": {
        "range" : {
            "age" : {
                "gte" : 10,
                "lte" : 20,
                "boost" : 2.0
            }
        }
    }
}

GET _search
{
    "query": {
        "range" : {
            "date" : {
                "gte" : "now-1d/d",
                "lt" :  "now/d"
            }
        }
    }
}

GET _search
{
    "query": {
        "range" : {
            "born" : {
                "gte": "01/01/2012",
                "lte": "2013",
                "format": "dd/MM/yyyy||yyyy"
            }
        }
    }
}

For date rounding, the || separator, and the other date math rules, see:

https://www.elastic.co/guide/en/elasticsearch/reference/current/common-options.html#date-math
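
To see what the date math in the second range example amounts to, the following Python sketch computes the two bounds for a given "now". It assumes the documented rounding behavior that /d rounds down to the start of the day for both gte and lt, and it ignores time zones; the helper names are made up.

```python
from datetime import datetime, timedelta


def start_of_day(ts):
    """Round a timestamp down to midnight, like /d does for gte and lt."""
    return ts.replace(hour=0, minute=0, second=0, microsecond=0)


def yesterday_range(now):
    gte = start_of_day(now - timedelta(days=1))  # now-1d/d
    lt = start_of_day(now)                       # now/d
    return gte, lt


gte, lt = yesterday_range(datetime(2019, 11, 11, 15, 30))
```

For a request sent on the afternoon of 2019-11-11, the range covers all of 2019-11-10 and nothing from the current day.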

exists query

Finds documents in which the given field has a non-null value, comparable to column is not null in SQL.

GET /_search
{
    "query": {
        "exists" : { "field" : "user" }
    }
}

GET /_search
{
    // finds documents in which the given field has no value
    "query": {
        "bool": {
            "must_not": {
                "exists": {
                    "field": "user"
                }
            }
        }
    }
}

prefix query: matches terms that start with a given prefix

GET /_search
{ "query": {
    "prefix" : { "user" : "ki" }
  }
}

GET /_search
{ "query": {
    "prefix" : { "user" :  { "value" : "ki", "boost" : 2.0 } }
  }
}

wildcard query: matches terms against a pattern using ? and *

GET /_search
{
    "query": {
        "wildcard" : { "user" : "ki*y" }
    }
}

GET /_search
{
  "query": {
    "wildcard": {
      "user": {
        "value": "ki*y",
        "boost": 2
      }
    }
  }
}
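
The matching rule of the pattern ki*y can be tried out in Python with the standard fnmatch module, where * also matches zero or more characters and ? exactly one. This is only an analogy: fnmatch additionally understands [seq] character sets, and the term list is invented.

```python
from fnmatch import fnmatchcase

terms = ["kimchy", "kiy", "kitty", "kimchi", "key"]
matches = [t for t in terms if fnmatchcase(t, "ki*y")]
```

kimchy, kiy, and kitty match; kimchi does not end in y and key does not start with ki.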

regexp query: matches terms against a regular expression

GET /_search
{
    "query": {
        "regexp":{
            "name.first": "s.*y"
        }
    }
}

GET /_search
{
    "query": {
        "regexp":{
            "name.first":{
                "value":"s.*y",
                "boost":1.2
            }
        }
    }
}

For the regular expression syntax, see:

https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-regexp-query.html#regexp-syntax

fuzzy query: matches terms within an edit distance of the given term

GET /_search
{
    "query": {
       "fuzzy" : { "user" : "ki" }
    }
}

GET /_search
{
    "query": {
        "fuzzy" : {
            "user" : {
                "value": "ki",
                "boost": 1.0,
                "fuzziness": 2,
                "prefix_length": 0,
                "max_expansions": 100
            }
        }
    }
}

type query: matches documents of a given mapping type

GET /_search
{
    "query": {
        "type" : {
            "value" : "_doc"
        }
    }
}

ids query: matches documents by their ids

GET /_search
{
    "query": {
        "ids" : {
            "type" : "_doc",
            "values" : ["1", "4", "100"]
        }
    }
}

Compound queries

https://www.elastic.co/guide/en/elasticsearch/reference/current/compound-queries.html

Constant Score query

Wraps another query and gives every document it matches the same constant score, equal to the query boost.

GET /_search
{
    "query": {
        "constant_score" : {
            "filter" : {
                "term" : { "user" : "kimchy"}
            },
            "boost" : 1.2
        }
    }
}

Bool query

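The bool query combines multiple query clauses into one query with boolean logic. The available keywords are must, filter, should, and must_not: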

POST _search
{
  "query": {
    "bool" : {
      "must" : {
        "term" : { "user" : "kimchy" }
      },
      "filter": {
        "term" : { "tag" : "tech" }
      },
      "must_not" : {
        "range" : {
          "age" : { "gte" : 10, "lte" : 20 }
        }
      },
      "should" : [
        { "term" : { "tag" : "wow" } },
        { "term" : { "tag" : "elasticsearch" } }
      ],
      "minimum_should_match" : 1,
      "boost" : 1.0
    }
  }
}
