Getting Started with ES: Term-Based Queries

Date: 2020-10-02

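Preparation

To begin with, note that the ES version used here is 5.2.0.

To make things easier to follow, every example in this article is based on an index with the format shown below. The documents are netflow traffic records captured by PMACCT.

The material here is introductory ES content, focused on term-level filtering; treat it as a basics piece.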

{
    "_index": "shflows_agg_1600358400",
    "_type": "shflows_agg",
    "_id": "node1_1600359600_0_172698718_shflows_agg_0",
    "_score": 1.0,
    "_source": {
        "collector": "node1",
        "src_port": "443",
        "timestamp": 1600359600,
        "device_ip": "1.1.1.1",
        "flows": "40",
        "dst_host": "2.2.2.2",
        "TAG": 10001,
        "router_ip": 172698718,
        "dst_port": "16384",
        "pkts": 40000,
        "bits": 320000000000,
        "src_host": "3.3.3.3"
    }
}

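Before getting to search itself, one concept needs to be pinned down. Many people learning ES queries confuse term-level queries with full-text queries.

First, a term is the smallest unit that carries meaning; both search and statistical language models operate on terms.

In ES, the input to a term-level query is not analyzed at all. The input is treated as a single unit and matched against the terms in the inverted index, and the results are returned ranked by the scoring formula. A term-level query can also be wrapped in Constant Score to turn it into a filter, which skips scoring and can use the cache, improving performance.

Although the query input is not analyzed, the value of a text field is analyzed when the document is indexed. This mismatch sometimes means a search returns nothing. For example, given a doc whose name is 'Jack', a term query for Jack finds no match, because the indexed term is the lowercase jack. The query has to use lowercase jack, or run against the keyword sub-field instead (name.keyword under the default dynamic mapping).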
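The Jack versus jack mismatch can be reproduced without a cluster. The following is a minimal sketch, not how ES is implemented: analyze is a rough stand-in for an analyzer (split on non-alphanumerics, then lowercase), and all function names here are made up for illustration.

```python
import re
from collections import defaultdict


def analyze(text):
    """Rough stand-in for an analyzer: split on non-alphanumerics, lowercase."""
    return [tok.lower() for tok in re.findall(r"[A-Za-z0-9]+", text)]


def build_inverted_index(docs, analyzed=True):
    """Map each indexed term to the set of doc ids that contain it."""
    index = defaultdict(set)
    for doc_id, value in docs.items():
        # An analyzed (text-like) field stores tokens; an unanalyzed
        # (keyword-like) field stores the whole value as a single term.
        terms = analyze(value) if analyzed else [value]
        for term in terms:
            index[term].add(doc_id)
    return index


def term_query(index, value):
    """Term-level lookup: the input is used verbatim, with no analysis."""
    return index.get(value, set())


docs = {1: "Jack", 2: "Jack Smith"}
text_index = build_inverted_index(docs, analyzed=True)
keyword_index = build_inverted_index(docs, analyzed=False)

print(term_query(text_index, "Jack"))     # set(): only "jack" was indexed
print(term_query(text_index, "jack"))     # {1, 2}
print(term_query(keyword_index, "Jack"))  # {1}: exact, unanalyzed value
```

The lookup itself never changes; what differs is which terms were written into the index, which is why the same input can match one field and miss another.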

Term-level queries include:

Term Query
Range Query
Exists Query
Prefix Query
Wildcard Query

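Full-text queries, by contrast, operate on analyzed text.

For full-text queries, analysis happens both at index time (input) and at search time. The query string is first passed through the appropriate analyzer, which produces the list of terms to search for.

Full-text queries include: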

Match Query
Match Phrase Query
Query String Query

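All of the examples below are term-level queries.

ES Search Overview

The ES search API falls into two broad categories:

URL parameter search, suited to simple searches.
Request Body search (DSL), suited to more complex searches.

The request path determines which indices are searched:

/_search: all indices in the cluster

/index1/_search: the index1 index

/index1,index2/_search: the index1 and index2 indices

/index*/_search: all indices whose names start with index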

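URL Query

Querying a specific field:

Pass the query in the q parameter as a field:value pair.

Example 1: find the documents whose device IP is 1.1.1.1:

/shflows_agg_*/_search?q=device_ip:1.1.1.1

{
    "profile": "true"
}

Setting profile shows how the query was executed.

Result: the type is TermQuery, and the search ran against the specified field: "device_ip:1.1.1.1"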

"profile": {
        "shards": [
            {
                "id": "[e_Ac3cNJRtmVxFW9DwOwjA][shflows_agg_1600531200][0]",
                "searches": [
                    {
                        "query": [
                            {
                                "type": "TermQuery",
                                "description": "device_ip:1.1.1.1",
                                "time": "445.8407320ms",
............

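Querying Across All Fields

When only a value is given and no key, the value is matched against every field in the document.

Example 2: find the documents in which any field contains 1.1.1.1. For instance, documents where src_host or dst_host is 1.1.1.1 are returned as well.

/shflows_agg_*/_search?q=1.1.1.1

{
    "profile": "true"
}

Result: the description now targets _all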

"profile": {
        "shards": [
            {
                "id": "[e_Ac3cNJRtmVxFW9DwOwjA][shflows_agg_1600531200][0]",
                "searches": [
                    {
                        "query": [
                            {
                                "type": "TermQuery",
                                "description": "_all:1.1.1.1",
 ......
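When these URL searches are issued from a script rather than typed by hand, the q parameter should be URL-encoded. A minimal sketch of building both forms follows; uri_search_path is a hypothetical helper name, and it only builds the path and query string, it does not send a request.

```python
from urllib.parse import urlencode


def uri_search_path(index_pattern, value, field=None):
    """Build the path and query string for a URL search.

    With a field, q is "field:value". Without one, q is just the value,
    which the profile output above shows being matched against _all.
    """
    q = f"{field}:{value}" if field else value
    return f"/{index_pattern}/_search?" + urlencode({"q": q})


print(uri_search_path("shflows_agg_*", "1.1.1.1", field="device_ip"))
# /shflows_agg_*/_search?q=device_ip%3A1.1.1.1
print(uri_search_path("shflows_agg_*", "1.1.1.1"))
# /shflows_agg_*/_search?q=1.1.1.1
```

The colon is percent-encoded as %3A, which the server decodes back to device_ip:1.1.1.1.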

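DSL Query

Approach: write JSON in the request body to express more complex queries.

Querying all documents

Example 1: query every document in the current index (match_all returns all docs):

/shflows_agg_index1/_search

{
    "query": {
        "match_all": {}
    }
}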

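Sorting and paginating documents

Example 2: query every document in the current index, sorted by timestamp in descending order; from skips the first 10 hits and size returns the next 20:

/shflows_agg_index1/_search

{
    "from": 10,
    "size": 20,
    "sort": [{"timestamp": "desc"}],
    "query": {
        "match_all": {}
    }
}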
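For page-by-page browsing, from is usually derived from a page number. A minimal sketch, assuming 1-based page numbers; page_body is a hypothetical helper, not part of any ES client.

```python
def page_body(page, page_size, sort_field="timestamp", order="desc"):
    """Return a match_all request body for a 1-based page number."""
    if page < 1 or page_size < 1:
        raise ValueError("page and page_size must be positive")
    return {
        "from": (page - 1) * page_size,  # number of hits to skip
        "size": page_size,               # number of hits to return
        "sort": [{sort_field: order}],
        "query": {"match_all": {}},
    }


print(page_body(1, 20)["from"])  # 0
print(page_body(3, 20)["from"])  # 40
```

Keep in mind that from + size is capped by the index.max_result_window setting (10000 by default), so this style of paging is not meant for walking through an entire large index.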

Selecting which fields are returned

Example: return only the listed fields of each document (match_all still matches all docs):

/shflows_agg_index1/_search

{
    "_source": ["timestamp", "device_ip"],
    "query": {
        "match_all": {}
    }
}

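Using script fields to compute over several values in a document

Example: concatenate the device IP (device_ip) and destination port (dst_port) of each document and return the result under the name ip_address: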

/shflows_agg_index1/_search

{
    "script_fields": {
        "ip_address":{
            "script": {
                "lang": "painless",
                "inline": "params.comment + doc['device_ip'].value + ':' + doc['dst_port'].value",
                "params" : {
                    "comment" : "ip address is: " 
                }
            }
        }
    },
    "query": {
        "match_all": {} 
    }
}

Result: a new field holding the script's concatenated output appears under fields

{
    "took": 84,
    "timed_out": false,
    "_shards": {
        "total": 10,
        "successful": 10,
        "failed": 0
    },
    "hits": {
        "total": 36248845,
        "max_score": 1.0,
        "hits": [
            {
                "_index": "shflows_agg_1600358400",
                "_type": "shflows_agg",
                "_id": "node1_1600359600_0_172698718_shflows_agg_0",
                "_score": 1.0,
                "fields": {
                    "ip_address": [
                        "ip address is: 1.1.1.1:16384"
                    ]
                }
            },
            {
                "_index": "shflows_agg_1600358400",
                "_type": "shflows_agg",
                "_id": "node1_1600359600_0_172698718_shflows_agg_5",
                "_score": 1.0,
                "fields": {
                    "ip_address": [
                        "ip address is: 1.1.1.1:443"
                    ]
                }
            },
.......

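Query Context vs. Filter Context

An ES search runs in one of two contexts, Query or Filter:

Query context: relevance scores are computed during the search
Filter context: no scoring is needed, so results can be cached for better performance

Both contexts support:

Exact-value queries (term)
Range queries (range)

Example: query the docs whose dst_port is 443, with scoring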

/shflows_agg_index1/_search

{
    "profile": "true",
    "explain": true,
    "query": {
        "term": {"dst_port": 443}
    }
}

Result:

{
    "took": 191,
    "timed_out": false,
    "_shards": {
        "total": 11,
        "successful": 11,
        "failed": 0
    },
    "hits": {
        "total": 3871488,
        "max_score": 2.2973032,
        "hits": [
            {
                "_shard": "[shflows_agg_1600358400][0]",
                "_node": "RWTixYPtTieZaRgAH0NOkQ",
                "_index": "shflows_agg_1600358400",
                "_type": "shflows_agg",
                "_id": "node1_1600359600_0_172698718_shflows_agg_5",
                "_score": 2.2973032,  # a computed relevance score appears here
                "_source": {
                    "collector": "node1",
                    "src_port": "16384",
                    "timestamp": 1600359600,

The same query as a filter (wrapping it in constant_score skips the scoring step):

/shflows_agg_index1/_search

{
  "profile": "true",
  "explain": true,
  "query": {
    "constant_score": {
      "filter": {
        "term": {
          "dst_port": 443
        }
      }
    }
  }
}

Result:

"hits": {
        "total": 3872768,
        "max_score": 1.0, # 1.0 is a fixed value
.....

"profile": {
        "shards": [
            {
                "id": "[e_Ac3cNJRtmVxFW9DwOwjA][shflows_agg_1600531200][0]",
                "searches": [
                    {
                        "query": [
                            {
                                "type": "ConstantScoreQuery", # constant-score query
                                "description": "ConstantScore(dst_port:443)",

Example: a terms query for the docs whose dst_port is 443 or 22 (constant_score again skips scoring)

/shflows_agg_index1/_search

{
  "profile": "true",
  "explain": true,
  "query": {
    "constant_score": {
      "filter": {
        "terms": {
          "dst_port": [443, 22]
        }
      }
    }
  }
}

Example: a numeric range query (gte means greater than or equal to, lte means less than or equal to)

{
    "profile": "true",
    "explain": true,
    "query": {
        "constant_score": {
            "filter": {
                "range": {
                    "timestamp": {
                        "gte": 1601049600,
                        "lte": 1601308800
                    }
                }
            }
        }
    }
}
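The three filter examples above share one shape: a term, terms, or range clause wrapped in constant_score. A minimal sketch of generating those bodies follows; the helper names are made up for illustration, and the functions only build dictionaries, they do not talk to a cluster.

```python
import json


def term(field, value):
    """Exact-value clause on a single value."""
    return {"term": {field: value}}


def terms(field, values):
    """Exact-value clause that matches any of several values."""
    return {"terms": {field: list(values)}}


def range_(field, gte=None, lte=None):
    """Inclusive range clause; a bound left as None is omitted."""
    bounds = {}
    if gte is not None:
        bounds["gte"] = gte
    if lte is not None:
        bounds["lte"] = lte
    return {"range": {field: bounds}}


def constant_score(filter_clause):
    """Wrap a clause so it runs as a filter with a fixed score."""
    return {"query": {"constant_score": {"filter": filter_clause}}}


body = constant_score(range_("timestamp", gte=1601049600, lte=1601308800))
print(json.dumps(body, indent=2))
```

Passing term("dst_port", 443) or terms("dst_port", [443, 22]) to constant_score gives the other two request bodies shown earlier, minus the profile and explain flags.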

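Bool Compound Query: Filtering on Multiple Conditions

In ES, a bool query combines or nests one or more query clauses to express more complex queries.

A bool query has 4 clause types:

must: results must match; contributes to the score
should: optional match, similar to OR; satisfying one condition is enough; contributes to the score
must_not: results must not match; runs in filter context and does not contribute to the score
filter: results must match; runs in filter context and does not contribute to the score

must_not and filter perform better because they need no scoring.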

Example: find the docs whose timestamp is between 1601171628 and 1601175228, whose destination port is 80, and whose source IP (src_host) is any one of [1.1.1.1, 1.1.1.2, 1.1.1.3]. Note that the should clause sits in a nested bool inside the must array; if it were placed at the same level as must, it could not affect the must result.

{
    "profile": "true",
    "explain": true,
    "query": {
        "bool": {
            "must": [
                {
                    "range": {
                        "timestamp": {
                            "gte": 1601171628,
                            "lte": 1601175228
                        }
                    }
                },
                {
                    "term": {
                        "dst_port": 80
                    }
                },
                {
                    "bool": {
                        "should": [
                            {
                                "term": {
                                    "src_host": "1.1.1.1"
                                }
                            },
                            {
                                "term": {
                                    "src_host": "1.1.1.2"
                                }
                            },
                            {
                                "term": {
                                    "src_host": "1.1.1.3"
                                }
                            }
                        ]
                    }
                }
            ]
        }
    }
}
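Why the should clause has to be nested can be checked with a toy evaluator. This is a minimal sketch covering only term, range, and bool with must and should; it is not ES code. It assumes query-context behavior, where should clauses become optional as soon as the same bool has a must clause, and it compares term values as strings as a simplification. The test document is hypothetical.

```python
def matches(query, doc):
    """Evaluate a tiny subset of the query DSL against one document."""
    kind, body = next(iter(query.items()))
    if kind == "term":
        field, value = next(iter(body.items()))
        return str(doc.get(field)) == str(value)
    if kind == "range":
        field, bounds = next(iter(body.items()))
        v = doc[field]
        return bounds.get("gte", v) <= v <= bounds.get("lte", v)
    if kind == "bool":
        must = body.get("must", [])
        should = body.get("should", [])
        if not all(matches(q, doc) for q in must):
            return False
        # Without must clauses, at least one should clause has to match;
        # next to must clauses, should is optional (assumed here).
        if should and not must:
            return any(matches(q, doc) for q in should)
        return True
    raise ValueError(f"unsupported query type: {kind}")


must_clauses = [
    {"range": {"timestamp": {"gte": 1601171628, "lte": 1601175228}}},
    {"term": {"dst_port": 80}},
]
ip_should = [{"term": {"src_host": ip}} for ip in ("1.1.1.1", "1.1.1.2", "1.1.1.3")]

nested = {"bool": {"must": must_clauses + [{"bool": {"should": ip_should}}]}}
sibling = {"bool": {"must": must_clauses, "should": ip_should}}

# Hypothetical doc: right time window and port, but a src_host not in the list.
doc = {"timestamp": 1601172000, "dst_port": "80", "src_host": "9.9.9.9"}

print(matches(nested, doc))   # False: the nested should acts as a required OR
print(matches(sibling, doc))  # True: the sibling should does not restrict hits
```

In the sibling form the should clauses would only raise the score of documents from the listed hosts; they would not exclude the others.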
