ES Series, Part 13: Elasticsearch Suggester API (Autocomplete)

Date: 2019-11-07


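1. Concepts

1. The suggester API comes in four main types:

Term Suggester (spelling correction: suggests the correct word when the input is misspelled)
Phrase Suggester (phrase correction: builds on the term suggester and corrects a whole phrase)
Completion Suggester (word completion: type the first part and the rest of the word is completed)
Context Suggester (context-aware completion)

The overall effect is similar to the search suggestions in Baidu's search box.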

2. Term Suggester (spelling correction)

2.1. API

1. Create the index

PUT /book4
{
  "mappings": {
    "english": {
      "properties": {
        "passage": {
          "type": "text"
        }
      }
    }
  }
}

2. Insert data

curl -H "Content-Type: application/json" -XPOST 'http://localhost:9200/_bulk' --data-binary '
{ "index" : { "_index" : "book4", "_type" : "english" } }
{ "passage": "Lucene is cool"}
{ "index" : { "_index" : "book4", "_type" : "english" } }
{ "passage": "Elasticsearch builds on top of lucene"}
{ "index" : { "_index" : "book4", "_type" : "english" } }
{ "passage": "Elasticsearch rocks"}
{ "index" : { "_index" : "book4", "_type" : "english" } }
{ "passage": "Elastic is the company behind ELK stack"}
{ "index" : { "_index" : "book4", "_type" : "english" } }
{ "passage": "elk rocks"}
{ "index" : { "_index" : "book4", "_type" : "english" } }
{  "passage": "elasticsearch is rock solid"}
'
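
The same bulk body can also be generated programmatically. The following is a minimal sketch using only the Python standard library; the helper name `build_bulk_body` is hypothetical, and actually sending the payload to the cluster is left out.

```python
import json


def build_bulk_body(index, doc_type, docs):
    """Return an NDJSON bulk body: one action line plus one source line per doc."""
    lines = []
    for doc in docs:
        lines.append(json.dumps({"index": {"_index": index, "_type": doc_type}}))
        lines.append(json.dumps(doc))
    # The bulk API expects every line, including the last, to end with a newline.
    return "\n".join(lines) + "\n"


passages = [
    "Lucene is cool",
    "Elasticsearch builds on top of lucene",
    "Elasticsearch rocks",
    "Elastic is the company behind ELK stack",
    "elk rocks",
    "elasticsearch is rock solid",
]
body = build_bulk_body("book4", "english", [{"passage": p} for p in passages])
print(body)
```

Because the format is newline-delimited, the payload has to reach the server with its newlines intact.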

3. Check which tokens are stored

POST /_analyze
{
  "text": [
    "Lucene is cool",
    "Elasticsearch builds on top of lucene",
    "Elasticsearch rocks",
    "Elastic is the company behind ELK stack",
    "elk rocks",
    "elasticsearch is rock solid"
  ]
}

Result:

{
    "tokens": [
        {
            "token": "lucene",
            "start_offset": 0,
            "end_offset": 6,
            "type": "<ALPHANUM>",
            "position": 0
        },
        {
            "token": "is",
            "start_offset": 7,
            "end_offset": 9,
            "type": "<ALPHANUM>",
            "position": 1
        },
        {
            "token": "cool",
            "start_offset": 10,
            "end_offset": 14,
            "type": "<ALPHANUM>",
            "position": 2
        },
        {
            "token": "elasticsearch",
            "start_offset": 15,
            "end_offset": 28,
            "type": "<ALPHANUM>",
            "position": 103
        },
        {
            "token": "builds",
            "start_offset": 29,
            "end_offset": 35,
            "type": "<ALPHANUM>",
            "position": 104
        },
        {
            "token": "on",
            "start_offset": 36,
            "end_offset": 38,
            "type": "<ALPHANUM>",
            "position": 105
        },
        {
            "token": "top",
            "start_offset": 39,
            "end_offset": 42,
            "type": "<ALPHANUM>",
            "position": 106
        },
        {
            "token": "of",
            "start_offset": 43,
            "end_offset": 45,
            "type": "<ALPHANUM>",
            "position": 107
        },
        {
            "token": "lucene",
            "start_offset": 46,
            "end_offset": 52,
            "type": "<ALPHANUM>",
            "position": 108
        },
        {
            "token": "elasticsearch",
            "start_offset": 53,
            "end_offset": 66,
            "type": "<ALPHANUM>",
            "position": 209
        },
        {
            "token": "rocks",
            "start_offset": 67,
            "end_offset": 72,
            "type": "<ALPHANUM>",
            "position": 210
        },
        {
            "token": "elastic",
            "start_offset": 73,
            "end_offset": 80,
            "type": "<ALPHANUM>",
            "position": 311
        },
        {
            "token": "is",
            "start_offset": 81,
            "end_offset": 83,
            "type": "<ALPHANUM>",
            "position": 312
        },
        {
            "token": "the",
            "start_offset": 84,
            "end_offset": 87,
            "type": "<ALPHANUM>",
            "position": 313
        },
        {
            "token": "company",
            "start_offset": 88,
            "end_offset": 95,
            "type": "<ALPHANUM>",
            "position": 314
        },
        {
            "token": "behind",
            "start_offset": 96,
            "end_offset": 102,
            "type": "<ALPHANUM>",
            "position": 315
        },
        {
            "token": "elk",
            "start_offset": 103,
            "end_offset": 106,
            "type": "<ALPHANUM>",
            "position": 316
        },
        {
            "token": "stack",
            "start_offset": 107,
            "end_offset": 112,
            "type": "<ALPHANUM>",
            "position": 317
        },
        {
            "token": "elk",
            "start_offset": 113,
            "end_offset": 116,
            "type": "<ALPHANUM>",
            "position": 418
        },
        {
            "token": "rocks",
            "start_offset": 117,
            "end_offset": 122,
            "type": "<ALPHANUM>",
            "position": 419
        },
        {
            "token": "elasticsearch",
            "start_offset": 123,
            "end_offset": 136,
            "type": "<ALPHANUM>",
            "position": 520
        },
        {
            "token": "is",
            "start_offset": 137,
            "end_offset": 139,
            "type": "<ALPHANUM>",
            "position": 521
        },
        {
            "token": "rock",
            "start_offset": 140,
            "end_offset": 144,
            "type": "<ALPHANUM>",
            "position": 522
        },
        {
            "token": "solid",
            "start_offset": 145,
            "end_offset": 150,
            "type": "<ALPHANUM>",
            "position": 523
        }
    ]
}

4. Term suggest API (single field)

Try a search with the misspelled word Elasticsearaach:

POST /book4/_search
{
    "suggest" : {
    "my-suggestion" : {
      "text" : "Elasticsearaach",
      "term" : {
        "field" : "passage",
        "suggest_mode": "popular"
      }
    }
  }
}

Response:

{
    "took": 26,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 0,
        "max_score": 0,
        "hits": []
    },
    "suggest": {
        "my-suggestion": [
            {
                "text": "elasticsearaach",
                "offset": 0,
                "length": 15,
                "options": [
                    {
                        "text": "elasticsearch",
                        "score": 0.84615386,
                        "freq": 3
                    }
                ]
            }
        ]
    }
}
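
On the client side, a response like this usually needs to be reduced to "which correction should I show". A minimal sketch, assuming the response has already been parsed into a Python dict; the function name `best_corrections` is hypothetical:

```python
def best_corrections(response, suggestion_name):
    """Map each analyzed input token to its top-scoring option, if it has one."""
    corrections = {}
    for entry in response.get("suggest", {}).get(suggestion_name, []):
        options = entry.get("options", [])
        if options:
            best = max(options, key=lambda option: option["score"])
            corrections[entry["text"]] = best["text"]
    return corrections


# Trimmed copy of the response shown above.
sample = {
    "suggest": {
        "my-suggestion": [
            {
                "text": "elasticsearaach",
                "offset": 0,
                "length": 15,
                "options": [
                    {"text": "elasticsearch", "score": 0.84615386, "freq": 3}
                ],
            }
        ]
    }
}
print(best_corrections(sample, "my-suggestion"))
# {'elasticsearaach': 'elasticsearch'}
```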

5. Suggestions for several fields in a single request:

POST _search
{
  "suggest": {
    "my-suggest-1" : {
      "text" : "tring out Elasticsearch",
      "term" : {
        "field" : "message"
      }
    },
    "my-suggest-2" : {
      "text" : "kmichy",
      "term" : {
        "field" : "user"
      }
    }
  }
}

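The term suggester proposes terms based on edit distance. The supplied suggest text is analyzed before any terms are suggested, and suggestions are returned for each token of the analyzed text. The term suggester does not take into account the query that is part of the request.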
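
To make "edit distance" concrete, here is a small sketch of the plain Levenshtein distance in Python. It is a simplification for illustration only: the default `internal` implementation in Elasticsearch is based on Damerau-Levenshtein and is heavily optimized, so this is not the code the suggester actually runs.

```python
def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]


print(levenshtein("elasticsearaach", "elasticsearch"))  # 2
```

The misspelling used in this post is two deletions away from `elasticsearch`, which is within the default `max_edits` of 2. As an observation rather than a documented formula, the score of 0.84615386 in the response equals 1 - 2/13, where 13 is the length of `elasticsearch`.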

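Common suggest options:

`text`	The suggest text. This option is required and must be set either globally or per suggestion.
`field`	The field to fetch candidate suggestions from. This option is required and must be set either globally or per suggestion.
`analyzer`	The analyzer used to analyze the suggest text. Defaults to the search analyzer of the suggest field.
`size`	The maximum number of corrections returned per suggest text token.
`sort`	Defines how suggestions are sorted per suggest text term. Two possible values: `score`: sort by score first, then by document frequency, then by the term itself. `frequency`: sort by document frequency first, then by similarity score, then by the term itself.
`suggest_mode`	Controls which suggestions are included, or for which suggest text terms suggestions are made. Three possible values: `missing`: only provide suggestions for suggest text terms that are not in the index. This is the default. `popular`: only suggest terms that occur in more documents than the original suggest text term. `always`: suggest any matching terms based on the terms in the suggest text.

Other term suggest options:

`lowercase_terms`	Lowercases the suggest text terms after text analysis.
`max_edits`	The maximum edit distance a candidate may have to be considered as a suggestion. Must be 1 or 2; any other value results in a bad request error. Defaults to 2.
`prefix_length`	The minimum number of prefix characters that must match for a term to be a candidate suggestion. Defaults to 1. Increasing this number improves spellcheck performance, since misspellings usually do not occur at the beginning of a term. (The old name "prefix_len" is deprecated.)
`min_word_length`	The minimum length a suggest text term must have to be included. Defaults to 4. (The old name "min_word_len" is deprecated.)
`shard_size`	The maximum number of suggestions retrieved from each individual shard. During the reduce phase only the top N suggestions are returned, based on the `size` option. Defaults to the `size` option. Setting this higher than `size` can be useful to get more accurate document frequencies for spelling corrections, at the cost of performance. Because terms are partitioned among shards, shard-level document frequencies of spelling corrections may be imprecise; increasing this value makes them more precise.
`max_inspections`	A factor that `shard_size` is multiplied by in order to inspect more candidate spelling corrections at the shard level. Can improve accuracy at the cost of performance. Defaults to 5.
`min_doc_freq`	The minimum number of documents a suggestion must appear in. Can be given as an absolute number or as a relative percentage of the number of documents. This can improve quality by suggesting only high-frequency terms. Defaults to 0f, which means it is not enabled. A value greater than 1 cannot be fractional. Shard-level document frequencies are used for this option.
`max_term_freq`	The maximum number of documents a suggest text token may appear in and still be included. Can be a relative percentage (for example 0.4) or an absolute number of documents. A value greater than 1 cannot be fractional. Defaults to 0.01f. This can be used to exclude high-frequency terms from spellchecking; such terms are usually spelled correctly, so this also improves spellcheck performance. Shard-level document frequencies are used for this option.
`string_distance`	The string distance implementation used to compare how similar suggested terms are. Five possible values: `internal`: the default, based on damerau_levenshtein but highly optimized for comparing string distances of terms inside the index. `damerau_levenshtein`: string distance based on the Damerau-Levenshtein algorithm. `levenshtein`: string distance based on the Levenshtein edit distance algorithm. `jaro_winkler`: string distance based on the Jaro-Winkler algorithm. `ngram`: string distance based on character n-grams.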
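
As an illustration of how these options fit into a request, the sketch below assembles a term suggester body in Python and checks two of the documented constraints before anything is sent. The helper `term_suggest_body` is hypothetical and validates only `max_edits` and `suggest_mode`; all other options are passed through unchanged.

```python
import json


def term_suggest_body(name, text, field, **options):
    """Build a term suggester request body, checking a few documented limits."""
    if options.get("max_edits", 2) not in (1, 2):
        raise ValueError("max_edits must be 1 or 2")
    if options.get("suggest_mode", "missing") not in ("missing", "popular", "always"):
        raise ValueError("suggest_mode must be missing, popular or always")
    term = {"field": field}
    term.update(options)
    return {"suggest": {name: {"text": text, "term": term}}}


body = term_suggest_body(
    "my-suggestion", "Elasticsearaach", "passage",
    suggest_mode="popular", max_edits=2, prefix_length=1,
)
print(json.dumps(body, indent=2))
```

Validating on the client is optional, since the server rejects invalid values anyway; it just surfaces mistakes a little earlier.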

See the official API documentation for details.

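2. Phrase Suggester: phrase correction

The phrase suggester builds on the term suggester and additionally considers the relationships between multiple terms, such as whether they occur together in the indexed text, how close they are to each other, and their frequency.

Example 1: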

POST book4/_search
{
  "suggest" : {
    "myss": {
      "text": "Elasticsearch rock",
      "phrase": {
        "field": "passage"
      }
    }
  }
}

Response:

{
    "took": 11,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 0,
        "max_score": 0,
        "hits": []
    },
    "suggest": {
        "myss": [
            {
                "text": "Elasticsearch rock",
                "offset": 0,
                "length": 18,
                "options": [
                    {
                        "text": "elasticsearch rocks",
                        "score": 0.3467123
                    }
                ]
            }
        ]
    }
}

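3. Completion Suggester: autocomplete

This suggester is designed for autocomplete. In that scenario every character the user types triggers an immediate query to the backend to look up matches, so when the user types quickly the backend has to respond very fast. For this reason it uses a different data structure from the two suggesters above: instead of relying on the inverted index, the analyzed data is encoded into an FST (finite state transducer) that is stored alongside the index. For an open index, ES loads the whole FST into memory, so prefix lookups are extremely fast. However, an FST can only be used for prefix lookups, which is also the limitation of the Completion Suggester.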
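
The "prefix only" limitation is easy to see with a toy model. The sketch below is not an FST; it is a sorted list searched with `bisect`, used here only as a stand-in to show what prefix lookup can and cannot match. The class name `PrefixIndex` and the sample inputs are illustrative.

```python
import bisect


class PrefixIndex:
    """Sorted-list stand-in for prefix lookup (a real FST is far more compact)."""

    def __init__(self, inputs):
        self._keys = sorted(set(inputs))

    def lookup(self, prefix, size=5):
        # All keys sharing a prefix are contiguous in sorted order,
        # starting at the first key that is >= the prefix.
        start = bisect.bisect_left(self._keys, prefix)
        matches = []
        for key in self._keys[start:]:
            if not key.startswith(prefix) or len(matches) == size:
                break
            matches.append(key)
        return matches


index = PrefixIndex(["test my book", "test english", "book1 english", "good"])
print(index.lookup("te"))       # ['test english', 'test my book']
print(index.lookup("english"))  # []
```

The second lookup returns nothing even though two inputs contain "english", because matching only ever starts at the beginning of an input. Indexing several inputs per document is the usual way around this.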

1. Create the index

PUT /book5
{
    "mappings": {
        "music" : {
            "properties" : {
                "suggest" : { 
                    "type" : "completion"
                },
                "title" : {
                    "type": "keyword"
                }
            }
        }
    }
}

Insert data:

POST /book5/music
{
    "suggest":"test my book"
}

`input` specifies the input terms and `weight` specifies the ranking weight (optional):

PUT book5/music/5nupmmUBYLvVFwGWH3cu?refresh
{
    "suggest" : {
        "input": [ "test", "book" ],
        "weight" : 34
    }
}

Specify a different weight for each input:

PUT book5/music/6Hu2mmUBYLvVFwGWxXef?refresh
{
    "suggest" : [
        {
            "input": "test",
            "weight" : 10
        },
        {
            "input": "good",
            "weight" : 3
        }
    ]
}

Example 1: query suggestions by prefix

POST book5/_search?pretty
{
    "suggest": {
        "song-suggest" : {
            "prefix" : "te", 
            "completion" : { 
                "field" : "suggest" 
            }
        }
    }
}

{
    "took": 8,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 0,
        "max_score": 0,
        "hits": []
    },
    "suggest": {
        "song-suggest": [
            {
                "text": "te",
                "offset": 0,
                "length": 2,
                "options": [
                    {
                        "text": "test my book1",
                        "_index": "book5",
                        "_type": "music",
                        "_id": "6Xu6mmUBYLvVFwGWpXeL",
                        "_score": 1,
                        "_source": {
                            "suggest": "test my book1"
                        }
                    },
                    {
                        "text": "test my book1",
                        "_index": "book5",
                        "_type": "music",
                        "_id": "6nu8mmUBYLvVFwGWSndC",
                        "_score": 1,
                        "_source": {
                            "suggest": "test my book1"
                        }
                    },
                    {
                        "text": "test my book1 english",
                        "_index": "book5",
                        "_type": "music",
                        "_id": "63u8mmUBYLvVFwGWZHdC",
                        "_score": 1,
                        "_source": {
                            "suggest": "test my book1 english"
                        }
                    }
                ]
            }
        ]
    }
}

Example 2: deduplicate the suggestion results

POST book5/_search?pretty
{
    "suggest": {
        "song-suggest" : {
            "prefix" : "te", 
            "completion" : { 
                "field" : "suggest",
                "skip_duplicates": true
            }
        }
    }
}
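
Elasticsearch applies `skip_duplicates` on the server. To illustrate the effect on a list of options like the one returned in Example 1, here is a client-side sketch in Python. Which of several identical suggestions survives is an assumption of this sketch (it keeps the first), not a statement about what the server does.

```python
def dedupe_options(options):
    """Keep the first option for each distinct suggestion text, preserving order."""
    seen = set()
    unique = []
    for option in options:
        if option["text"] not in seen:
            seen.add(option["text"])
            unique.append(option)
    return unique


# Texts and ids taken from the Example 1 response.
options = [
    {"text": "test my book1", "_id": "6Xu6mmUBYLvVFwGWpXeL"},
    {"text": "test my book1", "_id": "6nu8mmUBYLvVFwGWSndC"},
    {"text": "test my book1 english", "_id": "63u8mmUBYLvVFwGWZHdC"},
]
print([option["text"] for option in dedupe_options(options)])
# ['test my book1', 'test my book1 english']
```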

Example 3: a document that stores phrases as suggestion inputs

POST /book5/music/63u8mmUBYLvVFwGWZHdC?refresh
{
    "suggest" : {
        "input": [ "book1 english", "test english" ],
        "weight" : 20
    }
}

Query:

POST book5/_search?pretty
{
    "suggest": {
        "song-suggest" : {
            "prefix" : "test", 
            "completion" : { 
                "field" : "suggest" ,
                "skip_duplicates": true
            }
        }
    }
}

Result:

{
    "took": 7,
    "timed_out": false,
    "_shards": {
        "total": 5,
        "successful": 5,
        "skipped": 0,
        "failed": 0
    },
    "hits": {
        "total": 0,
        "max_score": 0,
        "hits": []
    },
    "suggest": {
        "song-suggest": [
            {
                "text": "test",
                "offset": 0,
                "length": 4,
                "options": [
                    {
                        "text": "test english",
                        "_index": "book5",
                        "_type": "music",
                        "_id": "63u8mmUBYLvVFwGWZHdC",
                        "_score": 20,
                        "_source": {
                            "suggest": {
                                "input": [
                                    "book1 english",
                                    "test english"
                                ],
                                "weight": 20
                            }
                        }
                    },
                    {
                        "text": "test my book1",
                        "_index": "book5",
                        "_type": "music",
                        "_id": "6Xu6mmUBYLvVFwGWpXeL",
                        "_score": 1,
                        "_source": {
                            "suggest": "test my book1"
                        }
                    }
                ]
            }
        ]
    }
}

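4. Summary and recommendations

Using the Completion Suggester well is therefore not easy. In real application development you need to combine analyzer and mapping parameters flexibly, based on the characteristics of the data and the business requirements, and tune repeatedly before the completion results become satisfactory.

Back to the completion and correction features of the search box mentioned at the top of this post: how would you implement them with ES? One approach I can think of:

While the user is just starting to type, use the Completion Suggester for keyword prefix matching. At first there will be many matches, and they become fewer as the user types more characters. If the input is fairly precise, the Completion Suggester results may already be good enough, and the user can already see the desired candidates.
If the Completion Suggester returns zero matches, the user may have mistyped something; at that point try the Phrase Suggester.
If the Phrase Suggester does not find any option, fall back to the Term Suggester.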
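
The three steps above can be sketched as a simple cascade. This is a minimal sketch under stated assumptions, not a production implementation: `search` is a hypothetical callable that takes a request body and returns the parsed response as a dict, the suggestion name `s` is arbitrary, and the field names `suggest` and `passage` are borrowed from the examples in this post (there they live in two different indices, which the sketch glosses over).

```python
def suggest_with_fallback(search, text):
    """Try completion, then phrase, then term; return the first non-empty result."""
    stages = [
        ("completion", {"prefix": text, "completion": {"field": "suggest"}}),
        ("phrase", {"text": text, "phrase": {"field": "passage"}}),
        ("term", {"text": text, "term": {"field": "passage"}}),
    ]
    for kind, suggester in stages:
        response = search({"suggest": {"s": suggester}})
        options = [
            option["text"]
            for entry in response.get("suggest", {}).get("s", [])
            for option in entry.get("options", [])
        ]
        if options:
            return kind, options
    return None, []


def fake_search(body):
    """Stub standing in for a real cluster: only the term suggester finds a match."""
    suggester = body["suggest"]["s"]
    options = [{"text": "elasticsearch"}] if "term" in suggester else []
    return {"suggest": {"s": [{"options": options}]}}


print(suggest_with_fallback(fake_search, "elasticsearaach"))
# ('term', ['elasticsearch'])
```

Passing the search function in as a parameter keeps the cascade independent of any particular client library and makes it easy to exercise with a stub, as shown.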

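In terms of precision, Completion > Phrase > Term, while for recall the order is reversed. In terms of performance, the Completion Suggester is the fastest; if it meets the business requirements, using only the Completion Suggester for prefix matching is ideal. The Phrase and Term suggesters search the inverted index, so their performance should be considerably lower by comparison. Keep the amount of data in the indices used by suggesters as small as possible; ideally, after some warm-up time the whole index can be mapped into memory.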
