elasticsearch學習筆記高級篇（十二）——掌握phrase matching搜索技術

時間 2019-12-04

標籤 elasticsearch 學習筆記高級十二掌握 phrase matching 搜索技術欄目日誌分析简体版

原文原文鏈接

一、什麼是近似搜索

假設有兩個句子java

java is my favourite programming langurage, and I also think spark is a very good big data system.

java spark are very related, because scala is spark's programming langurage and scala is also based on jvm like java.

適用match query 搜索java sparkjvm

{
    {
        "match": {
            "content": "java spark"
        }
    }
}

match query 只能搜索到包含java和spark的document,可是不知道java和spark是否是離得很近。
假設咱們想要java和spark離得很近的document優先返回，就要給它一個更高的relevance score,這就涉及到了proximity match近似匹配。
下面給出要實現的兩個需求：
（1）搜索java spark，就靠在一塊兒，中間不能插入任何其它字符
（2）搜索java spark，要求java和spark兩個單詞靠的越近，doc的分數越高，排名越靠前spa

二、match phrase

準備數據：scala

PUT /test_index/_create/1
{
  "content": "java is my favourite programming language, and I also think spark is a very good big data system."
}
PUT /test_index/_create/2
{
  "content": "java spark are very related, because scala is spark's programming language and scala is also based on jvm like java."
}

對於需求1 搜索java spark，就靠在一塊兒，中間不能插入任何其它字符：
使用match query搜索沒法實現code

GET /test_index/_search
{
  "query": {
    "match": {
      "content": "java spark"
    }
  }
}

結果：索引

{
  "took" : 16,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 2,
      "relation" : "eq"
    },
    "max_score" : 0.4255141,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.4255141,
        "_source" : {
          "content" : "java spark are very related, because scala is spark's programming language and scala is also based on jvm like java."
        }
      },
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.37266707,
        "_source" : {
          "content" : "java is my favourite programming language, and I also think spark is a very good big data system."
        }
      }
    ]
  }
}

使用match phrase搜索就能夠實現token

GET /test_index/_search
{
  "query": {
    "match_phrase": {
      "content": "java spark"
    }
  }
}

結果:ip

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.35695744,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "2",
        "_score" : 0.35695744,
        "_source" : {
          "content" : "java spark are very related, because scala is spark's programming language and scala is also based on jvm like java."
        }
      }
    ]
  }
}

三、term position

假設咱們有兩個documentstring

doc1: hello world, java spark
doc2: hi, spark java

hello doc1(0)
world doc1(1)
java  doc1(2) doc2(2)
spark doc1(3) doc2(1)

position詳情以下：it

GET /_analyze
{
  "text": ["hello world, java spark"],
  "analyzer": "standard"
}

{
  "tokens" : [
    {
      "token" : "hello",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "world",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "java",
      "start_offset" : 13,
      "end_offset" : 17,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "spark",
      "start_offset" : 18,
      "end_offset" : 23,
      "type" : "<ALPHANUM>",
      "position" : 3
    }
  ]
}

GET /_analyze
{
  "text": ["hi, spark java"],
  "analyzer": "standard"
}

{
  "tokens" : [
    {
      "token" : "hi",
      "start_offset" : 0,
      "end_offset" : 2,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "spark",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "java",
      "start_offset" : 10,
      "end_offset" : 14,
      "type" : "<ALPHANUM>",
      "position" : 2
    }
  ]
}

四、match phrase基本原理

索引中的position，match_phrase

hello world, java spark        doc1
hi, spark java                doc2

hello         doc1(0)        
wolrd        doc1(1)
java        doc1(2) doc2(2)
spark        doc1(3) doc2(1)

使用match_phrase查詢要求找到每一個term都在一個共有的那些doc,就是要求一個doc，必需要包含查詢的每一個term，而且知足位置運算。

doc1 --> java和spark --> spark position恰巧比java大1 --> java的position是2，spark的position是3，剛好知足條件
doc1符合條件
doc2 --> java和spark --> java position是2，spark position是1，spark position比java position小1，而不是大1 --> 光是position就不知足，那麼doc2不匹配
doc2不符合條件

五、slop參數

含義：query string搜索文本中的幾個term,要通過幾回移動才能與一個document匹配，這個移動的次數就是slop。
實際舉一個例子：
對於hello world, java is very good, spark is also very good. 假設咱們要用match phrase 匹配到java spark。能夠發現直接進行查詢會查不到

PUT /test_index/_create/1
{
  "content": "hello world, java is very good, spark is also very good."
}

GET /test_index/_search
{
  "query": {
    "match_phrase": {
      "content": "java spark"
    }
  }
}

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 0,
      "relation" : "eq"
    },
    "max_score" : null,
    "hits" : [ ]
  }
}

此時使用

GET /_analyze
{
  "text": ["hello world, java is very good, spark is also very good."],
  "analyzer": "standard"
}

結果：

{
  "tokens" : [
    {
      "token" : "hello",
      "start_offset" : 0,
      "end_offset" : 5,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "world",
      "start_offset" : 6,
      "end_offset" : 11,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "java",
      "start_offset" : 13,
      "end_offset" : 17,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "is",
      "start_offset" : 18,
      "end_offset" : 20,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "very",
      "start_offset" : 21,
      "end_offset" : 25,
      "type" : "<ALPHANUM>",
      "position" : 4
    },
    {
      "token" : "good",
      "start_offset" : 26,
      "end_offset" : 30,
      "type" : "<ALPHANUM>",
      "position" : 5
    },
    {
      "token" : "spark",
      "start_offset" : 32,
      "end_offset" : 37,
      "type" : "<ALPHANUM>",
      "position" : 6
    },
    {
      "token" : "is",
      "start_offset" : 38,
      "end_offset" : 40,
      "type" : "<ALPHANUM>",
      "position" : 7
    },
    {
      "token" : "also",
      "start_offset" : 41,
      "end_offset" : 45,
      "type" : "<ALPHANUM>",
      "position" : 8
    },
    {
      "token" : "very",
      "start_offset" : 46,
      "end_offset" : 50,
      "type" : "<ALPHANUM>",
      "position" : 9
    },
    {
      "token" : "good",
      "start_offset" : 51,
      "end_offset" : 55,
      "type" : "<ALPHANUM>",
      "position" : 10
    }
  ]
}

java        is        very        good        spark        is

java        spark
java        -->        spark
java                -->            spark
java                            -->            spark

能夠發現java的position是2，spark的position是6，那麼咱們只須要設置slop大於等於3（也就是移動3詞就能夠了）就能夠搜到了

GET /test_index/_search
{
  "query": {
    "match_phrase": {
      "content": {
        "query": "java spark",
        "slop": 3
      }
    }
  }
}

結果：

{
  "took" : 1,
  "timed_out" : false,
  "_shards" : {
    "total" : 1,
    "successful" : 1,
    "skipped" : 0,
    "failed" : 0
  },
  "hits" : {
    "total" : {
      "value" : 1,
      "relation" : "eq"
    },
    "max_score" : 0.21824157,
    "hits" : [
      {
        "_index" : "test_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 0.21824157,
        "_source" : {
          "content" : "hello world, java is very good, spark is also very good."
        }
      }
    ]
  }
}

此時加上slop的match phrase就是proximity match近似匹配了。加上slop以後雖然是近似匹配能夠搜索到不少結果，可是距離越近的會優先返回，也就是相關度分數就會越高。