Suppose we have two sentences:
java is my favourite programming language, and I also think spark is a very good big data system.
java spark are very related, because scala is spark's programming language and scala is also based on jvm like java.
Use a match query to search for java spark:
{ "query": { "match": { "content": "java spark" } } }
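A match query only tells us that a document contains java or spark (its default operator is OR). It says nothing about how close together java and spark are.
If we want documents in which java and spark are close together to be returned first, those documents need a higher relevance score. That is where proximity match comes in.
Here are the two requirements we want to meet:
(1) Search for java spark where the two terms are right next to each other, with no other token in between.
(2) Search for java spark where the closer the two terms are, the higher the document's score and the higher it ranks.
Prepare the data: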
PUT /test_index/_create/1
{ "content": "java is my favourite programming language, and I also think spark is a very good big data system." }

PUT /test_index/_create/2
{ "content": "java spark are very related, because scala is spark's programming language and scala is also based on jvm like java." }
For requirement 1 (java spark right next to each other, with no other token in between):
A match query cannot do this:
GET /test_index/_search
{ "query": { "match": { "content": "java spark" } } }
Result:
{ "took" : 16, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 2, "relation" : "eq" }, "max_score" : 0.4255141, "hits" : [ { "_index" : "test_index", "_type" : "_doc", "_id" : "2", "_score" : 0.4255141, "_source" : { "content" : "java spark are very related, because scala is spark's programming language and scala is also based on jvm like java." } }, { "_index" : "test_index", "_type" : "_doc", "_id" : "1", "_score" : 0.37266707, "_source" : { "content" : "java is my favourite programming language, and I also think spark is a very good big data system." } } ] } }
A match_phrase query can:
GET /test_index/_search
{ "query": { "match_phrase": { "content": "java spark" } } }
Result:
{ "took" : 1, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 1, "relation" : "eq" }, "max_score" : 0.35695744, "hits" : [ { "_index" : "test_index", "_type" : "_doc", "_id" : "2", "_score" : 0.35695744, "_source" : { "content" : "java spark are very related, because scala is spark's programming language and scala is also based on jvm like java." } } ] } }
Suppose we have two documents:
doc1: hello world, java spark
doc2: hi, spark java

hello   doc1(0)
world   doc1(1)
java    doc1(2) doc2(2)
spark   doc1(3) doc2(1)

The position details are as follows:
GET /_analyze
{ "text": ["hello world, java spark"], "analyzer": "standard" }
{ "tokens" : [ { "token" : "hello", "start_offset" : 0, "end_offset" : 5, "type" : "<ALPHANUM>", "position" : 0 }, { "token" : "world", "start_offset" : 6, "end_offset" : 11, "type" : "<ALPHANUM>", "position" : 1 }, { "token" : "java", "start_offset" : 13, "end_offset" : 17, "type" : "<ALPHANUM>", "position" : 2 }, { "token" : "spark", "start_offset" : 18, "end_offset" : 23, "type" : "<ALPHANUM>", "position" : 3 } ] }
GET /_analyze
{ "text": ["hi, spark java"], "analyzer": "standard" }
{ "tokens" : [ { "token" : "hi", "start_offset" : 0, "end_offset" : 2, "type" : "<ALPHANUM>", "position" : 0 }, { "token" : "spark", "start_offset" : 4, "end_offset" : 9, "type" : "<ALPHANUM>", "position" : 1 }, { "token" : "java", "start_offset" : 10, "end_offset" : 14, "type" : "<ALPHANUM>", "position" : 2 } ] }
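To make the position and offset fields in these responses less mysterious, here is a minimal Python sketch of a tokenizer that produces the same numbers for the two sample texts. It is only a rough stand-in for the standard analyzer: the function name `analyze` and the regular expression are my own assumptions, and it ignores everything the real analyzer does beyond lowercasing runs of ASCII letters and digits.

```python
import re


def analyze(text):
    """Rough stand-in for the standard analyzer (assumption: ASCII text only).

    Each run of letters/digits becomes one lowercased token. The position is
    the token's ordinal in the text; the offsets are character indexes.
    """
    tokens = []
    for position, match in enumerate(re.finditer(r"[A-Za-z0-9]+", text)):
        tokens.append({
            "token": match.group().lower(),
            "start_offset": match.start(),
            "end_offset": match.end(),
            "position": position,
        })
    return tokens


if __name__ == "__main__":
    for tok in analyze("hello world, java spark"):
        print(tok["token"], tok["position"])
```

The comma is skipped as a separator but does not consume a position, which is why java sits at position 2 directly after world at position 1.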
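Positions in the index, as used by match_phrase:
doc1: hello world, java spark
doc2: hi, spark java

hello   doc1(0)
world   doc1(1)
java    doc1(2) doc2(2)
spark   doc1(3) doc2(1)

A match_phrase query first finds the documents that contain every term in the query. In other words, a document must contain each query term and must also pass the position check.
doc1 --> java and spark --> spark's position must be exactly 1 greater than java's --> java is at position 2 and spark is at position 3, so the condition holds. doc1 matches.
doc2 --> java and spark --> java is at position 2 and spark is at position 1, so spark's position is 1 less than java's rather than 1 greater --> the positions alone already fail, so doc2 does not match.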
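The position check described above can be sketched in a few lines of Python. This is a toy positional index, not how Elasticsearch or Lucene is implemented: `build_index` and `phrase_match` are hypothetical names, and the tokenizing is a crude split on whitespace and commas.

```python
from collections import defaultdict


def build_index(docs):
    """Build a toy positional index: term -> {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        terms = text.lower().replace(",", " ").split()
        for position, term in enumerate(terms):
            index[term].setdefault(doc_id, []).append(position)
    return index


def phrase_match(index, phrase):
    """Return ids of docs where the phrase terms occur at consecutive positions."""
    terms = phrase.lower().split()
    # Step 1: keep only the documents that contain every term.
    candidates = set(index.get(terms[0], {}))
    for term in terms[1:]:
        candidates &= set(index.get(term, {}))
    # Step 2: the i-th term must sit exactly i positions after the first term.
    hits = []
    for doc_id in sorted(candidates):
        for start in index[terms[0]][doc_id]:
            if all(start + i in index[t][doc_id] for i, t in enumerate(terms)):
                hits.append(doc_id)
                break
    return hits


if __name__ == "__main__":
    docs = {"doc1": "hello world, java spark", "doc2": "hi, spark java"}
    index = build_index(docs)
    print(phrase_match(index, "java spark"))
```

With the two sample documents, only doc1 passes for java spark, because doc2 has spark before java.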
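What slop means: the number of times the terms in the query string have to be moved before they match a document. That number of moves is the slop.
Here is a concrete example:
Take the text hello world, java is very good, spark is also very good. and suppose we want match_phrase to match java spark. Querying directly finds nothing. (The results below contain only this document, so test_index appears to have been recreated before this step. Otherwise _create would be rejected, because id 1 already exists.)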
PUT /test_index/_create/1
{ "content": "hello world, java is very good, spark is also very good." }

GET /test_index/_search
{ "query": { "match_phrase": { "content": "java spark" } } }
{ "took" : 1, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 0, "relation" : "eq" }, "max_score" : null, "hits" : [ ] } }
Now run _analyze on the text to see the position of each token:
GET /_analyze
{ "text": ["hello world, java is very good, spark is also very good."], "analyzer": "standard" }
Result:
{ "tokens" : [ { "token" : "hello", "start_offset" : 0, "end_offset" : 5, "type" : "<ALPHANUM>", "position" : 0 }, { "token" : "world", "start_offset" : 6, "end_offset" : 11, "type" : "<ALPHANUM>", "position" : 1 }, { "token" : "java", "start_offset" : 13, "end_offset" : 17, "type" : "<ALPHANUM>", "position" : 2 }, { "token" : "is", "start_offset" : 18, "end_offset" : 20, "type" : "<ALPHANUM>", "position" : 3 }, { "token" : "very", "start_offset" : 21, "end_offset" : 25, "type" : "<ALPHANUM>", "position" : 4 }, { "token" : "good", "start_offset" : 26, "end_offset" : 30, "type" : "<ALPHANUM>", "position" : 5 }, { "token" : "spark", "start_offset" : 32, "end_offset" : 37, "type" : "<ALPHANUM>", "position" : 6 }, { "token" : "is", "start_offset" : 38, "end_offset" : 40, "type" : "<ALPHANUM>", "position" : 7 }, { "token" : "also", "start_offset" : 41, "end_offset" : 45, "type" : "<ALPHANUM>", "position" : 8 }, { "token" : "very", "start_offset" : 46, "end_offset" : 50, "type" : "<ALPHANUM>", "position" : 9 }, { "token" : "good", "start_offset" : 51, "end_offset" : 55, "type" : "<ALPHANUM>", "position" : 10 } ] }
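java    is      very    good    spark   is
java    spark
java    -->     spark
java            -->     spark
java                    -->     spark

We can see that java is at position 2 and spark is at position 6. The query expects spark right after java, at position 3, so spark has to be moved 3 times. Setting slop to 3 or more is therefore enough for the document to be found:
GET /test_index/_search
{ "query": { "match_phrase": { "content": { "query": "java spark", "slop": 3 } } } }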
Result:
{ "took" : 1, "timed_out" : false, "_shards" : { "total" : 1, "successful" : 1, "skipped" : 0, "failed" : 0 }, "hits" : { "total" : { "value" : 1, "relation" : "eq" }, "max_score" : 0.21824157, "hits" : [ { "_index" : "test_index", "_type" : "_doc", "_id" : "1", "_score" : 0.21824157, "_source" : { "content" : "hello world, java is very good, spark is also very good." } } ] } }
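The slop arithmetic in this example can be written down as a small Python sketch. It is a simplified model and not Lucene's actual matcher: it assumes each query term occurs exactly once in the document, and the names `phrase_distance` and `matches` are made up for illustration. The idea is to subtract each term's offset within the query from its position in the document; if the phrase appears exactly, all the shifted values are equal, and otherwise the spread between them is the number of moves needed.

```python
def phrase_distance(query_terms, doc_positions):
    """Moves needed to line the query up with the document (simplified model).

    doc_positions maps each term to its single position in the document.
    """
    shifted = [doc_positions[term] - offset
               for offset, term in enumerate(query_terms)]
    return max(shifted) - min(shifted)


def matches(query_terms, doc_positions, slop=0):
    """A document matches when the required moves do not exceed slop."""
    return phrase_distance(query_terms, doc_positions) <= slop


if __name__ == "__main__":
    # Positions taken from the _analyze output above: java=2, spark=6.
    positions = {"java": 2, "spark": 6}
    print(phrase_distance(["java", "spark"], positions))
    print(matches(["java", "spark"], positions, slop=2))
    print(matches(["java", "spark"], positions, slop=3))
```

Under this model the reversed pair from the earlier doc2 (spark at 1, java at 2) comes out at a distance of 2, which agrees with the rule of thumb that transposed terms need a slop of 2.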
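A match_phrase query with slop added is a proximity match. Because slop loosens the match, the query can return many more documents, but documents in which the terms are closer together get a higher relevance score and are returned first.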
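To illustrate why closer terms rank higher, here is a small Python sketch of a distance-based weight. As far as I know, Lucene weights a sloppy phrase match by roughly 1 / (1 + distance), but treat that exact formula as an assumption here. The real score also depends on BM25 factors such as field length and term statistics, which this sketch leaves out, and the document labels are made up.

```python
def sloppy_weight(distance):
    """Assumed proximity weight: 1.0 for an exact phrase, smaller as terms drift apart."""
    return 1.0 / (1.0 + distance)


if __name__ == "__main__":
    # Hypothetical documents, labelled by how many moves java spark needs in each.
    distances = {"adjacent": 0, "three_apart": 3, "five_apart": 5}
    ranked = sorted(distances,
                    key=lambda doc: sloppy_weight(distances[doc]),
                    reverse=True)
    for doc in ranked:
        print(doc, sloppy_weight(distances[doc]))
```

The weight only decreases as the distance grows, so among documents that all fall within the slop, the one with the terms closest together comes out on top.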