Elasticsearch IK Word Segmentation

Date: 2019-12-17


Some operations that are handy when working with IK:

1. Check the cluster health
GET /_cat/health?v&pretty

2. View the mapping and settings of my_index
GET /my_index?pretty

3. List all indices
GET /_cat/indices?v&pretty

4. Delete my_index_new
DELETE /my_index_new?pretty

First, test the basic functionality of the IK analyzer:

GET _analyze?pretty
{
  "analyzer": "ik_smart",
  "text": "中華人民共和國國歌"
}

Result:

{
  "tokens": [
    {
      "token": "中華人民共和國",
      "start_offset": 0,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "國歌",
      "start_offset": 7,
      "end_offset": 9,
      "type": "CN_WORD",
      "position": 1
    }
  ]
}

As you can see, ik_smart segments "中華人民共和國國歌" sensibly into two words.
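The start_offset and end_offset values in the response are character positions in the original text, with end_offset exclusive. A minimal Python sketch, assuming the response has been copied into a dict by hand (check_offsets is a hypothetical helper, not part of any Elasticsearch client), confirms that slicing the text by these offsets gives back each token. This holds for characters in the Basic Multilingual Plane, which covers the text used here.

```python
# Tokens copied by hand from the ik_smart response above.
text = "中華人民共和國國歌"
response = {
    "tokens": [
        {"token": "中華人民共和國", "start_offset": 0, "end_offset": 7},
        {"token": "國歌", "start_offset": 7, "end_offset": 9},
    ]
}


def check_offsets(text, tokens):
    """Return True if every token equals text[start_offset:end_offset]."""
    return all(
        text[t["start_offset"]:t["end_offset"]] == t["token"] for t in tokens
    )


print(check_offsets(text, response["tokens"]))
```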

Another example:

GET _analyze?pretty
{
  "analyzer": "ik_smart",
  "text": "王者榮耀是最好玩的遊戲"
}

Result:

{
  "tokens": [
    {
      "token": "王者榮耀",
      "start_offset": 0,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "最",
      "start_offset": 5,
      "end_offset": 6,
      "type": "CN_CHAR",
      "position": 1
    },
    {
      "token": "好玩",
      "start_offset": 6,
      "end_offset": 8,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "遊戲",
      "start_offset": 9,
      "end_offset": 11,
      "type": "CN_WORD",
      "position": 3
    }
  ]
}

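If your result differs from mine, that is expected: the stock IK dictionary splits "王者榮耀" into separate words, but we do not want it split. This can be configured by following the instructions on GitHub.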

IKAnalyzer.cfg.xml is located in elasticsearch-5.4.0/plugins/ik/config:

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE properties SYSTEM "http://java.sun.com/dtd/properties.dtd">
<properties>
    <comment>IK Analyzer extension configuration</comment>
    <!-- Configure your own extension dictionaries here -->
    <entry key="ext_dict">custom/mydict.dic;custom/single_word_low_freq.dic</entry>
    <!-- Configure your own extension stopword dictionaries here -->
    <entry key="ext_stopwords">custom/ext_stopword.dic</entry>
    <!-- Configure a remote extension dictionary here; the one below is served from an nginx path -->
    <entry key="remote_ext_dict">http://tagtic-slave01:82/HotWords.php</entry>
    <!-- Configure a remote extension stopword dictionary here -->
    <!-- <entry key="remote_ext_stopwords">words_location</entry> -->
    <entry key="remote_ext_stopwords">http://tagtic-slave01:82/StopWords.php</entry>
</properties>

Here is HotWords.php:

<?php 
$s = <<<'EOF'
王者榮耀
陰陽師
EOF;
header("Content-type: text/html; charset=utf-8"); 
header('Last-Modified: '.gmdate('D, d M Y H:i:s', time()).' GMT', true, 200);
header('ETag: "5816f349-19"');
echo $s;
?>
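According to the IK plugin's README, a remote dictionary endpoint should return Last-Modified and ETag headers, and the plugin fetches the word list again when either of them changes; the body holds one word per line. The following Python sketch only illustrates that comparison. It is not the plugin's code, and dictionary_changed and parse_words are hypothetical names. Note that a script that derives Last-Modified from the current time, as the one above does, would look changed on every poll under this rule.

```python
def dictionary_changed(previous, current):
    """Compare the two validators between polls.

    `previous` and `current` are dicts of response headers; a change in
    either Last-Modified or ETag counts as a change.
    """
    return any(
        previous.get(name) != current.get(name)
        for name in ("Last-Modified", "ETag")
    )


def parse_words(body):
    """Split a dictionary response body into words, one per line."""
    return [line.strip() for line in body.splitlines() if line.strip()]


# Example header values; the timestamps are made up for illustration.
first = {"Last-Modified": "Tue, 17 Dec 2019 08:00:00 GMT", "ETag": '"5816f349-19"'}
second = {"Last-Modified": "Tue, 17 Dec 2019 08:01:00 GMT", "ETag": '"5816f349-19"'}

print(dictionary_changed(first, first))
print(dictionary_changed(first, second))
print(parse_words("王者榮耀\n陰陽師\n"))
```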

Once this is configured, you will get the result shown earlier.
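To see why adding a word to the dictionary changes the segmentation, here is a toy forward-maximum-matching segmenter in Python. It is a simplified stand-in and not IK's actual algorithm; the base dictionary is a hypothetical stock word list, and stopwords such as "是" and "的" are not removed the way they are in the real output.

```python
def segment(text, dictionary):
    """Greedy forward maximum matching over a set of known words."""
    longest = max(map(len, dictionary), default=1)
    tokens, i = [], 0
    while i < len(text):
        # Try the longest candidate first, fall back to a single character.
        for size in range(min(longest, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in dictionary:
                tokens.append(piece)
                i += size
                break
    return tokens


base = {"王者", "榮耀", "好玩", "遊戲"}  # hypothetical stock dictionary
text = "王者榮耀是最好玩的遊戲"

print(segment(text, base))
print(segment(text, base | {"王者榮耀"}))
```

With the base dictionary the title is cut into "王者" and "榮耀"; once "王者榮耀" is added, the longer match wins and the word stays whole.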

While we are at it, let's test ik_max_word:

GET /index/_analyze?pretty
{
  "analyzer": "ik_max_word",
  "text": "中華人民共和國國歌"
}

The result, for reference:

{
  "tokens": [
    {
      "token": "中華人民共和國",
      "start_offset": 0,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 0
    },
    {
      "token": "中華人民",
      "start_offset": 0,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 1
    },
    {
      "token": "中華",
      "start_offset": 0,
      "end_offset": 2,
      "type": "CN_WORD",
      "position": 2
    },
    {
      "token": "華人",
      "start_offset": 1,
      "end_offset": 3,
      "type": "CN_WORD",
      "position": 3
    },
    {
      "token": "人民共和國",
      "start_offset": 2,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 4
    },
    {
      "token": "人民",
      "start_offset": 2,
      "end_offset": 4,
      "type": "CN_WORD",
      "position": 5
    },
    {
      "token": "共和國",
      "start_offset": 4,
      "end_offset": 7,
      "type": "CN_WORD",
      "position": 6
    },
    {
      "token": "共和",
      "start_offset": 4,
      "end_offset": 6,
      "type": "CN_WORD",
      "position": 7
    },
    {
      "token": "國",
      "start_offset": 6,
      "end_offset": 7,
      "type": "CN_CHAR",
      "position": 8
    },
    {
      "token": "國歌",
      "start_offset": 7,
      "end_offset": 9,
      "type": "CN_WORD",
      "position": 9
    }
  ]
}
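The ik_max_word output reads like an exhaustive listing of every dictionary word found in the text, ordered by start offset with longer words first. A rough Python sketch of that idea follows. It is only an illustration under an assumed word list, not IK's implementation, and it does not reproduce the single-character CN_CHAR token "國" seen above.

```python
def all_matches(text, dictionary):
    """List every dictionary word found in text as (token, start, end),
    ordered by start offset, longest match first."""
    found = []
    for start in range(len(text)):
        for end in range(len(text), start, -1):
            if text[start:end] in dictionary:
                found.append((text[start:end], start, end))
    return found


# Hypothetical dictionary containing the words seen in the response above.
words = {"中華人民共和國", "中華人民", "中華", "華人", "人民共和國",
         "人民", "共和國", "共和", "國歌"}

for token, start, end in all_matches("中華人民共和國國歌", words):
    print(token, start, end)
```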

Now let's look at an example from GitHub:

POST /index/fulltext/_mapping
{
  "fulltext": {
    "_all": {
      "analyzer": "ik_smart"
    },
    "properties": {
      "content": {
        "type": "text"
      }
    }
  }
}

Index some documents:

POST /index/fulltext/1
{
  "content": "美國留給伊拉克的是個爛攤子嗎"
}

POST /index/fulltext/2
{
  "content": "公安部：各地校車將享最高路權"
}

POST /index/fulltext/3
{
  "content": "中韓漁警衝突調查：韓警平均天天扣1艘中國漁船"
}

POST /index/fulltext/4
{
  "content": "中國駐洛杉磯領事館遭亞裔男子槍擊 嫌犯已自首"
}

Search:

POST /index/fulltext/_search
{
  "query": {
    "match": {
      "content": "中國"
    }
  }
}

Result:

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 1.0869478,
    "hits": [
      {
        "_index": "index",
        "_type": "fulltext",
        "_id": "4",
        "_score": 1.0869478,
        "_source": {
          "content": "中國駐洛杉磯領事館遭亞裔男子槍擊 嫌犯已自首"
        }
      },
      {
        "_index": "index",
        "_type": "fulltext",
        "_id": "3",
        "_score": 0.61094594,
        "_source": {
          "content": "中韓漁警衝突調查：韓警平均天天扣1艘中國漁船"
        }
      },
      {
        "_index": "index",
        "_type": "fulltext",
        "_id": "1",
        "_score": 0.27179778,
        "_source": {
          "content": "美國留給伊拉克的是個爛攤子嗎"
        }
      }
    ]
  }
}

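Elasticsearch indexes documents by their tokens and returns the matches ranked by relevance score against your query. Note that document 1 matches only through the single character "國": the mapping above sets ik_smart on _all but specifies no analyzer for content, so content is analyzed with the default standard analyzer, which indexes Chinese text one character at a time.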
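One way to see why document 1 appears in the hits while document 2 does not: assume the content field is tokenized one character at a time (as the standard analyzer does for CJK text) and that a match query ORs its terms. The Python sketch below models only that matching step; scoring is not modeled, and single_char_tokens is a rough stand-in rather than the real tokenizer.

```python
docs = {
    1: "美國留給伊拉克的是個爛攤子嗎",
    2: "公安部：各地校車將享最高路權",
    3: "中韓漁警衝突調查：韓警平均天天扣1艘中國漁船",
    4: "中國駐洛杉磯領事館遭亞裔男子槍擊 嫌犯已自首",
}


def single_char_tokens(text):
    """Keep each letter, digit or CJK ideograph as its own token and
    drop punctuation and whitespace."""
    return {ch for ch in text if ch.isalnum()}


def match(query, docs):
    """Return ids of documents sharing at least one token with the query."""
    terms = single_char_tokens(query)
    return sorted(i for i, body in docs.items() if terms & single_char_tokens(body))


print(match("中國", docs))
```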

The project page has an example that is worth studying: https://github.com/medcl/elasticsearch-analysis-ik

Here is another interesting example:

PUT /index1
{
  "settings": {
     "refresh_interval": "5s",
     "number_of_shards" :   1, 
     "number_of_replicas" : 0 
  },
  "mappings": {
    "_default_":{
      "_all": { "enabled":  false } 
    },
    "resource": {
      "dynamic": false, 
      "properties": {
        "title": {
          "type": "text",
          "fields": {
            "cn": {
              "type": "text",
              "analyzer": "ik_smart"
            },
            "en": {
              "type": "text",
              "analyzer": "english"
            }
          }
        }
      }
    }
  }
}

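Multi-fields (fields) serve two purposes:

1. A single string can be mapped as a text type for full-text search and as a keyword type for sorting and aggregations;
2. They act like aliases for the same field, each one using a different analyzer.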
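The first purpose can be sketched as a mapping fragment built in Python. The sub-field name raw is only an illustrative choice, and subfield_paths is a hypothetical helper that lists the paths a query could reference.

```python
import json

# Hypothetical "title" property: analyzed for search, plus a keyword
# sub-field named "raw" (an illustrative name) for sorting/aggregations
# and a "cn" sub-field analyzed with ik_smart.
title_property = {
    "type": "text",
    "fields": {
        "raw": {"type": "keyword"},
        "cn": {"type": "text", "analyzer": "ik_smart"},
    },
}


def subfield_paths(name, prop):
    """Return the full field paths that queries can reference."""
    return [name] + [f"{name}.{sub}" for sub in prop.get("fields", {})]


print(json.dumps({"properties": {"title": title_property}}, indent=2))
print(subfield_paths("title", title_property))
```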

Bulk insert some documents:

POST /_bulk
{ "create": { "_index": "index1", "_type": "resource", "_id": 1 } }
{ "title": "周星馳最新電影" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 2 } }
{ "title": "周星馳最好看的新電影" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 3 } }
{ "title": "周星馳最新電影，最好，新電影" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 4 } }
{ "title": "最最最最好的新新新新電影" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 5 } }
{ "title": "I'm not happy about the foxes" }

Search:

POST /index1/resource/_search
{
  "query": {
    "multi_match": {
      "type":     "most_fields", 
      "query":    "fox",
      "fields": "title"
    }
  }
}

Result:

{
  "took": 1,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0
  },
  "hits": {
    "total": 0,
    "max_score": null,
    "hits": []
  }
}

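The reason: the query looks for fox in title, and title uses the standard analyzer, so the indexed token is foxes and nothing matches. The following query does return a result: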

POST /index1/resource/_search
{
  "query": {
    "multi_match": {
      "type":     "most_fields", 
      "query":    "fox",
      "fields": "title.en"
    }
  }
}

The result is not listed here; the query matches because title.en uses the english analyzer, which stems foxes to fox.
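The difference between the two fields comes down to stemming. The Python sketch below contrasts a tokenizer that only lowercases and splits with one that also strips plural endings. The plural stripper is deliberately crude and only stands in for the real stemmer, and stopword removal is not modeled.

```python
import re


def standard_like(text):
    """Lowercase and split on non-word characters; no stemming."""
    return re.findall(r"[a-z0-9']+", text.lower())


def english_like(text):
    """Same tokens, then a crude plural stripper standing in for the
    real stemmer (which handles far more cases)."""
    def stem(word):
        if word.endswith("es") and len(word) > 3:
            return word[:-2]
        if word.endswith("s") and len(word) > 2:
            return word[:-1]
        return word
    return [stem(w) for w in standard_like(text)]


title = "I'm not happy about the foxes"
print("fox" in standard_like(title))
print("fox" in english_like(title))
```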

Compare the output of the queries below to get a feel for how fields work:

GET /index1/resource/_search
{
  "query": {
    "match": {
      "title.cn": "the最好遊戲"
    }
  }
}

POST /index1/resource/_search
{
  "query": {
    "multi_match": {
      "type":     "most_fields", 
      "query":    "the最新遊戲",
      "fields": [ "title", "title.cn", "title.en" ]
    }
  }
}

POST /index1/resource/_search
{
  "query": {
    "multi_match": {
      "type":     "most_fields", 
      "query":    "the最新",
      "fields": "title.cn"
    }
  }
}

Study the results to get a feel for the usage.

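Next we test with "王者榮耀". Here you can see that the HotWords.php configured earlier is a double-edged sword: once "王者榮耀" is in it, the word is treated as a single unit and is no longer split into "王者" and "榮耀". But what if you do want to search for "王者"? This is where fields show their strength, as shown below.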

First, index the data:

POST /_bulk
{ "create": { "_index": "index1", "_type": "resource", "_id": 6 } }
{ "title": "王者榮耀最好玩的遊戲" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 7 } }
{ "title": "王者榮耀最好玩的新遊戲" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 8 } }
{ "title": "王者榮耀最新遊戲，最好玩，新遊戲" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 9 } }
{ "title": "最最最最好的新新新新遊戲" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 10 } }
{ "title": "I'm not happy about the foxes" }

Queries:

POST /index1/resource/_search
{
  "query": {
    "multi_match": {
      "type":     "most_fields", 
      "query":    "王者榮耀",
      "fields": "title.cn"
    }
  }
}

# The following query returns no results
POST /index1/resource/_search
{
  "query": {
    "multi_match": {
      "type":     "most_fields", 
      "query":    "王者",
      "fields": "title.cn"
    }
  }
}

POST /index1/resource/_search
{
  "query": {
    "multi_match": {
      "type":     "most_fields", 
      "query":    "王者",
      "fields": "title"
    }
  }
}

Comparing the results makes the difference obvious; the results are omitted here.
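The three queries can be reasoned about with a small Python model. The token lists below are assumptions about what each analyzer produces (per-character tokens for title, dictionary words for title.cn); they were not captured from a real cluster, and hits is a hypothetical helper that checks whether any query term was indexed.

```python
# Assumed index-time tokens for the title "王者榮耀最好玩的遊戲".
indexed = {
    "title": ["王", "者", "榮", "耀", "最", "好", "玩", "的", "遊", "戲"],
    "title.cn": ["王者榮耀", "最", "好玩", "遊戲"],
}

# Assumed query-time tokens produced by each field's analyzer.
query_terms = {
    ("title.cn", "王者榮耀"): ["王者榮耀"],
    ("title.cn", "王者"): ["王者"],
    ("title", "王者"): ["王", "者"],
}


def hits(field, query):
    """True if any query term appears among the field's indexed tokens."""
    return any(term in indexed[field] for term in query_terms[(field, query)])


for field, query in query_terms:
    print(field, query, hits(field, query))
```

Under these assumptions "王者" finds nothing in title.cn because only the whole word "王者榮耀" was indexed there, while the same query against title matches on the individual characters.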

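So you need a solid understanding of the business requirements from the start in order to design a good mapping, and a good mapping also saves a lot of effort at search time.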
