Elasticsearch ships with a number of built-in analyzers, such as `standard` (the standard analyzer), `english` (English analysis), and `chinese` (Chinese analysis). `standard` naively splits text into individual words (for Chinese, individual characters), so it applies broadly but with low precision; `english` is smarter about English text: it recognizes singular and plural forms, handles case, filters stopwords (such as "the"), and so on; `chinese` performs poorly, as demonstrated below. This post covers three things: installing the ik Chinese analyzer, comparing the output of different analyzers, and arriving at a good configuration. I have written two earlier posts on Elasticsearch, one on installation, running, and basic configuration, and one on backup and restore, for anyone who needs them.
Installing the ik Chinese Analyzer
Elasticsearch's built-in Chinese analysis is poor, so we need to install ik. First download the project from GitHub and unpack it:
```
cd /tmp
wget https://github.com/medcl/elasticsearch-analysis-ik/archive/master.zip
unzip master.zip
cd elasticsearch-analysis-ik/
```
Then run `mvn package` to build the jar, elasticsearch-analysis-ik-1.4.0.jar:

```
mvn package
```
Copy the jar into Elasticsearch's `plugins/analysis-ik` directory, and copy the unpacked ik directory (configuration, dictionaries, and so on) into Elasticsearch's `config` directory. Then edit the configuration file `elasticsearch.yml` and append one line:

```
index.analysis.analyzer.ik.type : "ik"
```
Restart with `service elasticsearch restart`. Done.
If the `mvn` step above gives you trouble, you can grab a prebuilt jar and the configuration files directly from the elasticsearch-rtf project (that is what I did).
[Update, the evening of 2014-12-14: today, a Sunday, I installed the ik analyzer on my VPS following exactly the same steps, but kept getting MapperParsingException[Analyzer [ik] not found for field [cn]]. Back at the office in the evening, I noticed that the Elasticsearch on my company VM was version 1.3.2 while the VPS ran 1.3.4. Suspecting a version issue, I reinstalled the VPS with the latest 1.4.1, installed ik again, and it worked...]
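After restarting, a quick way to confirm that the plugin actually loaded is to call the `_analyze` endpoint and check that Chinese text comes back as multi-character tokens. A minimal Python sketch (the host address is the VM used throughout this post; adjust it to your deployment):

```python
import json
from urllib.parse import quote

ES = "http://192.168.159.159:9200"  # the VM used throughout this post

def analyze_url(text, analyzer="ik"):
    """Build a 1.x-style _analyze URL (text passed as a query parameter)."""
    return "%s/_analyze?analyzer=%s&text=%s" % (ES, analyzer, quote(text))

def token_strings(response_body):
    """Pull just the token strings out of an _analyze response."""
    return [t["token"] for t in json.loads(response_body)["tokens"]]

# Canned response for illustration; a live call should return CN_WORD tokens.
sample = '{"tokens": [{"token": "聯想", "start_offset": 0, "end_offset": 2, "type": "CN_WORD", "position": 1}]}'
print(token_strings(sample))  # ['聯想']
```

If ik is not registered, the same request fails instead of returning tokens, which is a faster diagnosis than waiting for a MapperParsingException at mapping time.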
準備工做:建立索引,錄入測試數據
First, some groundwork for the analyzer comparison below. My Elasticsearch runs on a VM at 192.168.159.159:9200, and I send HTTP requests directly with Chrome's Postman extension. Step one, create the `index1` index:
```
PUT http://192.168.159.159:9200/index1
{
    "settings": {
        "refresh_interval": "5s",
        "number_of_shards" : 1,   // one primary shard
        "number_of_replicas" : 0  // no replicas; can be added later
    },
    "mappings": {
        "_default_":{
            "_all": { "enabled": false }  // disable the _all field, since we only search title
        },
        "resource": {
            "dynamic": false,  // disable dynamic mapping
            "properties": {
                "title": {
                    "type": "string",
                    "index": "analyzed",
                    "fields": {
                        "cn": {
                            "type": "string",
                            "analyzer": "ik"
                        },
                        "en": {
                            "type": "string",
                            "analyzer": "english"
                        }
                    }
                }
            }
        }
    }
}
```
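One caveat: the `//` comments in the index-creation request are annotations only, not valid JSON, so strip them before sending unless your client tolerates them. A sketch that builds the same settings and mappings as a Python dict and serializes it to clean JSON:

```python
import json

# Same structure as the PUT body above, without the comment annotations.
index_body = {
    "settings": {
        "refresh_interval": "5s",
        "number_of_shards": 1,    # one primary shard
        "number_of_replicas": 0,  # no replicas; can be added later
    },
    "mappings": {
        "_default_": {"_all": {"enabled": False}},  # only title is searched
        "resource": {
            "dynamic": False,  # disable dynamic mapping
            "properties": {
                "title": {
                    "type": "string",
                    "index": "analyzed",
                    "fields": {
                        "cn": {"type": "string", "analyzer": "ik"},
                        "en": {"type": "string", "analyzer": "english"},
                    },
                }
            },
        },
    },
}

payload = json.dumps(index_body)  # valid JSON, ready to PUT to /index1
```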
For convenience, the `index1` index here has a single shard and no replicas. It contains a single type, `resource`, with a single field, `title`; that is all we need. `title` itself uses the standard analyzer, `title.cn` uses the ik analyzer, and `title.en` uses the built-in english analyzer. Next, batch-load the test data with the bulk API:
```
POST http://192.168.159.159:9200/_bulk
{ "create": { "_index": "index1", "_type": "resource", "_id": 1 } }
{ "title": "周星馳最新電影" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 2 } }
{ "title": "周星馳最好看的新電影" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 3 } }
{ "title": "周星馳最新電影,最好,新電影" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 4 } }
{ "title": "最最最最好的新新新新電影" }
{ "create": { "_index": "index1", "_type": "resource", "_id": 5 } }
{ "title": "I'm not happy about the foxes" }
```
Note that the bulk API requires each line, including the last, to end with a newline, otherwise it reports an error.
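The bulk body is newline-delimited JSON: one action line and one source line per document, terminated by a final newline. A sketch (a hypothetical helper of my own, not part of any client library) that assembles such a payload:

```python
import json

def bulk_body(index, doc_type, docs):
    """Build an NDJSON bulk payload: action line + source line per document,
    terminated by a final newline (required by the bulk API)."""
    lines = []
    for doc_id, source in docs:
        lines.append(json.dumps(
            {"create": {"_index": index, "_type": doc_type, "_id": doc_id}}))
        lines.append(json.dumps(source, ensure_ascii=False))
    return "\n".join(lines) + "\n"

body = bulk_body("index1", "resource", [(1, {"title": "周星馳最新電影"})])
print(body.endswith("\n"))  # True
```

Building the body programmatically like this makes it hard to forget the trailing newline that a hand-written request can silently drop.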
各類比較
1. Comparing the ik, chinese, and standard analyzers
```
POST http://192.168.159.159:9200/index1/_analyze?analyzer=ik
聯想召回筆記本電源線
```
The ik result:
```
{
    "tokens": [
        {
            "token": "聯想",
            "start_offset": 0,
            "end_offset": 2,
            "type": "CN_WORD",
            "position": 1
        },
        {
            "token": "召回",
            "start_offset": 2,
            "end_offset": 4,
            "type": "CN_WORD",
            "position": 2
        },
        {
            "token": "筆記本",
            "start_offset": 4,
            "end_offset": 7,
            "type": "CN_WORD",
            "position": 3
        },
        {
            "token": "電源線",
            "start_offset": 7,
            "end_offset": 10,
            "type": "CN_WORD",
            "position": 4
        }
    ]
}
```
The result from the built-in chinese and standard analyzers:
```
{
    "tokens": [
        {
            "token": "聯",
            "start_offset": 0,
            "end_offset": 1,
            "type": "<IDEOGRAPHIC>",
            "position": 1
        },
        {
            "token": "想",
            "start_offset": 1,
            "end_offset": 2,
            "type": "<IDEOGRAPHIC>",
            "position": 2
        },
        {
            "token": "召",
            "start_offset": 2,
            "end_offset": 3,
            "type": "<IDEOGRAPHIC>",
            "position": 3
        },
        {
            "token": "回",
            "start_offset": 3,
            "end_offset": 4,
            "type": "<IDEOGRAPHIC>",
            "position": 4
        },
        {
            "token": "筆",
            "start_offset": 4,
            "end_offset": 5,
            "type": "<IDEOGRAPHIC>",
            "position": 5
        },
        {
            "token": "記",
            "start_offset": 5,
            "end_offset": 6,
            "type": "<IDEOGRAPHIC>",
            "position": 6
        },
        {
            "token": "本",
            "start_offset": 6,
            "end_offset": 7,
            "type": "<IDEOGRAPHIC>",
            "position": 7
        },
        {
            "token": "電",
            "start_offset": 7,
            "end_offset": 8,
            "type": "<IDEOGRAPHIC>",
            "position": 8
        },
        {
            "token": "源",
            "start_offset": 8,
            "end_offset": 9,
            "type": "<IDEOGRAPHIC>",
            "position": 9
        },
        {
            "token": "線",
            "start_offset": 9,
            "end_offset": 10,
            "type": "<IDEOGRAPHIC>",
            "position": 10
        }
    ]
}
```
The conclusion needs no elaboration: for Chinese, the official analyzers are very weak.
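For CJK input, the standard analyzer's behaviour is roughly "one token per ideograph", which this toy splitter mimics (an illustration only, not the real Lucene tokenizer):

```python
def standard_like_cjk_split(text):
    """Toy model of how the standard analyzer treats CJK text:
    every ideograph becomes its own single-character token."""
    return [ch for ch in text if not ch.isspace()]

print(standard_like_cjk_split("聯想召回"))  # ['聯', '想', '召', '回']
```

That is why the test sentence above yields 10 single-character tokens: no dictionary is consulted, so word boundaries like 聯想 or 筆記本 are simply invisible to it.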
2. Searching for "最新" and "fox"
Test method:
```
POST http://192.168.159.159:9200/index1/resource/_search
{
    "query": {
        "multi_match": {
            "type": "most_fields",
            "query": "最新",
            "fields": [ "title", "title.cn", "title.en" ]
        }
    }
}
```
We vary the `query` and `fields` values to compare results.
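To keep the comparisons repeatable, the search body can be generated for any query/fields pair (a small helper of my own, not an Elasticsearch API):

```python
import json

def most_fields_query(query, fields):
    """Build the multi_match search body used throughout this section."""
    return {
        "query": {
            "multi_match": {
                "type": "most_fields",
                "query": query,
                "fields": fields,
            }
        }
    }

# The first comparison below: "最新" restricted to title.cn.
body = most_fields_query("最新", ["title.cn"])
print(json.dumps(body, ensure_ascii=False))
```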
1) Searching for "最新" restricted to `title.cn` (only the hits section is shown):
```
"hits": [
    {
        "_index": "index1",
        "_type": "resource",
        "_id": "1",
        "_score": 1.0537746,
        "_source": {
            "title": "周星馳最新電影"
        }
    },
    {
        "_index": "index1",
        "_type": "resource",
        "_id": "3",
        "_score": 0.9057159,
        "_source": {
            "title": "周星馳最新電影,最好,新電影"
        }
    },
    {
        "_index": "index1",
        "_type": "resource",
        "_id": "4",
        "_score": 0.5319481,
        "_source": {
            "title": "最最最最好的新新新新電影"
        }
    },
    {
        "_index": "index1",
        "_type": "resource",
        "_id": "2",
        "_score": 0.33246756,
        "_source": {
            "title": "周星馳最好看的新電影"
        }
    }
]
```
Searching for "最新" again, restricted to `title` and `title.en` (only the hits section is shown):
```
"hits": [
    {
        "_index": "index1",
        "_type": "resource",
        "_id": "4",
        "_score": 1,
        "_source": {
            "title": "最最最最好的新新新新電影"
        }
    },
    {
        "_index": "index1",
        "_type": "resource",
        "_id": "1",
        "_score": 0.75,
        "_source": {
            "title": "周星馳最新電影"
        }
    },
    {
        "_index": "index1",
        "_type": "resource",
        "_id": "3",
        "_score": 0.70710677,
        "_source": {
            "title": "周星馳最新電影,最好,新電影"
        }
    },
    {
        "_index": "index1",
        "_type": "resource",
        "_id": "2",
        "_score": 0.625,
        "_source": {
            "title": "周星馳最好看的新電影"
        }
    }
]
```
Conclusion: without ik Chinese analysis, "最新" is treated as two independent characters, so search precision is low.
2) Searching for "fox" restricted to `title` and `title.cn` returns nothing, because under those two analyzers "fox" and "foxes" are different tokens. Searching for "fox" again, restricted to `title.en`, gives:
```
"hits": [
    {
        "_index": "index1",
        "_type": "resource",
        "_id": "5",
        "_score": 0.9581454,
        "_source": {
            "title": "I'm not happy about the foxes"
        }
    }
]
```
Conclusion: the chinese and standard analyzers apply no processing to English words (singular/plural and so on), so recall is low.
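The english analyzer matches "fox" against "foxes" because it stems tokens before indexing. A deliberately naive sketch of the idea (the real analyzer uses a proper Porter-style stemmer, not this two-rule toy):

```python
def naive_stem(word):
    """Toy stemmer: collapse a couple of plural endings so that singular
    and plural index to the same token. Real English stemming (e.g. the
    Porter algorithm) handles far more cases."""
    w = word.lower()
    if w.endswith(("xes", "ses", "ches", "shes")):
        return w[:-2]
    if w.endswith("s") and not w.endswith("ss"):
        return w[:-1]
    return w

print(naive_stem("foxes"))  # fox
print(naive_stem("fox"))    # fox
```

Both query and document pass through the same analyzer, so "fox" and "foxes" land on the identical indexed term, which is exactly what the hit above demonstrates.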
My Best Configuration
In fact, the index created at the beginning is already the best configuration: add `cn` and `en` sub-fields under `title`, which gives decent results for Chinese, English, and whatever other languages come along. As described above, `title` uses the standard analyzer, `title.cn` uses the ik analyzer, and `title.en` uses the built-in english analyzer, and every search covers all three fields at once.
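Putting it together, every search goes out as a most_fields multi_match over the three sub-fields. Per-field weights can tilt ranking toward the ik tokens; the `^2` boost below is an illustrative value of mine, not part of the original setup:

```python
import json

def search_body(query):
    """most_fields query over the three analyzers configured above.
    The boost on title.cn is illustrative; tune it for your data."""
    return {
        "query": {
            "multi_match": {
                "type": "most_fields",
                "query": query,
                "fields": ["title", "title.cn^2", "title.en"],
            }
        }
    }

print(json.dumps(search_body("周星馳 fox"), ensure_ascii=False))
```

With most_fields, a document scores higher the more of the sub-fields it matches, so Chinese queries are carried by `title.cn`, English ones by `title.en`, and everything else still hits `title`.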