ElasticSearch Learning Series
ElasticSearch Step 2 - CRUD with Sense
ElasticSearch Step 5 - Working with ElasticSearch from C# on the .NET platform
Officially, elasticsearch only provides the smartcn Chinese analysis plugin, and its results are not great. Fortunately, medcl (one of the earliest people in China to work on es) has written two Chinese analysis plugins, one based on ik and one based on mmseg. Below I describe how to use the ik plugin.
When we create an index (here, the database db_news), the default analyzer that elasticsearch provides splits the text into individual Chinese characters, rather than segmenting it by keywords as we want. For example:
The query looks like this:
GET /db_news/_analyze?analyzer=standard
{
我愛北京天安門
}
The tokenization result is:
{
  "tokens": [
    { "token": "我", "start_offset": 6,  "end_offset": 7,  "type": "<IDEOGRAPHIC>", "position": 1 },
    { "token": "愛", "start_offset": 7,  "end_offset": 8,  "type": "<IDEOGRAPHIC>", "position": 2 },
    { "token": "北", "start_offset": 8,  "end_offset": 9,  "type": "<IDEOGRAPHIC>", "position": 3 },
    { "token": "京", "start_offset": 9,  "end_offset": 10, "type": "<IDEOGRAPHIC>", "position": 4 },
    { "token": "天", "start_offset": 10, "end_offset": 11, "type": "<IDEOGRAPHIC>", "position": 5 },
    { "token": "安", "start_offset": 11, "end_offset": 12, "type": "<IDEOGRAPHIC>", "position": 6 },
    { "token": "門", "start_offset": 12, "end_offset": 13, "type": "<IDEOGRAPHIC>", "position": 7 }
  ]
}
Normally this is not what we want. We would rather get tokens like 「我」, 「愛」, 「北京」, 「天安門」, and for that we need to install a Chinese analysis plugin; ik provides exactly this.
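The _analyze call above can also be scripted. The following is a minimal Python sketch, assuming a node at localhost:9200 exposing the 1.x-style _analyze endpoint used in this article, where the text to analyze is sent as the raw request body (the actual HTTP call is left commented out because it needs a live cluster):

```python
# -*- coding: utf-8 -*-
# Build the 1.x-style _analyze request shown above:
#   GET /<index>/_analyze?analyzer=<analyzer>
def build_analyze_request(index, analyzer, text, host="http://localhost:9200"):
    """Return (url, body) for an _analyze call against `index`."""
    url = "{0}/{1}/_analyze?analyzer={2}".format(host, index, analyzer)
    return url, text.encode("utf-8")

url, body = build_analyze_request("db_news", "standard", u"我愛北京天安門")
print(url)  # http://localhost:9200/db_news/_analyze?analyzer=standard

# To execute it against a running node:
# import requests
# tokens = requests.get(url, data=body).json()["tokens"]
```

Swapping "standard" for "ik" in the call above produces the word-level tokens shown later in this article.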
Installing the ik plugin
The first option is to download and configure the plugin yourself. This is fairly cumbersome (especially for Windows users), so I won't cover it here.
Download address: https://github.com/medcl/elasticsearch-analysis-ik
********************************************************************************************
The second option is to download the Chinese distribution of elasticsearch instead. Download address: https://github.com/medcl/elasticsearch-rtf. Install it and run elasticsearch again from that distribution.
Run the command:
GET /db_news/_analyze?analyzer=ik
{
我愛北京天安門啊王軍華
}
The result is:
{
  "tokens": [
    { "token": "我",     "start_offset": 6,  "end_offset": 7,  "type": "CN_CHAR", "position": 1 },
    { "token": "愛",     "start_offset": 7,  "end_offset": 8,  "type": "CN_CHAR", "position": 2 },
    { "token": "北京",   "start_offset": 8,  "end_offset": 10, "type": "CN_WORD", "position": 3 },
    { "token": "天安門", "start_offset": 10, "end_offset": 13, "type": "CN_WORD", "position": 4 },
    { "token": "啊",     "start_offset": 13, "end_offset": 14, "type": "CN_CHAR", "position": 5 },
    { "token": "王軍",   "start_offset": 14, "end_offset": 16, "type": "CN_WORD", "position": 6 },
    { "token": "華",     "start_offset": 16, "end_offset": 17, "type": "CN_CHAR", "position": 7 }
  ]
}
Notes on configuring analyzers
If we create the index directly, it will use the default analyzer, which is not what we want. If we then try to change the analyzer afterwards, we get an error like:
{
  "error": "IndexAlreadyExistsException[[db_news] already exists]",
  "status": 400
}
There is no way to resolve this conflict. The only option is to delete the existing index, create a new one, and specify a mapping that uses the new analyzer (this must be done before inserting any data, otherwise elasticsearch will use its default analyzer).
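The delete-then-recreate workflow can be sketched in Python. This is an illustrative snippet, not the article's original code: the mapping mirrors the PUT /db_news request below ("person" and "intro" are the article's example type and field), and the actual HTTP calls are commented out because they need a running cluster:

```python
# Rebuild an index with the ik analyzer before any data is inserted.
# The body mirrors the PUT /db_news request in this article.
index_body = {
    "mappings": {
        "person": {
            "dynamic": True,
            "properties": {
                "intro": {
                    "type": "string",
                    # 1.x-era option names used in this article; newer
                    # versions use "analyzer" / "search_analyzer" instead
                    "indexAnalyzer": "ik",
                    "searchAnalyzer": "ik",
                }
            },
        }
    }
}

# With a live cluster, the workflow would be roughly:
# import requests
# requests.delete("http://localhost:9200/db_news")              # drop old index
# requests.put("http://localhost:9200/db_news", json=index_body)  # recreate
print(index_body["mappings"]["person"]["properties"]["intro"]["indexAnalyzer"])
```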
The command to create the new index is:
PUT /db_news
{
  "settings": {
    "analysis": {
      "analyzer": {
        "stem": {
          "tokenizer": "standard",
          "filter": ["standard", "lowercase", "stop", "porter_stem"]
        }
      }
    }
  },
  "mappings": {
    "person": {
      "dynamic": true,
      "properties": {
        "intro": {
          "type": "string",
          "indexAnalyzer": "ik",
          "searchAnalyzer": "ik"
        }
      }
    }
  }
}
Check the newly created index:
GET /db_news/_mapping
The result is:
{
  "db_news": {
    "mappings": {
      "person": {
        "dynamic": "true",
        "properties": {
          "age":   { "type": "long" },
          "intro": { "type": "string", "analyzer": "ik" },
          "name":  { "type": "string" }
        }
      }
    }
  }
}
Updating a mapping
Note: db_news/news initially had no msgs field; it was added later, so the mapping must be updated first, and only then can data be inserted.
PUT /db_news/_mapping/news
{
  "properties": {
    "msgs": {
      "type": "string",
      "indexAnalyzer": "ik",
      "searchAnalyzer": "ik"
    }
  }
}
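Adding such a field programmatically might look like the following Python sketch. The helper is hypothetical, introduced here for illustration; the field name msgs and the analyzer options come from the request above, and the HTTP call is commented out since it needs a live node:

```python
# Build a _mapping update body that adds one ik-analyzed string field,
# mirroring the PUT /db_news/_mapping/news request in this article.
def new_string_field(name, analyzer="ik"):
    """Return a _mapping update body adding an analyzed string field."""
    return {
        "properties": {
            name: {
                "type": "string",
                "indexAnalyzer": analyzer,   # 1.x-era option names
                "searchAnalyzer": analyzer,
            }
        }
    }

body = new_string_field("msgs")
# import requests
# requests.put("http://localhost:9200/db_news/_mapping/news", json=body)
print(sorted(body["properties"]["msgs"]))
```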