For Chinese word segmentation in es, the mainstream recommendation is ik: it is simple to use, its author keeps it actively maintained, and it is arguably the best Chinese tokenizer in the Lucene ecosystem. But the text being indexed is often mixed, containing not only Chinese but also English, digits, and symbols. ik segments Chinese well, yet it handles English-plus-digit combinations poorly, and in practice strings like model numbers (letters followed by digits) could not be more common.
For example:
curl -XGET 'localhost:9200/_analyze?pretty' -H 'Content-Type: application/json' -d'
{
  "tokenizer" : "ik_max_word",
  "text" : "m123-test detailed output 一絲不掛 青絲變白髮"
}'
The result:
{
  "tokens" : [
    { "token" : "m123-test", "start_offset" : 0, "end_offset" : 9, "type" : "LETTER", "position" : 0 },
    { "token" : "m", "start_offset" : 0, "end_offset" : 1, "type" : "ENGLISH", "position" : 1 },
    { "token" : "123", "start_offset" : 1, "end_offset" : 4, "type" : "ARABIC", "position" : 2 },
    { "token" : "test", "start_offset" : 5, "end_offset" : 9, "type" : "ENGLISH", "position" : 3 },
    { "token" : "detailed", "start_offset" : 10, "end_offset" : 18, "type" : "ENGLISH", "position" : 4 },
    { "token" : "output", "start_offset" : 19, "end_offset" : 25, "type" : "ENGLISH", "position" : 5 },
    { "token" : "一絲不掛", "start_offset" : 26, "end_offset" : 30, "type" : "CN_WORD", "position" : 6 },
    { "token" : "一絲", "start_offset" : 26, "end_offset" : 28, "type" : "CN_WORD", "position" : 7 },
    { "token" : "一", "start_offset" : 26, "end_offset" : 27, "type" : "TYPE_CNUM", "position" : 8 },
    { "token" : "絲", "start_offset" : 27, "end_offset" : 28, "type" : "CN_WORD", "position" : 9 },
    { "token" : "不掛", "start_offset" : 28, "end_offset" : 30, "type" : "CN_WORD", "position" : 10 },
    { "token" : "掛", "start_offset" : 29, "end_offset" : 30, "type" : "CN_WORD", "position" : 11 },
    { "token" : "青絲", "start_offset" : 31, "end_offset" : 33, "type" : "CN_WORD", "position" : 12 },
    { "token" : "絲", "start_offset" : 32, "end_offset" : 33, "type" : "CN_WORD", "position" : 13 },
    { "token" : "變白", "start_offset" : 33, "end_offset" : 35, "type" : "CN_WORD", "position" : 14 },
    { "token" : "白髮", "start_offset" : 34, "end_offset" : 36, "type" : "CN_WORD", "position" : 15 },
    { "token" : "發", "start_offset" : 35, "end_offset" : 36, "type" : "CN_WORD", "position" : 16 }
  ]
}
Here the letters-and-digits string m123 gets split into m and 123, so when you search for m123, what actually gets matched is 123.
Using one of es's built-in tokenizers solves the letters + digits problem; take standard as an example:
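The request mirrors the ik one above, only swapping the tokenizer name (a sketch assuming the same local instance):

```
curl -XGET 'localhost:9200/_analyze?pretty' -H 'Content-Type: application/json' -d'
{
  "tokenizer" : "standard",
  "text" : "m123-test detailed output 一絲不掛 青絲變白髮"
}'
```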
{
  "tokens" : [
    { "token" : "m123", "start_offset" : 0, "end_offset" : 4, "type" : "<ALPHANUM>", "position" : 0 },
    { "token" : "test", "start_offset" : 5, "end_offset" : 9, "type" : "<ALPHANUM>", "position" : 1 },
    { "token" : "detailed", "start_offset" : 10, "end_offset" : 18, "type" : "<ALPHANUM>", "position" : 2 },
    { "token" : "output", "start_offset" : 19, "end_offset" : 25, "type" : "<ALPHANUM>", "position" : 3 },
    { "token" : "一", "start_offset" : 26, "end_offset" : 27, "type" : "<IDEOGRAPHIC>", "position" : 4 },
    { "token" : "絲", "start_offset" : 27, "end_offset" : 28, "type" : "<IDEOGRAPHIC>", "position" : 5 },
    { "token" : "不", "start_offset" : 28, "end_offset" : 29, "type" : "<IDEOGRAPHIC>", "position" : 6 },
    { "token" : "掛", "start_offset" : 29, "end_offset" : 30, "type" : "<IDEOGRAPHIC>", "position" : 7 },
    { "token" : "青", "start_offset" : 31, "end_offset" : 32, "type" : "<IDEOGRAPHIC>", "position" : 8 },
    { "token" : "絲", "start_offset" : 32, "end_offset" : 33, "type" : "<IDEOGRAPHIC>", "position" : 9 },
    { "token" : "變", "start_offset" : 33, "end_offset" : 34, "type" : "<IDEOGRAPHIC>", "position" : 10 },
    { "token" : "白", "start_offset" : 34, "end_offset" : 35, "type" : "<IDEOGRAPHIC>", "position" : 11 },
    { "token" : "發", "start_offset" : 35, "end_offset" : 36, "type" : "<IDEOGRAPHIC>", "position" : 12 }
  ]
}
Now m123 can be matched, but this trades one problem for another: the Chinese text is split into individual characters, so searching for 一 or 一掛 will both return hits.
You can't have it both ways. To get both you would either modify ik's segmentation logic, or add a separate field dedicated to the Chinese (or non-Chinese) part. I don't know how to do the former, so the latter it is.
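The extra-field approach can be done with es multi-fields: keep ik_max_word on the main field and attach a standard-analyzed sub-field. A minimal mapping sketch — the index name products, field name name, and sub-field name std are all placeholders, and the exact mapping shape depends on your es version:

```
curl -XPUT 'localhost:9200/products?pretty' -H 'Content-Type: application/json' -d'
{
  "mappings" : {
    "properties" : {
      "name" : {
        "type" : "text",
        "analyzer" : "ik_max_word",
        "fields" : {
          "std" : {
            "type" : "text",
            "analyzer" : "standard"
          }
        }
      }
    }
  }
}'
```

At query time, a multi_match over both name and name.std lets m123 match via the standard sub-field while Chinese phrases still match via ik on the main field.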