Tokenizing mixed Chinese, English, and numeric text in ES


For Chinese word segmentation in ES, ik is what everyone recommends: it is simple to use, its author keeps it actively updated, and it is arguably the best Chinese tokenizer in the Lucene ecosystem. The text we index, however, is often messy, containing not only Chinese but also English, digits, and assorted symbols. ik handles Chinese well, but it falls short on letter-digit combinations, and strings like model numbers (letters plus digits) could not be more common in practice.

For example:

curl -XGET 'localhost:9200/_analyze?pretty' -H 'Content-Type: application/json' -d'
{
 "tokenizer" : "ik_max_word",
 "text" : "m123-test detailed output 一絲不掛 青絲變白髮"
}
'

The result:

{
  "tokens" : [
    {
      "token" : "m123-test",
      "start_offset" : 0,
      "end_offset" : 9,
      "type" : "LETTER",
      "position" : 0
    },
    {
      "token" : "m",
      "start_offset" : 0,
      "end_offset" : 1,
      "type" : "ENGLISH",
      "position" : 1
    },
    {
      "token" : "123",
      "start_offset" : 1,
      "end_offset" : 4,
      "type" : "ARABIC",
      "position" : 2
    },
    {
      "token" : "test",
      "start_offset" : 5,
      "end_offset" : 9,
      "type" : "ENGLISH",
      "position" : 3
    },
    {
      "token" : "detailed",
      "start_offset" : 10,
      "end_offset" : 18,
      "type" : "ENGLISH",
      "position" : 4
    },
    {
      "token" : "output",
      "start_offset" : 19,
      "end_offset" : 25,
      "type" : "ENGLISH",
      "position" : 5
    },
    {
      "token" : "一絲不掛",
      "start_offset" : 26,
      "end_offset" : 30,
      "type" : "CN_WORD",
      "position" : 6
    },
    {
      "token" : "一絲",
      "start_offset" : 26,
      "end_offset" : 28,
      "type" : "CN_WORD",
      "position" : 7
    },
    {
      "token" : "一",
      "start_offset" : 26,
      "end_offset" : 27,
      "type" : "TYPE_CNUM",
      "position" : 8
    },
    {
      "token" : "絲",
      "start_offset" : 27,
      "end_offset" : 28,
      "type" : "CN_WORD",
      "position" : 9
    },
    {
      "token" : "不掛",
      "start_offset" : 28,
      "end_offset" : 30,
      "type" : "CN_WORD",
      "position" : 10
    },
    {
      "token" : "掛",
      "start_offset" : 29,
      "end_offset" : 30,
      "type" : "CN_WORD",
      "position" : 11
    },
    {
      "token" : "青絲",
      "start_offset" : 31,
      "end_offset" : 33,
      "type" : "CN_WORD",
      "position" : 12
    },
    {
      "token" : "絲",
      "start_offset" : 32,
      "end_offset" : 33,
      "type" : "CN_WORD",
      "position" : 13
    },
    {
      "token" : "變白",
      "start_offset" : 33,
      "end_offset" : 35,
      "type" : "CN_WORD",
      "position" : 14
    },
    {
      "token" : "白髮",
      "start_offset" : 34,
      "end_offset" : 36,
      "type" : "CN_WORD",
      "position" : 15
    },
    {
      "token" : "髮",
      "start_offset" : 35,
      "end_offset" : 36,
      "type" : "CN_WORD",
      "position" : 16
    }
  ]
}

Here the letter-digit combination m123 gets split into m and 123, so when you search for m123, what you actually end up matching is 123.

ES's built-in tokenizers can solve the letters-plus-digits problem. Take standard as an example:
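For reference, the output below can be reproduced with a request analogous to the earlier one, swapping the tokenizer for standard (this assumes the same local cluster on port 9200):

```shell
curl -XGET 'localhost:9200/_analyze?pretty' -H 'Content-Type: application/json' -d'
{
 "tokenizer" : "standard",
 "text" : "m123-test detailed output 一絲不掛 青絲變白髮"
}
'
```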

{
  "tokens" : [
    {
      "token" : "m123",
      "start_offset" : 0,
      "end_offset" : 4,
      "type" : "<ALPHANUM>",
      "position" : 0
    },
    {
      "token" : "test",
      "start_offset" : 5,
      "end_offset" : 9,
      "type" : "<ALPHANUM>",
      "position" : 1
    },
    {
      "token" : "detailed",
      "start_offset" : 10,
      "end_offset" : 18,
      "type" : "<ALPHANUM>",
      "position" : 2
    },
    {
      "token" : "output",
      "start_offset" : 19,
      "end_offset" : 25,
      "type" : "<ALPHANUM>",
      "position" : 3
    },
    {
      "token" : "一",
      "start_offset" : 26,
      "end_offset" : 27,
      "type" : "<IDEOGRAPHIC>",
      "position" : 4
    },
    {
      "token" : "絲",
      "start_offset" : 27,
      "end_offset" : 28,
      "type" : "<IDEOGRAPHIC>",
      "position" : 5
    },
    {
      "token" : "不",
      "start_offset" : 28,
      "end_offset" : 29,
      "type" : "<IDEOGRAPHIC>",
      "position" : 6
    },
    {
      "token" : "掛",
      "start_offset" : 29,
      "end_offset" : 30,
      "type" : "<IDEOGRAPHIC>",
      "position" : 7
    },
    {
      "token" : "青",
      "start_offset" : 31,
      "end_offset" : 32,
      "type" : "<IDEOGRAPHIC>",
      "position" : 8
    },
    {
      "token" : "絲",
      "start_offset" : 32,
      "end_offset" : 33,
      "type" : "<IDEOGRAPHIC>",
      "position" : 9
    },
    {
      "token" : "變",
      "start_offset" : 33,
      "end_offset" : 34,
      "type" : "<IDEOGRAPHIC>",
      "position" : 10
    },
    {
      "token" : "白",
      "start_offset" : 34,
      "end_offset" : 35,
      "type" : "<IDEOGRAPHIC>",
      "position" : 11
    },
    {
      "token" : "髮",
      "start_offset" : 35,
      "end_offset" : 36,
      "type" : "<IDEOGRAPHIC>",
      "position" : 12
    }
  ]
}

Now m123 can be found, but this creates a new problem: the Chinese text has been split into single characters, so even a search for 一掛 returns results.
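To see why 一掛 matches, consider a query like the following against a field analyzed with standard (the index and field names here are hypothetical). The query string 一掛 is itself tokenized into the single characters 一 and 掛, and since both characters appear among the indexed single-character tokens, the document matches even though the original text never contains 一掛 as a phrase:

```shell
curl -XGET 'localhost:9200/my_index/_search?pretty' -H 'Content-Type: application/json' -d'
{
  "query" : {
    "match" : { "content" : "一掛" }
  }
}
'
```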

You can't have it both ways, alas. To get both you would either have to modify ik's tokenization itself, or add an extra field dedicated to the Chinese (or non-Chinese) part. I don't know how to do the former, so the latter it is.
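One way to sketch the extra-field approach is a multi-field mapping: the main field is analyzed with ik_max_word for Chinese, and a sub-field is analyzed with standard so that letter-digit terms like m123 stay intact. The index name, field names, and mapping shape here are illustrative (and the exact mapping syntax varies between ES versions):

```json
curl -XPUT 'localhost:9200/my_index?pretty' -H 'Content-Type: application/json' -d'
{
  "mappings" : {
    "properties" : {
      "content" : {
        "type" : "text",
        "analyzer" : "ik_max_word",
        "fields" : {
          "std" : {
            "type" : "text",
            "analyzer" : "standard"
          }
        }
      }
    }
  }
}
'
```

At query time you would then search both fields, for example with a multi_match over content and content.std, so that Chinese phrases hit the ik-analyzed field while model numbers hit the standard-analyzed sub-field.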
