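Elasticsearch: modifying the pinyin analyzer source so that mixed Chinese / full-pinyin / first-letter search does not match homophones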

Date: 2020-11-26


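Pinyin matching comes up often in business search, and most people reach for the pinyin analyzer to get it. The pinyin analyzer has one drawback, though: it also matches homophones, which is sometimes not what the business wants.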

Scenario: I search for "純生pi酒", and the index contains the following documents:

doc[1]:{"name":"純生啤酒"}

doc[2]:{"name":"春生啤酒"}

doc[3]:{"name":"純生劈酒"}

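When I enter "純生pi酒", the business expects only doc[1]:{"name":"純生啤酒"} and doc[3]:{"name":"純生劈酒"} to come back. Since I have already typed "純生", only documents containing "純生" should be returned (although in plenty of other cases you would want "春生" returned as well). With the stock pinyin analyzer, doc[2] is returned too, because the analyzer turns doc[2] into: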

{
  "tokens": [
    {
      "token": "c",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "chun",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "s",
      "start_offset": 1,
      "end_offset": 2,
      "type": "word",
      "position": 1
    },
    {
      "token": "sheng",
      "start_offset": 1,
      "end_offset": 2,
      "type": "word",
      "position": 1
    },
    {
      "token": "p",
      "start_offset": 2,
      "end_offset": 3,
      "type": "word",
      "position": 2
    },
    {
      "token": "pi",
      "start_offset": 2,
      "end_offset": 3,
      "type": "word",
      "position": 2
    },
    {
      "token": "j",
      "start_offset": 3,
      "end_offset": 4,
      "type": "word",
      "position": 3
    },
    {
      "token": "jiu",
      "start_offset": 3,
      "end_offset": 4,
      "type": "word",
      "position": 3
    }
  ]
}

Because "純生" and "春生" are homophones, doc[1] and doc[2] produce identical tokens, so it is no surprise that doc[2] matches. How do we fix this?

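What we actually need is this: when search text arrives (it may mix Chinese and pinyin), the [Chinese] parts should match as [Chinese] and the [pinyin] parts should match as [pinyin]. That removes homophone matches for anything typed in Chinese. So at index time we emit three kinds of token for every character (full pinyin, first letter, and the Chinese character itself), and at search time we split Chinese input into single characters and tokenize Latin input as pinyin. For example: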

At index time, "純生啤酒" is tokenized as:

Index-time tokens:
{
  "tokens": [
    {
      "token": "c",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "chun",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "純",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "s",
      "start_offset": 1,
      "end_offset": 2,
      "type": "word",
      "position": 1
    },
    {
      "token": "sheng",
      "start_offset": 1,
      "end_offset": 2,
      "type": "word",
      "position": 1
    },
    {
      "token": "生",
      "start_offset": 1,
      "end_offset": 2,
      "type": "word",
      "position": 1
    },
    {
      "token": "p",
      "start_offset": 2,
      "end_offset": 3,
      "type": "word",
      "position": 2
    },
    {
      "token": "pi",
      "start_offset": 2,
      "end_offset": 3,
      "type": "word",
      "position": 2
    },
    {
      "token": "啤",
      "start_offset": 2,
      "end_offset": 3,
      "type": "word",
      "position": 2
    },
    {
      "token": "j",
      "start_offset": 3,
      "end_offset": 4,
      "type": "word",
      "position": 3
    },
    {
      "token": "jiu",
      "start_offset": 3,
      "end_offset": 4,
      "type": "word",
      "position": 3
    },
    {
      "token": "酒",
      "start_offset": 3,
      "end_offset": 4,
      "type": "word",
      "position": 3
    }
  ]
}

Searching for "純生pi酒" is tokenized as:

Search-time tokens:
{
  "tokens": [
    {
      "token": "純",
      "start_offset": 0,
      "end_offset": 1,
      "type": "word",
      "position": 0
    },
    {
      "token": "生",
      "start_offset": 1,
      "end_offset": 2,
      "type": "word",
      "position": 1
    },
    {
      "token": "pi",
      "start_offset": 2,
      "end_offset": 4,
      "type": "word",
      "position": 2
    },
    {
      "token": "酒",
      "start_offset": 4,
      "end_offset": 5,
      "type": "word",
      "position": 3
    }
  ]
}

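This way only documents containing the characters "純", "生" and "酒" match, and the document with "春" does not. With the approach settled, the next step is the implementation.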
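To make the idea concrete, here is a minimal sketch in plain Java with no Elasticsearch involved. The toy PINYIN table, the class name and both method names are invented for this illustration; the real plugin derives pinyin from a dictionary and Elasticsearch does the phrase matching. The sketch only shows why a query token "純" can no longer line up with an indexed "春".

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

class MixedMatchSketch {
    // Hypothetical toy pinyin table; the real plugin uses a full dictionary.
    static final Map<Character, String> PINYIN = new HashMap<>();
    static {
        PINYIN.put('純', "chun");
        PINYIN.put('春', "chun");
        PINYIN.put('生', "sheng");
        PINYIN.put('啤', "pi");
        PINYIN.put('劈', "pi");
        PINYIN.put('酒', "jiu");
    }

    // Index time: each position holds the first letter, the full pinyin
    // and the Chinese character itself.
    static List<Set<String>> indexTokens(String text) {
        List<Set<String>> positions = new ArrayList<>();
        for (char c : text.toCharArray()) {
            Set<String> tokens = new LinkedHashSet<>();
            String pinyin = PINYIN.get(c);
            if (pinyin != null) {
                tokens.add(pinyin.substring(0, 1));
                tokens.add(pinyin);
            }
            tokens.add(String.valueOf(c));
            positions.add(tokens);
        }
        return positions;
    }

    // Search time: the query arrives as one token per position. It matches if,
    // for some starting offset, every query token is present at its position.
    static boolean phraseMatches(List<String> queryTokens, List<Set<String>> doc) {
        for (int start = 0; start + queryTokens.size() <= doc.size(); start++) {
            boolean ok = true;
            for (int i = 0; i < queryTokens.size() && ok; i++) {
                ok = doc.get(start + i).contains(queryTokens.get(i));
            }
            if (ok) {
                return true;
            }
        }
        return false;
    }
}
```

With the query tokens 純 / 生 / pi / 酒, the documents 純生啤酒 and 純生劈酒 match while 春生啤酒 does not. An all-pinyin query such as chun / sheng / pi / jiu still matches all three, which is the behaviour wanted for pinyin input.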

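The current Elasticsearch pinyin analyzer cannot split out Chinese characters and keep them as tokens, so its source has to be modified to add that feature (the pinyin analyzer used: https://github.com/medcl/elasticsearch-analysis-pinyin).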

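The source can be downloaded from the address above. Roughly how it works: it calls an NLP toolkit ( https://github.com/NLPchina) to convert the input text to pinyin, so "純生pi酒" becomes the array ["chun","sheng",null,null,"jiu"]. (As an aside, this toolkit resolves pinyin per phrase rather than per character, so "廈/門" becomes "xia/men" rather than "sha/men", which is genuinely useful. It also ships other tools, such as Simplified/Traditional conversion, that are worth a look.) The Latin letters and digits are then collected into a buffer and matched a second time: forward maximum matching and reverse maximum matching are both run and the better result is kept (both are standard segmentation techniques). The source code is as follows: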

// Run forward and reverse maximum matching and take the shorter result as the best one

List<String> forward = positiveMaxMatch(pinyinText, PINYIN_MAX_LENGTH);
if (forward.size() == 1) { // forward matching produced a single token, so reverse matching is unnecessary
    pinyinList.addAll(forward);
} else {
    // Otherwise also run reverse maximum matching and keep whichever list is shorter
    List<String> backward = reverseMaxMatch(pinyinText, PINYIN_MAX_LENGTH);
    if (forward.size() <= backward.size()) {
        pinyinList.addAll(forward);
    } else {
        pinyinList.addAll(backward);
    }
}

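As for the pinyin dictionary: because the number of pinyin syllables is small, the pinyin source stores them in a HashSet rather than in a trie like the one ik uses. (Forward and reverse maximum matching are widely documented, so I will not cover them here.)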
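For readers who have not met the two techniques, the following is a minimal sketch of forward and reverse maximum matching over a HashSet. The method names are borrowed from the snippet above, but the bodies, the tiny SYLLABLES set and the value of PINYIN_MAX_LENGTH are assumptions made for this illustration, not the plugin's actual code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

class MaxMatchSketch {
    // Toy syllable dictionary (assumption); the plugin loads the full pinyin alphabet.
    static final Set<String> SYLLABLES =
            new HashSet<>(Arrays.asList("liu", "de", "hua", "pi", "jiu"));
    // Longest syllable length to try (assumed value).
    static final int PINYIN_MAX_LENGTH = 6;

    // Forward: at each position take the longest dictionary entry that starts there.
    static List<String> positiveMaxMatch(String text, int maxLength) {
        List<String> out = new ArrayList<>();
        int start = 0;
        while (start < text.length()) {
            int end = Math.min(text.length(), start + maxLength);
            while (end > start + 1 && !SYLLABLES.contains(text.substring(start, end))) {
                end--;
            }
            out.add(text.substring(start, end)); // falls back to a single character
            start = end;
        }
        return out;
    }

    // Reverse: at each position take the longest dictionary entry that ends there.
    static List<String> reverseMaxMatch(String text, int maxLength) {
        List<String> out = new ArrayList<>();
        int end = text.length();
        while (end > 0) {
            int start = Math.max(0, end - maxLength);
            while (start < end - 1 && !SYLLABLES.contains(text.substring(start, end))) {
                start++;
            }
            out.add(text.substring(start, end));
            end = start;
        }
        Collections.reverse(out); // tokens were collected back to front
        return out;
    }

    // Same selection rule as the snippet above: prefer the shorter token list.
    static List<String> segment(String text) {
        List<String> forward = positiveMaxMatch(text, PINYIN_MAX_LENGTH);
        if (forward.size() == 1) {
            return forward;
        }
        List<String> backward = reverseMaxMatch(text, PINYIN_MAX_LENGTH);
        return forward.size() <= backward.size() ? forward : backward;
    }
}
```

With this toy dictionary, segment("liudehua") yields liu, de, hua. A HashSet is enough here because each candidate substring is looked up whole, so no prefix traversal is needed.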

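That covers the basics. For our requirement the Latin/digit matching logic can be left alone; only the logic around Chinese-to-pinyin conversion needs to change.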

First, write a utility class or method that splits out the Chinese characters:

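import java.util.ArrayList;
import java.util.Collections;
import java.util.LinkedList;
import java.util.List;

// StringUtil is assumed to be the one shipped with nlp-lang.
import org.nlpcn.commons.lang.util.StringUtil;

public class ChineseUtil {
    /**
     * First code point of the Chinese character range
     */
    public static final char CJK_UNIFIED_IDEOGRAPHS_START = '\u4E00';
    /**
     * Last code point of the Chinese character range
     */
    public static final char CJK_UNIFIED_IDEOGRAPHS_END = '\u9FA5';

    /**
     * Returns one entry per input char: the character itself if it is
     * Chinese, otherwise null, so the list lines up with the pinyin list.
     */
    public static List<String> segmentChinese(String str) {
        if (StringUtil.isBlank(str)) {
            return Collections.emptyList();
        }

        List<String> lists = str.length() <= 32767 ? new ArrayList<>(str.length()) : new LinkedList<>();
        for (int i = 0; i < str.length(); i++) {
            char c = str.charAt(i);
            if (c >= CJK_UNIFIED_IDEOGRAPHS_START && c <= CJK_UNIFIED_IDEOGRAPHS_END) {
                lists.add(String.valueOf(c));
            } else {
                lists.add(null);
            }
        }
        return lists;
    }
}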

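The start and end code points of the Chinese character range can be found in the NLP toolkit's source (PinyinUtil) or with a quick web search. Next, add a Chinese-splitting option (keep_separate_chinese, the key used in the index settings further down) to the PinyinConfig class in the pinyin source.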
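The original listing for this step is not available, so the following is only a sketch of what such an option could look like. The class name, the field name keepSeparateChinese and the Map used as a stand-in for Elasticsearch's settings object are all assumptions; only the setting key keep_separate_chinese comes from the index settings in this article.

```java
import java.util.HashMap;
import java.util.Map;

class PinyinConfigSketch {
    // An existing option, shown for context.
    boolean keepFullPinyin = true;
    // New option (hypothetical field name): also emit each Chinese character
    // as its own token. Off by default so existing indices behave as before.
    boolean keepSeparateChinese = false;

    // The Map is a stand-in for the plugin's real settings object.
    PinyinConfigSketch(Map<String, String> settings) {
        this.keepFullPinyin = getAsBoolean(settings, "keep_full_pinyin", true);
        this.keepSeparateChinese = getAsBoolean(settings, "keep_separate_chinese", false);
    }

    // Reads a boolean setting, falling back to the default when the key is absent.
    static boolean getAsBoolean(Map<String, String> settings, String key, boolean defaultValue) {
        String value = settings.get(key);
        return value == null ? defaultValue : Boolean.parseBoolean(value);
    }
}
```

Keeping the default at false means the new behaviour only appears on tokenizers that opt in explicitly.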

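Defaulting it to false is fine. Two classes then need to change, PinyinTokenFilter and PinyinTokenizer. These are the main analysis classes and correspond to the filter and tokenizer in Elasticsearch analysis.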

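The change is the same in both classes, so I will cover just one. First, update the validation in the constructor so that it takes the newly added option into account.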
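The constructor listing is not available either, so this is a guess at the shape of the check rather than the plugin's code. It assumes the stock validation rejects a configuration in which every keep_* output is disabled. The search-side tokenizer in this article turns off first letter, separate first letter and full pinyin, so the new option has to count as a valid output. The class name, parameter names and exception type are hypothetical.

```java
class ConfigValidationSketch {
    // Throws if no option would produce any token for Chinese input.
    // keepSeparateChinese is the newly added option.
    static void validate(boolean keepFirstLetter,
                         boolean keepSeparateFirstLetter,
                         boolean keepFullPinyin,
                         boolean keepSeparateChinese) {
        boolean producesTokens = keepFirstLetter
                || keepSeparateFirstLetter
                || keepFullPinyin
                || keepSeparateChinese;
        if (!producesTokens) {
            throw new IllegalArgumentException(
                    "pinyin config error: at least one keep_* option must be enabled");
        }
    }
}
```

Without the extra term in the condition, a tokenizer that keeps only the separated Chinese characters would be rejected when it is created.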

Then modify the class's readTerm() method.
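Since the modified readTerm() is not shown, here is a simplified, self-contained sketch of the kind of change involved. It is not the plugin's method: the class, the method name candidates and the token@position output format are invented for illustration. It takes the pinyin list and the list from segmentChinese as parallel inputs and, at each Chinese position, emits the original character next to the pinyin tokens when the new option is on.

```java
import java.util.ArrayList;
import java.util.List;

class ReadTermSketch {
    // pinyinList and chineseList are parallel: index i describes input char i,
    // and both hold null where the input char is not Chinese.
    static List<String> candidates(List<String> pinyinList,
                                   List<String> chineseList,
                                   boolean keepSeparateFirstLetter,
                                   boolean keepFullPinyin,
                                   boolean keepSeparateChinese) {
        List<String> out = new ArrayList<>();
        int position = -1;
        for (int i = 0; i < pinyinList.size(); i++) {
            String pinyin = pinyinList.get(i);
            if (pinyin == null) {
                // Letters and digits go through the existing buffer logic,
                // which this sketch leaves out.
                continue;
            }
            position++;
            if (keepSeparateFirstLetter) {
                out.add(pinyin.charAt(0) + "@" + position);
            }
            if (keepFullPinyin) {
                out.add(pinyin + "@" + position);
            }
            if (keepSeparateChinese) {
                // New branch: the character shares the position of its pinyin.
                out.add(chineseList.get(i) + "@" + position);
            }
        }
        return out;
    }
}
```

For "純生啤酒" with all three flags on, this produces the first letter, full pinyin and character at each of positions 0 to 3, in the same order as the index-time tokens listed earlier. With only keepSeparateChinese on, it produces just the four characters.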

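Once both classes are changed, the source modification is done and the plugin has to be repackaged; running mvn install is enough. You get elasticsearch-analysis-pinyin-5.6.4.jar (download a release version of the source to modify, and make sure it matches your Elasticsearch version). Take nlp-lang-1.7.jar from the source's lib directory, and add plugin-descriptor.properties from the resources (it defines the plugin version, the plugin class and so on; the easiest way is to unpack a working release of the pinyin plugin and follow its configuration).

Put these three files in one folder. That folder is the packaged plugin; name it whatever you like, drop it into Elasticsearch's plugins directory, and the change is complete.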

What remains is the index settings and mapping. As described at the start, the idea is to give search_analyzer and analyzer different values:

PUT /test_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "pinyin_chinese_analyzer": {
          "tokenizer": "pinyin_chinese_tokenizer"
        },
        "pinyin_analyzer": {
          "tokenizer": "pinyin_tokenizer"
        }
      }, 
      "tokenizer": {
        "pinyin_chinese_tokenizer": {
          "type": "pinyin",
          "keep_first_letter": false,
          "keep_separate_first_letter": false,
          "keep_full_pinyin":false,
          "keep_original":false,
          "limit_first_letter_length":50,
          "keep_separate_chinese": true,
          "lowercase":true
          
        },
        "pinyin_tokenizer": {
          "type": "pinyin",
          "keep_first_letter": false,
          "keep_separate_first_letter": true,
          "keep_full_pinyin":true,
          "keep_original":false,
          "limit_first_letter_length":50,
          "keep_separate_chinese": true,
          "lowercase":true
        }
      }
    }
  }
  , "mappings": {
    "indexType":{
      "properties": {
        "name":{
          "type": "text",
          "search_analyzer": "pinyin_chinese_analyzer",
          "analyzer": "pinyin_analyzer"
        }
      }
    }
  }
}

Query with match_phrase (for how it works, see my article http://www.javashuo.com/article/p-ghfvagog-nx.html); other query types work too, depending on your business needs.

A simple verification follows. The index contains these documents: doc[1]:{"name": "雪花純生啤酒200ml"} | doc[2]:{"name": "雪花純爽啤酒200ml"} | doc[3]:{"name": "雪花春生啤酒200ml"}

Query:
GET /test_index/_search
{
  "query": {
    "match_phrase": {
      "name": "xuehcs"
    }
  }
}
Result:
"hits": [
      {
        "_index": "test_index",
        "_type": "indexType",
        "_id": "2","_source": {
          "name": "雪花純爽啤酒200ml"
        }
      },
      {
        "_index": "test_index",
        "_type": "indexType",
        "_id": "1","_source": {
          "name": "雪花純生啤酒200ml"
        }
      },
      {
        "_index": "test_index",
        "_type": "indexType",
        "_id": "3","_source": {
          "name": "雪花春生啤酒200ml"
        }
      }
    ]
Query:
GET /test_index/_search
{
  "query": {
    "match_phrase": {
      "name": "xueh純生"
    }
  }
}
Result:
"hits": [
      {
        "_index": "test_index",
        "_type": "indexType",
        "_id": "1","_source": {
          "name": "雪花純生啤酒200ml"
        }
      }
    ]

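Summary: the solution is not complicated. Before modifying the source I did consider other options, such as using standard or ik as the tokenizer with pinyin as the filter, but each had problems. With standard, the text has already been split into single characters, so a polyphonic word like "廈門" becomes "shamen" instead of "xiamen". With ik, match_phrase is hard to control and the results depend on the dictionary. That is why I ended up modifying the source to add the feature. If you know a better approach, please share it.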

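Note: when configuring the tokenizer, the noneChinesePinyinTokenize and keepNoneChinese properties default to true, so non-Chinese characters such as letters and digits are split as pinyin (liudehua12 -> liu,de,hua,12). If you have changed these properties, take care: both should be set to true.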

[Note]: Elasticsearch version 5.6.4