Elasticsearch Custom Analyzers: Integrating the Jieba Analysis Plugin

About the jieba Elasticsearch plugin:

https://github.com/huaban/elasticsearch-analysis-jieba

The plugin is developed by huaban and supports Elasticsearch versions <= 2.3.5.

The jieba analyzers

The jieba plugin provides three analyzers: jieba_index, jieba_search, and jieba_other.

  1. jieba_index: used at index time; fine-grained segmentation;
  2. jieba_search: used at search time; coarse-grained segmentation;
  3. jieba_other: full-width to half-width conversion, lowercasing, and per-character segmentation;

Using the jieba_index or jieba_search analyzer gives basic word segmentation out of the box.

A minimal configuration example:

{
    "mappings": {
        "test": {
            "_all": {
                "enabled": false
            },
            "properties": {
                "name": {
                    "type": "string",
                    "analyzer": "jieba_index",
                    "search_analyzer": "jieba_index"
                }
            }
        }
    }
}

In a production environment, business requirements make the following features necessary:

  1. synonym support;
  2. character filter support;

The jieba_index and jieba_search analyzers provided by the plugin cannot implement these features on their own.

Custom analyzers

When the jieba_index and jieba_search analyzers do not meet production needs, we can use custom analyzers to solve the problems above.

An analyzer is composed of character filters, a tokenizer, and token filters.

A single analyzer may contain any number of character filters, exactly one tokenizer, and any number of token filters.
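As a plain-Python sketch (not Elasticsearch code), the three stages can be illustrated like this; the mappings and stopwords here are only illustrative:

```python
# Sketch of the analyzer pipeline: char filters -> tokenizer -> token filters.

def char_filter(text):
    # Mapping char filter: replace strings before tokenization.
    for src, dst in [("c#", "csharp"), ("c++", "cplus")]:
        text = text.replace(src, dst)
    return text

def tokenizer(text):
    # Whitespace tokenizer: split on whitespace.
    return text.split()

def token_filters(tokens):
    # Stop token filter: drop stopwords.
    stopwords = {"and", "is", "the"}
    return [t for t in tokens if t not in stopwords]

def analyze(text):
    return token_filters(tokenizer(char_filter(text)))

print(analyze("c# and c++ is the best"))  # ['csharp', 'cplus', 'best']
```

Each stage feeds the next, which is exactly how Elasticsearch chains the components configured below.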

Our business requires a mapping character filter to replace certain strings before tokenization, e.g. replacing the user input c# with csharp and c++ with cplus.

The analyzer components are introduced one by one below.

1. The mapping character filter (Mapping Char Filter)

This is Elasticsearch's built-in mapping character filter, defined under settings -> analysis -> char_filter:

PUT /my_index
{
    "settings": {
        "analysis": {
            "char_filter": {
                "mapping_filter": {
                    "type": "mapping",
                    "mappings": [
                      "c# => csharp",
                      "c++ => cplus"
                  ]
                }
            }
        }
    }
}

The character mappings can also be loaded from a file:

PUT /my_index
{
    "settings": {
        "analysis": {
            "char_filter": {
                "mapping_filter": {
                    "type": "mapping",
                    "mappings_path": "mappings.txt"
                }
            }
        }
    }
}

By default the file lives in the config directory, i.e. config/mappings.txt.
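The file uses one `source => target` rule per line, the same syntax as the inline `mappings` array. A sample mappings.txt for the replacements above:

```text
c# => csharp
c++ => cplus
```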

2. The jieba token filter (JiebaTokenFilter)

JiebaTokenFilter accepts a SegMode parameter with two possible values: Index and Search.

We predefine two token filters, jieba_index_filter and jieba_search_filter:

PUT /my_index
{
    "settings": {
        "analysis": {
            "filter": {
                "jieba_index_filter": {
                    "type": "jieba",
                    "seg_mode": "index"
                },
                "jieba_search_filter": {
                    "type": "jieba",
                    "seg_mode": "search"
                }
            }
        }
    }
}

These two token filters will be used in the index analyzer and the search analyzer, respectively.

3. The stop token filter

The JiebaTokenFilter does not handle stopwords, so when defining custom analyzers we need a stopword token filter.

Elasticsearch ships with a stop token filter, which can be defined like this:

PUT /my_index
{
    "settings": {
        "analysis": {
            "filter": {
                "stop_filter": {
                    "type":       "stop",
                    "stopwords": ["and", "is", "the"]
                }
            }
        }
    }
}

The stopword list can also be loaded from a file:

PUT /my_index
{
    "settings": {
        "analysis": {
            "filter": {
                "stop_filter": {
                    "type": "stop",
                    "stopwords_path": "stopwords.txt"
                }
            }
        }
    }
}

By default the file lives in the config directory, i.e. config/stopwords.txt.
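The stopwords file holds one stopword per line. A sample stopwords.txt matching the inline list above:

```text
and
is
the
```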

4. The synonym token filter

We use Elasticsearch's built-in synonym token filter to implement synonyms.

PUT /my_index
{
    "settings": {
        "analysis": {
            "filter": {
                "synonym_filter": {
                    "type": "synonym",
                    "stopwords": [
                      "中文,漢語,漢字"
                  ]
                }
            }
        }
    }
}

When the synonym list is large, loading it from a file is recommended.

PUT /my_index
{
    "settings": {
        "analysis": {
            "filter": {
                "synonym_filter ": {
                    "type": "synonym",
                    "stopwords_path": "synonyms.txt"
                }
            }
        }
    }
}
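The synonyms file uses Solr synonym syntax: comma-separated words on one line are treated as equivalent, and `a => b` rewrites a to b. A sample synonyms.txt with one equivalence group:

```text
中文,漢語,漢字
```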

5. Redefining the jieba_index and jieba_search analyzers

Elasticsearch supports chaining multiple analysis stages. We use the whitespace tokenizer as the tokenizer, and add the jieba token filters defined above, jieba_index_filter and jieba_search_filter, to the token-filter chain.

PUT /my_index
{
    "settings": {
        "analysis": {
            "analyzer": {
                "jieba_index": {
                    "char_filter": [
                      "mapping_filter"
                  ],
                    "tokenizer": "whitespace",
                    "filter": [
                      "jieba_index_filter",
                      "stop_filter",
                      "synonym_filter"
                  ]
                },
                "jieba_search": {
                    "char_filter": [
                      "mapping_filter"
                  ],
                    "tokenizer": "whitespace",
                    "filter": [
                      "jieba_search_filter",
                      "stop_filter",
                      "synonym_filter"
                  ]
                }
            }
        }
    }
}

Note that the analyzers above are still named jieba_index and jieba_search, so that they override the analyzers provided by the jieba plugin.

When multiple analyzers share the same name, Elasticsearch prefers the one defined in the index settings.

This way, no changes are needed in the calling code.

The complete configuration is as follows:

PUT /my_index
{
    "settings": {
        "analysis": {
            "char_filter": {
                "mapping_filter": {
                    "type": "mapping",
                  "mappings_path": "mappings.txt"
                }
            },
            "filter": {
                "synonym_filter": {
                    "type": "synonym",
                    "synonyms_path": "synonyms.txt"
                },
                "stop_filter": {
                    "type": "stop",
                    "stopwords_path": "stopwords.txt"
                },
                "jieba_index_filter": {
                    "type": "jieba",
                    "seg_mode": "index"
                },
                "jieba_search_filter": {
                    "type": "jieba",
                    "seg_mode": "search"
                }
            },
            "analyzer": {
                "jieba_index": {
                    "char_filter": [
                      "mapping_filter"
                  ],
                    "tokenizer": "whitespace",
                    "filter": [
                      "jieba_index_filter",
                      "stop_filter",
                      "synonym_filter"
                  ]
                },
                "jieba_search": {
                    "char_filter": [
                      "mapping_filter"
                  ],
                    "tokenizer": "whitespace",
                    "filter": [
                      "jieba_search_filter",
                      "stop_filter",
                      "synonym_filter"
                  ]
                }
            }
        }
    }
}
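Once the index is created, the custom analyzers can be sanity-checked with the _analyze API (the request-body form shown here is supported in Elasticsearch 2.x); the sample text is only illustrative:

```text
GET /my_index/_analyze
{
    "analyzer": "jieba_index",
    "text": "c# 和 c++ 都是編程語言"
}
```

The response lists the emitted tokens, so you can verify that c# was mapped to csharp, stopwords were dropped, and synonyms were expanded.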

References:

https://www.elastic.co/guide/en/elasticsearch/reference/2.3/index.html

http://www.tuicool.com/articles/eUJJ3qF
