HanLP Chinese Word Segmentation Plugin for Lucene

Date: 2019-11-06



Built on HanLP, this plugin supports any Lucene (7.x) based system, including Solr (7.x).

Maven

<dependency>
    <groupId>com.hankcs.nlp</groupId>
    <artifactId>hanlp-lucene-plugin</artifactId>
    <version>1.1.6</version>
</dependency>

Solr Quick Start

1. Place two JARs, hanlp-portable.jar and hanlp-lucene-plugin.jar, in ${webapp}/WEB-INF/lib. (Alternatively, build the source with mvn package and copy target/hanlp-lucene-plugin-x.x.x.jar to ${webapp}/WEB-INF/lib.)

2. Edit the Solr core's configuration file ${core}/conf/schema.xml:

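<fieldType name="text_cn" class="solr.TextField">
    <analyzer type="index">
        <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory" enableIndexMode="true"/>
    </analyzer>
    <analyzer type="query">
        <tokenizer class="com.hankcs.lucene.HanLPTokenizerFactory" enableIndexMode="false"/>
    </analyzer>
</fieldType>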

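· If your application has other fields, such as location or summary, each of them must also be given type="text_cn". Keep this in mind; otherwise those fields will still use Solr's default tokenizer.

· Also, never enable indexMode in the query analyzer, or it will interfere with PhraseQuery. indexMode only needs to be enabled once, in the index analyzer.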

Advanced Configuration

The plugin currently supports the following schema.xml-based options:

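More advanced settings are configured mainly through hanlp.properties on the class path. See the HanLP natural language processing package documentation for the related options, such as:

0. User dictionaries

1. Part-of-speech tagging

2. Simplified/Traditional Chinese conversion

3. …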
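Since hanlp.properties is an ordinary Java properties file, it can be inspected with java.util.Properties. The sketch below only illustrates the file format: the keys root and CustomDictionaryPath are assumed from HanLP's stock configuration file, the values are made up, and the class HanlpPropertiesSketch is hypothetical. A real deployment would read the file from the class path rather than from a string.

```java
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.Properties;

class HanlpPropertiesSketch {
    /** Parses properties-file text; stands in for reading hanlp.properties from the class path. */
    static Properties parse(String content) {
        Properties props = new Properties();
        try {
            props.load(new StringReader(content));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return props;
    }

    public static void main(String[] args) {
        // Made-up values for illustration; adjust to wherever the HanLP data directory lives.
        String sample = "root=/opt/hanlp/\n"
                + "CustomDictionaryPath=data/dictionary/custom/CustomDictionary.txt\n";
        Properties props = parse(sample);
        System.out.println("root = " + props.getProperty("root"));
        System.out.println("CustomDictionaryPath = " + props.getProperty("CustomDictionaryPath"));
    }
}
```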

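Stop Words and Synonyms

We recommend implementing these with the filters that ship with Lucene or Solr; this plugin does not try to take over that job. A sample configuration follows: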

Usage

When rewriting a query, you can make use of attributes such as the part of speech carried in HanLPAnalyzer's token output, for example:

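import com.hankcs.lucene.HanLPAnalyzer;
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PositionIncrementAttribute;
import org.apache.lucene.analysis.tokenattributes.TypeAttribute;

String text = "中華人民共和國很遼闊";
// Print every character followed by its index, so the offsets below are easy to read
for (int i = 0; i < text.length(); ++i)
{
    System.out.print(text.charAt(i) + "" + i + " ");
}
System.out.println();

Analyzer analyzer = new HanLPAnalyzer();
TokenStream tokenStream = analyzer.tokenStream("field", text);
tokenStream.reset();
while (tokenStream.incrementToken())
{
    CharTermAttribute attribute = tokenStream.getAttribute(CharTermAttribute.class);
    // Start and end offsets
    OffsetAttribute offsetAtt = tokenStream.getAttribute(OffsetAttribute.class);
    // Position increment (distance from the previous token)
    PositionIncrementAttribute positionAttr = tokenStream.getAttribute(PositionIncrementAttribute.class);
    // Part of speech
    TypeAttribute typeAttr = tokenStream.getAttribute(TypeAttribute.class);
    System.out.printf("[%d:%d %d] %s/%s\n", offsetAtt.startOffset(), offsetAtt.endOffset(), positionAttr.getPositionIncrement(), attribute, typeAttr.type());
}
// Finish and release the stream, as the TokenStream contract requires
tokenStream.end();
tokenStream.close();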
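One way the part-of-speech attribute might feed query rewriting is to keep only terms with selected tags. The following is a minimal sketch in plain Java with no Lucene dependency: the Token class, the keepByPos helper, and the hand-written tokens with their tags are all hypothetical stand-ins for what the loop above would print, not actual HanLPAnalyzer output.

```java
import java.util.ArrayList;
import java.util.List;

class PosQueryRewriteSketch {
    /** Hypothetical holder for one printed token: offsets, term text and part-of-speech tag. */
    static final class Token {
        final int start;
        final int end;
        final String term;
        final String pos;

        Token(int start, int end, String term, String pos) {
            this.start = start;
            this.end = end;
            this.term = term;
            this.pos = pos;
        }
    }

    /** Keeps the terms whose tag starts with any of the given prefixes and joins them with spaces. */
    static String keepByPos(List<Token> tokens, String... prefixes) {
        List<String> kept = new ArrayList<>();
        for (Token t : tokens) {
            for (String prefix : prefixes) {
                if (t.pos.startsWith(prefix)) {
                    kept.add(t.term);
                    break;
                }
            }
        }
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        String text = "中華人民共和國很遼闊";
        // Hand-written tokens for illustration only; the tags are assumed, not produced by HanLP here.
        List<Token> tokens = new ArrayList<>();
        tokens.add(new Token(0, 7, "中華人民共和國", "ns"));
        tokens.add(new Token(7, 8, "很", "d"));
        tokens.add(new Token(8, 10, "遼闊", "a"));

        // Offsets index into the original text, so each term can be recovered with substring.
        for (Token t : tokens) {
            if (!text.substring(t.start, t.end).equals(t.term)) {
                throw new IllegalStateException("offset mismatch for " + t.term);
            }
        }

        // Drop the adverb, keep noun-like and adjective terms.
        System.out.println(keepByPos(tokens, "n", "a")); // prints "中華人民共和國 遼闊"
    }
}
```

Filtering by tag prefix is only one possible policy; a real rewrite could just as well boost or down-weight terms by tag instead of dropping them.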

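In other scenarios, HanLPTokenizer can be constructed with a custom segmenter (for example, one with named entity recognition enabled, a Traditional Chinese segmenter, or a CRF segmenter):

import com.hankcs.hanlp.HanLP;
import com.hankcs.lucene.HanLPTokenizer;
import java.io.StringReader;

// Segmenter with Japanese name recognition and index mode; no stop-word filter, no Porter stemming
HanLPTokenizer tokenizer = new HanLPTokenizer(HanLP.newSegment()
        .enableJapaneseNameRecognize(true)
        .enableIndexMode(true), null, false);
tokenizer.setReader(new StringReader("林志玲亮相網友:肯定不是波多野結衣？"));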

Source: GitHub, 2019