Integrating the Chinese word segmenters word and mmseg4j with Solr 5.x

Date: 2019-11-08


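Using the standard tokenizer, as shown in the figure:
Using the word segmenter
1. Download word-1.3.jar; make sure the word release matches your Solr version
2. Copy word-1.3.jar into the folder C:\workspace\Tomcat7.0\webapps\solr\WEB-INF\lib\
3. Edit the file C:\workspace\solr_home\solr\mysolr\conf\schema.xml
Add the following nodes under the schema node: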
<fieldType name="word_cn" class="solr.TextField">
  <analyzer type="index">
    <tokenizer class="org.apdplat.word.solr.ChineseWordTokenizerFactory"/>
  </analyzer>
  <analyzer type="query">
    <tokenizer class="org.apdplat.word.solr.ChineseWordTokenizerFactory"/>
  </analyzer>
</fieldType>
As shown in the figure:
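1. Add a field that uses the segmenter
  <field name="content_wordsplit" type="word_cn" indexed="true" stored="true" multiValued="true"/>
2. Restart Tomcat
3. Verify the segmentation
4. The term 同程 turns out to be split apart by the segmenter, so "同程" needs to be added to the dictionary
5. Edit C:\workspace\solr_home\solr\mysolr\conf\schema.xml and change the field type as follows: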
  <fieldType name="word_cn" class="solr.TextField">
  <analyzer type="index">
  <tokenizer class="org.apdplat.word.solr.ChineseWordTokenizerFactory" conf="C:/workspace/solr_home/solr/mysolr/conf/word.local.conf"/>
  </analyzer>
  <analyzer type="query">
  <tokenizer class="org.apdplat.word.solr.ChineseWordTokenizerFactory" conf="C:/workspace/solr_home/solr/mysolr/conf/word.local.conf"/>
  </analyzer>
  </fieldType>
6. Create a new file word.local.conf in the folder C:\workspace\solr_home\solr\mysolr\conf\
7. Copy the contents of word.conf from GitHub into it, and also copy dic.txt and stopwords.txt
8. Edit word.local.conf
  dic.path=classpath:dic.txt,classpath:custom_dic,C:/workspace/solr_home/solr/mysolr/conf/word_dic.txt
  stopwords.path=classpath:stopwords.txt,classpath:custom_stopwords_dic,C:/workspace/solr_home/solr/mysolr/conf/word_stopwords.txt
  
  The complete word.local.conf after the change:
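  # Whether to enable auto-detection of changes, e.g. to user-defined dictionaries and stop-word dictionaries
  auto.detect=true
  # Dictionary implementation class: prefix trie indexed by the first character of each word
  #dic.class=org.apdplat.word.dictionary.impl.DictionaryTrie
  # Size of the first-character index of the prefix trie; if too small, collisions increase and lookups slow down
  dictionary.trie.index.size=24000
  # Double-array trie: slightly faster and slightly smaller in memory,
  # but limited in function: it supports neither adding or removing single entries dynamically nor batch changes;
  # the only supported way to change the dictionary dynamically is clear() followed by addAll()
  dic.class=org.apdplat.word.dictionary.impl.DoubleArrayDictionaryTrie
  # Pre-allocated size of the double-array trie; grows in 10% increments when it runs out
  double.array.dictionary.trie.size=2600000
  # Dictionaries; separate multiple dictionaries with commas
  # e.g. dic.path=classpath:dic.txt,classpath:custom_dic,d:/dic_more.txt,d:/DIC,D:/DIC2
  # Dictionary changes are detected automatically, covering files and folders on the classpath as well as absolute and relative paths outside it
  # HTTP resource: dic.path=http://localhost:8080/word_web/resources/dic.txt
  dic.path=classpath:dic.txt,classpath:custom_dic,C:/workspace/solr_home/solr/mysolr/conf/word_dic.txt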
  # Whether to use multiple cores to speed up segmentation
  parallel.seg=true
  # Part-of-speech tagging data: part.of.speech.dic.path=http://localhost:8080/word_web/resources/part_of_speech_dic.txt
  part.of.speech.dic.path=classpath:part_of_speech_dic.txt
  # Part-of-speech description data: part.of.speech.des.path=http://localhost:8080/word_web/resources/part_of_speech_des.txt
  part.of.speech.des.path=classpath:part_of_speech_des.txt
  # Bigram model path
  # HTTP resource: bigram.path=http://localhost:8080/word_web/resources/bigram.txt
  bigram.path=classpath:bigram.txt
  bigram.double.array.trie.size=5300000
  # Trigram model path
  # HTTP resource: trigram.path=http://localhost:8080/word_web/resources/trigram.txt
  trigram.path=classpath:trigram.txt
  trigram.double.array.trie.size=9800000
  # Whether to enable an n-gram model, and which one
  # Allowed values: no (disabled), bigram, trigram
  # If no n-gram model is enabled,
  # bidirectional maximum matching and bidirectional maximum-minimum matching degrade to reverse maximum matching,
  # and bidirectional minimum matching degrades to reverse minimum matching
  ngram=bigram
  # Stop-word dictionaries; separate multiple dictionaries with commas
  # e.g. stopwords.path=classpath:stopwords.txt,classpath:custom_stopwords_dic,d:/stopwords_more.txt
  # Dictionary changes are detected automatically, covering files and folders on the classpath as well as absolute and relative paths outside it
  # HTTP resource: stopwords.path=http://localhost:8080/word_web/resources/stopwords.txt
  stopwords.path=classpath:stopwords.txt,classpath:custom_stopwords_dic,C:/workspace/solr_home/solr/mysolr/conf/word_stopwords.txt
  # Punctuation used to split text, to speed up segmentation; each entry must be a single character
  # HTTP resource: punctuation.path=http://localhost:8080/word_web/resources/punctuation.txt
  punctuation.path=classpath:punctuation.txt
  # Maximum length of the string taken at a time during segmentation
  intercept.length=16
  # Common Chinese surnames, used for person-name recognition
  # HTTP resource: surname.path=http://localhost:8080/word_web/resources/surname.txt
  surname.path=classpath:surname.txt
  # Quantifiers (measure words)
  # HTTP resource: quantifier.path=http://localhost:8080/word_web/resources/quantifier.txt
  quantifier.path=classpath:quantifier.txt
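  # Whether to enable automatic person-name recognition
  person.name.recognize=true
  # Whether to keep whitespace characters
  keep.whitespace=false
  # Whether to keep punctuation; punctuation is defined in punctuation.txt
  keep.punctuation=false
  # Maximum number of words that may be merged into one
  word.refine.combine.max.length=3
  # Configuration file for fine-tuning segmentation results
  word.refine.path=classpath:word_refine.txt
  # Synonym dictionary
  word.synonym.path=classpath:word_synonym.txt
  # Antonym dictionary
  word.antonym.path=classpath:word_antonym.txt
  # Whether the Lucene, Solr, Elasticsearch, Luke and similar plugins enable tagging
  tagging.pinyin.full=false
  tagging.pinyin.acronym=false
  tagging.synonym=false
  tagging.antonym=false
  # Whether to enable the recognition tool for text such as English words, numbers and times
  recognition.tool.enabled=true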
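  # If you want to know exactly which words the word segmenter has loaded into its dictionary,
  # set the dic.dump.path option to a file path;
  # while loading its dictionaries, word also writes their contents to that path
  # Either a relative or an absolute path may be given
  # e.g.
  #dic.dump.path=dic.dump.txt
  #dic.dump.path=/Users/ysc/dic.dump.txt
  dic.dump.path=
  # Redis service, used to detect changes to HTTP resources in real time
  # Redis host
  redis.host=localhost
  # Redis port
  redis.port=6379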
9. Edit C:/workspace/solr_home/solr/mysolr/conf/word_dic.txt and add the dictionary entry: 同程
10. Restart Tomcat
11. Verify the segmentation result (see figure).
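
Step 9 comes down to adding one line to a plain-text dictionary. Below is a minimal Python sketch of that edit, assuming word_dic.txt is a UTF-8 file with one term per line. The helper name add_term is hypothetical and is not part of word or Solr. It appends the term only when it is not already listed, so it can be rerun safely.

```python
from pathlib import Path


def add_term(dic_file, term):
    """Append `term` on its own line unless it is already listed.

    Assumes a UTF-8 dictionary with one term per line.
    Returns True if the file was changed, False otherwise.
    """
    path = Path(dic_file)
    text = path.read_text(encoding="utf-8") if path.exists() else ""
    if term in (line.strip() for line in text.splitlines()):
        return False
    # Start a fresh line if the file does not already end with a newline.
    prefix = "" if text == "" or text.endswith("\n") else "\n"
    with path.open("a", encoding="utf-8") as fh:
        fh.write(f"{prefix}{term}\n")
    return True


if __name__ == "__main__":
    # Path taken from step 9; adjust it to your own installation.
    changed = add_term("C:/workspace/solr_home/solr/mysolr/conf/word_dic.txt", "同程")
    print("dictionary updated" if changed else "term already present")
```

The same helper would apply to any other one-term-per-line dictionary file, under the same encoding assumption.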
Using the mmseg4j segmenter
1. Download mmseg4j, e.g. mmseg4j-core-1.10.1-SNAPSHOT.jar and mmseg4j-solr-2.3.1-SNAPSHOT.jar, plus the dictionary folder data/
2. Copy mmseg4j-core-1.10.1-SNAPSHOT.jar and mmseg4j-solr-2.3.1-SNAPSHOT.jar into the folder C:\workspace\Tomcat7.0\webapps\solr\WEB-INF\lib\
3. Edit the file C:\workspace\solr_home\solr\mysolr\conf\schema.xml
Add the following nodes under the schema node:
<fieldtype name="textComplex" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex"/>
</analyzer>
</fieldtype>
<fieldtype name="textMaxWord" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="max-word"/>
</analyzer>
</fieldtype>
<fieldtype name="textSimple" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="simple"/>
</analyzer>
</fieldtype>
1. Add a field that uses the segmenter
  <field name="content_test" type="textMaxWord" indexed="true" stored="true" multiValued="true"/>
2. Restart Tomcat
3. Verify the segmentation
4. Add a dictionary: edit C:\workspace\solr_home\solr\mysolr\conf\schema.xml as follows
<fieldtype name="textComplex" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex" dicPath="data/dic/"/>
</analyzer>
</fieldtype>
<fieldtype name="textMaxWord" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="max-word" dicPath="data/dic/"/>
</analyzer>
</fieldtype>
<fieldtype name="textSimple" class="solr.TextField" positionIncrementGap="100">
<analyzer>
<tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="simple" dicPath="data/dic/" />
</analyzer>
</fieldtype>
1. Copy the bundled dictionaries into the folder C:\workspace\solr_home\solr\mysolr\data\dic\, as shown in the figure:
2. Edit words.dic and add the term "同程"
3. Restart Tomcat
4. Verify the segmentation
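
One way to carry out the verification steps without the admin UI is to query Solr's field analysis handler. The sketch below only builds the request URL. It assumes Solr is reachable at http://localhost:8080/solr, that the core is named mysolr, and that the /analysis/field handler is available in this setup. None of these is stated in the original, so adjust them as needed. The function name analysis_url and the sample text are illustrative.

```python
from urllib.parse import urlencode


def analysis_url(base_url, core, field_type, text):
    """Build a field-analysis URL for one field type and one sample text."""
    params = {
        "analysis.fieldtype": field_type,  # e.g. word_cn or textMaxWord
        "analysis.fieldvalue": text,       # text to run through the index analyzer
        "wt": "json",
    }
    return f"{base_url.rstrip('/')}/{core}/analysis/field?{urlencode(params)}"


if __name__ == "__main__":
    # Assumed base URL and core name; the field types come from the schema.xml snippets above.
    for field_type in ("word_cn", "textMaxWord"):
        print(analysis_url("http://localhost:8080/solr", "mysolr", field_type, "同程旅游"))
```

Opening a printed URL in a browser, or fetching it with urllib.request.urlopen, should list the tokens produced at each analyzer stage. Once the dictionary change has taken effect, 同程 would be expected to show up as a single token.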