Solr 4.8 + MySQL database import + mmseg4j Chinese full-text indexing: configuration notes

Date: 2019-11-11


When reposting, please credit the source: http://www.cnblogs.com/chlde/p/3768733.html

1. For how to deploy Solr, please refer to the earlier article.

2. Once Solr is set up as described there, the solr_home folder will contain a collection1 folder, which is one Solr instance. Let's look at what collection1 contains.

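collection1 contains two subfolders, conf and data. data contains tlog and index (if they are missing, don't worry: they will be created later when Solr builds the index). tlog holds the transaction logs and index holds the index itself. conf contains a lang folder and a number of files. The lang folder holds word-list files, but Solr ships without a Chinese dictionary, so we will add one to this folder later. conf also contains several XML files; for our purposes, only solrconfig.xml and schema.xml need to be configured. The following sections explain how to configure these two files.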

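3. Configure solrconfig.xml first. solrconfig.xml is Solr's core configuration file. It contains the jar references, the path to the database import configuration, and the request handler (operation interface) configuration.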

The jar configuration is as follows:

    <lib dir="../contrib/extraction/lib" regex=".*\.jar" />
    <lib dir="../dist/" regex="solr-cell-\d.*\.jar" />

    <lib dir="../contrib/clustering/lib/" regex=".*\.jar" />
    <lib dir="../dist/" regex="solr-clustering-\d.*\.jar" />

    <lib dir="../contrib/langid/lib/" regex=".*\.jar" />
    <lib dir="../dist/" regex="solr-langid-\d.*\.jar" />

    <lib dir="../contrib/velocity/lib" regex=".*\.jar" />
    <lib dir="../dist/" regex="solr-velocity-\d.*\.jar" />

    <lib dir="../contrib/dataimporthandler/lib" regex=".*\.jar" />
    <lib dir="../dist/" regex="solr-dataimporthandler-\d.*\.jar" />

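The last two lines load the data import handler, which includes the jars needed to read data from a database. The directories for these jars are all under the solr_home\contrib folder. Note that the MySQL JDBC driver jar (Connector/J) is not bundled with Solr, so it must also be available on the classpath.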
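To see why the dist patterns contain `\d`, here is a minimal Python sketch of how such a regex selects jar files. The file names in `dist_files` are hypothetical, and the sketch assumes Solr applies the regex to the whole file name (mirrored here with `re.fullmatch`); it is an illustration, not Solr's actual loading code.

```python
import re

# Regex copied from the <lib> directive for the data import handler.
DIH_JAR_PATTERN = re.compile(r"solr-dataimporthandler-\d.*\.jar")


def select_jars(file_names, pattern=DIH_JAR_PATTERN):
    """Return the file names that the pattern matches in full."""
    # Assumption: the regex must match the entire file name.
    return [name for name in file_names if pattern.fullmatch(name)]


# Hypothetical contents of the dist folder.
dist_files = [
    "solr-dataimporthandler-4.8.0.jar",
    "solr-dataimporthandler-extras-4.8.0.jar",
    "solr-core-4.8.0.jar",
]

selected = select_jars(dist_files)
print(selected)
```

Under that assumption, the `\d` after the final hyphen requires a version digit at that position, so the "extras" jar is not picked up by this particular directive.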

Configure the DataImportHandler:

    <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
        <lst name="defaults">
          <str name="config">data-config.xml</str>
        </lst>
    </requestHandler>

This requires you to create a new XML file in the conf folder, named data-config.xml, with the following content:

<dataConfig>
   <dataSource type="JdbcDataSource"
               driver="com.mysql.jdbc.Driver"
               url="jdbc:mysql://localhost/yourDBname"
               user="root"
               password="root"/>
   <document>
     <entity name="question1" query="select Guid,title,QuesBody,QuesParse,QuesType from question1 where Guid is not null">
        <field column="Guid" name="id"/>
        <field column="title" name="question1_title"/>
        <field column="QuesBody" name="question1_body"/>
        <field column="QuesParse" name="question1_parse"/>
        <field column="QuesType" name="question1_type"/>
     </entity>
     <entity name="question2" query="select Guid,title,QuesBody,QuesParse,QuesType from question2 where Guid is not null">
        <field column="Guid" name="id"/>
        <field column="title" name="question2_title"/>
        <field column="QuesBody" name="question2_body"/>
        <field column="QuesParse" name="question2_parse"/>
        <field column="QuesType" name="question2_type"/>
     </entity>
   </document>
</dataConfig>

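As shown above, the file contains two top-level tags, dataSource and document. dataSource, as the name suggests, holds the database connection settings. document contains the entities; each entity is one action that reads data from the database.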

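query is the SQL used to read the data, and the field elements list how database columns are mapped to fields in the schema. This is explained in more detail in the schema.xml section below.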
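To make the column-to-field mapping concrete, the following Python sketch parses an abbreviated copy of the data-config.xml above and lists, per entity, which database column feeds which Solr field. The helper name `field_mappings` is made up for this illustration; it only reads the XML and does not talk to Solr or MySQL.

```python
import xml.etree.ElementTree as ET

# Abbreviated copy of the data-config.xml above (one entity, two fields).
DATA_CONFIG = """
<dataConfig>
  <document>
    <entity name="question1" query="select Guid,title from question1 where Guid is not null">
      <field column="Guid" name="id"/>
      <field column="title" name="question1_title"/>
    </entity>
  </document>
</dataConfig>
"""


def field_mappings(config_xml):
    """Map each entity name to its {database column: Solr field} pairs."""
    root = ET.fromstring(config_xml)
    return {
        entity.get("name"): {
            field.get("column"): field.get("name")
            for field in entity.findall("field")
        }
        for entity in root.iter("entity")
    }


mappings = field_mappings(DATA_CONFIG)
print(mappings)
```

A listing like this can be a quick way to compare the import configuration against the field names declared in schema.xml.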

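Back in solrconfig.xml, each requestHandler defines an endpoint that responds to HTTP requests. With the /dataimport handler configured above, once the servlet container has started, visiting http://localhost:8080/solr/collection1/dataimport shows the status of the data import. To run a command, request http://localhost:8080/solr/collection1/dataimport?command=full-import (this rebuilds the whole index; the previous index is deleted). For other commands, see http://www.cnblogs.com/llz5023/archive/2012/11/15/2772154.html. In the same way, other request handlers let you add, delete, update, and query documents in Solr. requestHandler also supports more advanced configuration; interested readers can consult apache-solr-ref-guide-4.8.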
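The two URLs above differ only in the command parameter, which the following Python sketch makes explicit. The function name `dataimport_url` and its default arguments are illustrative choices based on the host, port, and core used in this article; adjust them to your own deployment.

```python
from urllib.parse import urlencode


def dataimport_url(command=None,
                   base="http://localhost:8080/solr",
                   core="collection1"):
    """Build the URL of the /dataimport handler, optionally with a command."""
    url = f"{base}/{core}/dataimport"
    if command is not None:
        # urlencode takes care of escaping the parameter value.
        url += "?" + urlencode({"command": command})
    return url


status_url = dataimport_url()
full_import_url = dataimport_url("full-import")
print(status_url)
print(full_import_url)
```

The resulting strings can be opened in a browser or passed to any HTTP client; the sketch itself sends no request.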

4. Configuring schema.xml. schema.xml defines the types of the indexed data and some index-related behavior (such as the tokenization method).

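Solr requires every indexed document to have an id that serves as its primary key, and the import query must return a column that is mapped to this id; otherwise an error is reported. In data-config.xml, for example, Guid is mapped to id, so Guid serves as the primary key.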
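Since a missing id mapping only shows up as an error at import time, a small pre-check can help. The Python sketch below lists entities in a data-config document that have no explicit field mapped to id. The entity question3 is a hypothetical broken example, and the check only looks at explicit field elements, so treat it as a rough aid rather than a full validation.

```python
import xml.etree.ElementTree as ET

# question1 follows the article; question3 is a hypothetical entity
# that forgets to map any column to "id".
DATA_CONFIG = """
<dataConfig>
  <document>
    <entity name="question1" query="select Guid,title from question1">
      <field column="Guid" name="id"/>
      <field column="title" name="question1_title"/>
    </entity>
    <entity name="question3" query="select Guid,title from question3">
      <field column="title" name="question3_title"/>
    </entity>
  </document>
</dataConfig>
"""


def entities_without_key(config_xml, key_field="id"):
    """Return names of entities with no explicit mapping to key_field."""
    root = ET.fromstring(config_xml)
    return [
        entity.get("name")
        for entity in root.iter("entity")
        if not any(f.get("name") == key_field for f in entity.findall("field"))
    ]


missing = entities_without_key(DATA_CONFIG)
print(missing)
```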

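field is the basic unit of a Solr index. Its type attribute refers to a fieldType by name: each field is assigned a fieldType through type, and the fieldType determines how the field is indexed.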

For example, we will use mmseg4j to index Chinese text:

    <!-- Chinese -->
    <fieldType name="text_chn_complex" class="solr.TextField" >
      <analyzer>
        <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="complex" dicPath="lang/chn.txt"/>
      </analyzer>
    </fieldType>
    <fieldType name="text_chn_maxword" class="solr.TextField" >
      <analyzer>
        <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="max-word" dicPath="lang/chn.txt"/>
      </analyzer>
    </fieldType>
    <fieldType name="text_chn_simple" class="solr.TextField" >
      <analyzer>
        <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory" mode="simple" dicPath="lang/chn.txt"/>
      </analyzer>
    </fieldType>

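Above we define three fieldTypes, representing three ways of indexing Chinese text. All of them use the solr.TextField class, and all use the mmseg4j tokenizer in their analyzer; only the mode differs. dicPath is the location of the dictionary.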

    <field name="question1_type" type="text_chn_maxword" indexed="true" stored="true"/>

This defines a field named question1_type that is indexed using the text_chn_maxword fieldType.
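Because a field refers to its fieldType only by name, a typo in type is easy to make. The following Python sketch checks a schema fragment for fields whose type has no matching fieldType. The schema wrapper element, the second field, and the name text_chn_typo are made up for the illustration (the last one is a deliberate misspelling).

```python
import xml.etree.ElementTree as ET

# One fieldType from the article, one correct field, and one field with a
# deliberately misspelled, hypothetical type name.
SCHEMA_FRAGMENT = """
<schema name="example" version="1.5">
  <fieldType name="text_chn_maxword" class="solr.TextField">
    <analyzer>
      <tokenizer class="com.chenlb.mmseg4j.solr.MMSegTokenizerFactory"
                 mode="max-word" dicPath="lang/chn.txt"/>
    </analyzer>
  </fieldType>
  <field name="question1_type" type="text_chn_maxword" indexed="true" stored="true"/>
  <field name="question1_title" type="text_chn_typo" indexed="true" stored="true"/>
</schema>
"""


def undefined_field_types(schema_xml):
    """Return {field name: type} for fields whose type is not declared."""
    root = ET.fromstring(schema_xml)
    defined = {ft.get("name") for ft in root.iter("fieldType")}
    return {
        field.get("name"): field.get("type")
        for field in root.iter("field")
        if field.get("type") not in defined
    }


problems = undefined_field_types(SCHEMA_FRAGMENT)
print(problems)
```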

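One thing to note: a query term is matched against a single field, so to match one keyword across several fields at once, a convenient approach is to use copyField.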

For example:

    <field name="question2_title" type="text_chn_maxword" indexed="true" stored="true"/>
    <field name="question2_body" type="text_chn_maxword" indexed="true" stored="true"/>

    <field name="question2_text" type="text_chn_maxword" indexed="true" stored="true"  multiValued="true"/>
    <copyField source="question2_title" dest="question2_text"/>
    <copyField source="question2_body" dest="question2_text"/>

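This indexes question2_title and question2_body together into question2_text, so a query against question2_text matches whenever the keyword matches either question2_title or question2_body. Note the multiValued="true" on question2_text; it is required, because two source fields are copied into it.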
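The effect of the two copyField rules can be imitated with a few lines of Python. This is only a simplified model of what ends up in the destination field, with made-up sample values, and it assumes the incoming document does not already contain the destination field; Solr itself applies the rules at index time.

```python
def apply_copy_fields(doc, copy_rules):
    """Return a copy of doc where each dest field collects its source values."""
    result = dict(doc)
    for source, dest in copy_rules:
        if source in doc:
            # The destination accumulates one value per matching source,
            # which is why it has to be multi-valued.
            result[dest] = result.get(dest, []) + [doc[source]]
    return result


# The two copyField rules from the schema above, as (source, dest) pairs.
COPY_RULES = [
    ("question2_title", "question2_text"),
    ("question2_body", "question2_text"),
]

# Hypothetical document values.
doc = {"id": "1", "question2_title": "title text", "question2_body": "body text"}

indexed = apply_copy_fields(doc, COPY_RULES)
print(indexed["question2_text"])
```

The destination ends up holding two values, one from each source field, which is the situation multiValued="true" allows for.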

5. Problems encountered

Chinese dictionary download:

http://download.labs.sogou.com/dl/sogoulabdown/SogouW/SogouW.zip

mmseg4j 2.0 or later is required; versions below 2.0 have bugs under Solr 4.8.

https://code.google.com/p/mmseg4j/

Java engineer: chlde2500@gmail.com