SOLR (全文檢索)


 

http://sinykk.iteye.com/

1. What is SOLR

Official wiki:

http://wiki.apache.org/solr

http://wiki.apache.org/solr/DataImportHandler

This document uses Solr 3.4, Tomcat 6.3, and IKAnalyzer 3.2.5 Stable as its examples.

 

 

1.1. What is SOLR

 

Solr is a high-performance full-text search server developed in Java 5 and built on Lucene. It extends Lucene with a richer query language, is configurable and extensible, optimizes query performance, and ships with a full-featured administration interface, making it an excellent full-text search engine.

Documents are added to a search collection over HTTP as XML. Queries against the collection are also made over HTTP and return an XML or JSON response. Its main features include efficient, flexible caching and vertical search.
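That HTTP round trip can be sketched in Python; the base URL and field names below are placeholders, not part of any real installation:

```python
import urllib.parse
import xml.etree.ElementTree as ET

SOLR_BASE = "http://localhost:8080/solr"  # placeholder host/port

def build_add_xml(doc_fields):
    """Build the <add><doc>... XML body that is POSTed to /update."""
    add = ET.Element("add")
    doc = ET.SubElement(add, "doc")
    for name, value in doc_fields.items():
        field = ET.SubElement(doc, "field", name=name)
        field.text = str(value)
    return ET.tostring(add, encoding="unicode")

def build_select_url(query, wt="json"):
    """Build the /select URL; the response comes back as XML or JSON."""
    params = urllib.parse.urlencode({"q": query, "wt": wt})
    return f"{SOLR_BASE}/select?{params}"

xml_body = build_add_xml({"id": "1", "title": "hello solr"})
url = build_select_url("title:solr")
```

In a real deployment the XML body would be POSTed to `{SOLR_BASE}/update` followed by a commit, and the select URL fetched with any HTTP client.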

 

 

 

1.2. When to use it

1. When the database data you search has a non-integer primary key, e.g. a UUID.

2. When searching any kind of text document, including RSS feeds, email, and so on.

 

 

2. How to use Solr

 

Install the SOLR server on a Windows or Linux machine, configure the appropriate index rules, and then call it to query the data from Java, PHP, or another scripting language.

2.1. Installing Solr on Windows

1. Download the required software and install and configure Tomcat.

The downloads are Tomcat, Solr, and JDK 1.6, all freely available from the official sites.

2. In the Tomcat configuration file conf\server.xml,

add the encoding attribute URIEncoding="UTF-8" (without it, Chinese search terms arrive garbled and match nothing).
After the change:
<Connector port="8983" protocol="HTTP/1.1" connectionTimeout="20000"
redirectPort="8443" URIEncoding="UTF-8" />
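Why URIEncoding matters: the client sends Chinese query terms percent-encoded as UTF-8 bytes, and Tomcat must decode them with the same charset. A quick Python illustration of what is actually on the wire:

```python
from urllib.parse import quote, unquote

term = "中文"                      # a Chinese search term
encoded = quote(term)              # what the browser/client sends
# UTF-8 percent-encoding: each character becomes three %XX bytes
assert encoded == "%E4%B8%AD%E6%96%87"

# With URIEncoding="UTF-8" Tomcat decodes this back to the original term:
assert unquote(encoded) == "中文"
# Without it, Tomcat 6 falls back to ISO-8859-1, the decoded term is
# garbled, and the search silently matches nothing.
```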

3. Unpack D:\solr\apache-solr-3.3.0.

4. Create the home directory d:/solr/home (adjust to your own setup) and copy D:\solr\apache-solr-3.3.0\example\solr into it.

5. Create a solr.home environment variable set to d:/solr/home.

6. Copy solr.war into Tomcat's webapps directory; it is unpacked automatically on startup.

7. Edit D:\resouce\java\tomcat\webapps\solr\WEB-INF\web.xml:

<env-entry>
<env-entry-name>solr/home</env-entry-name>
<env-entry-value>d:\solr\home</env-entry-value>
<env-entry-type>java.lang.String</env-entry-type>
</env-entry>
8. Start Tomcat and open http://localhost:8080/solr/ in a browser.
9. If the page appears, the deployment succeeded.

 

      

2.2. Installing Solr on Linux

This Linux installation also wires in the word-segmentation (Chinese analyzer) support directly.

1. Unpack Tomcat to /usr/local/apache-tomcat-6.0.33/.

2. Copy the /solr/apache-solr-3.3.0/example/solr directory to /usr/local/apache-tomcat-6.0.33/.

3. Then edit Tomcat's /usr/local/apache-tomcat-6.0.33/conf/server.xml [adds Chinese support]:

<Connector port="8983" protocol="HTTP/1.1"

               connectionTimeout="20000"

               redirectPort="8443" URIEncoding="UTF-8"/>


4. Create the file /usr/local/apache-tomcat-6.0.33/conf/Catalina/localhost/solr.xml with the following content:

<?xml version="1.0" encoding="UTF-8"?>

<Context docBase="/usr/local/apache-tomcat-6.0.33/webapps/solr" debug="0" crossContext="true" >

   <Environment name="solr/home" type="java.lang.String" value="/usr/local/apache-tomcat-6.0.33/solr" override="true" />

</Context>


5. Copy /sinykk/solr/apache-solr-3.3.0/example/webapps/solr.war into the /usr/local/apache-tomcat-6.0.33/webapps folder and start Tomcat.

6. Put the /sinykk/solr/IKAnalyzer3.2.8.jar file into the /usr/local/apache-tomcat-6.0.33/webapps/solr/WEB-INF/lib directory.

7. Change the /usr/local/apache-tomcat-6.0.33/solr/conf/schema.xml file to:

<?xml version="1.0" encoding="UTF-8" ?>

<schema name="example" version="1.4">

 <types>

    <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>

     <!--

    <fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">

      <analyzer type="index">

        <tokenizer class="solr.StandardTokenizerFactory"/>

        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />

        <filter class="solr.LowerCaseFilterFactory"/>

      </analyzer>

      <analyzer type="query">

        <tokenizer class="solr.StandardTokenizerFactory"/>

        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />

        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>

        <filter class="solr.LowerCaseFilterFactory"/>

      </analyzer>

    </fieldType>

    -->

 

     <fieldType name="textik" class="solr.TextField" >

               <analyzer class="org.wltea.analyzer.lucene.IKAnalyzer"/> 

      

               <analyzer type="index"> 

                   <tokenizer class="org.wltea.analyzer.solr.IKTokenizerFactory" isMaxWordLength="false"/> 

                   <filter class="solr.StopFilterFactory" 

                           ignoreCase="true" words="stopwords.txt"/> 

                   <filter class="solr.WordDelimiterFilterFactory" 

                           generateWordParts="1" 

                           generateNumberParts="1" 

                           catenateWords="1" 

                           catenateNumbers="1" 

                           catenateAll="0" 

                           splitOnCaseChange="1"/> 

                   <filter class="solr.LowerCaseFilterFactory"/> 

                   <filter class="solr.EnglishPorterFilterFactory" 

                       protected="protwords.txt"/> 

                   <filter class="solr.RemoveDuplicatesTokenFilterFactory"/> 

               </analyzer> 

              <analyzer type="query"> 

                   <tokenizer class="org.wltea.analyzer.solr.IKTokenizerFactory" isMaxWordLength="false"/> 

                   <filter class="solr.StopFilterFactory" 

                           ignoreCase="true" words="stopwords.txt"/> 

                   <filter class="solr.WordDelimiterFilterFactory" 

                           generateWordParts="1" 

                           generateNumberParts="1" 

                           catenateWords="1" 

                           catenateNumbers="1" 

                           catenateAll="0" 

                           splitOnCaseChange="1"/> 

                   <filter class="solr.LowerCaseFilterFactory"/> 

                   <filter class="solr.EnglishPorterFilterFactory" 

                       protected="protwords.txt"/> 

                   <filter class="solr.RemoveDuplicatesTokenFilterFactory"/> 

               </analyzer> 

      

</fieldType>

 </types>

 

 

 <fields>

  <field name="id" type="string" indexed="true" stored="true" required="true" />

 </fields>

 

 <uniqueKey>id</uniqueKey>

 

</schema>


Finally, open http://192.168.171.129:8983/solr/admin/analysis.jsp to verify the analyzer.

 

 

 

2.3. Using a MySQL database as a Solr index data source [pay attention to the exact format]

Reference: http://digitalpbk.com/apachesolr/apache-solr-mysql-sample-data-config

1. Add the following to solrconfig.xml to enable data import:

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">  

       <lst name="defaults">  

           <str name="config">data-config.xml</str>  

       </lst>  

</requestHandler>

 

 

 

2. Add a data source file, data-config.xml:

 

<dataConfig>

    <dataSource type="JdbcDataSource"

   driver="com.mysql.jdbc.Driver"

   url="jdbc:mysql://localhost/test"

   user="root"

   password=""/>

    <document name="content">

        <entity name="node" query="select id,name,title from solrdb">

            <field column="id" name="id" />

            <field column="name" name="name" />

            <field column="title" name="title" />

        </entity>

    </document>

</dataConfig>

 

3. Create the schema.xml:

<?xml version="1.0" encoding="UTF-8" ?>

<schema name="example" version="1.4">

  <types>   

     <fieldType name="tint" class="solr.TrieIntField" precisionStep="8" omitNorms="true" positionIncrementGap="0"/>

 

     <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>

 

     <fieldType name="text" class="solr.TextField" positionIncrementGap="100">

      <analyzer type="index">

        <tokenizer class="solr.StandardTokenizerFactory"/>

        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />

        <filter class="solr.LowerCaseFilterFactory"/>

      </analyzer>

      <analyzer type="query">

        <tokenizer class="solr.StandardTokenizerFactory"/>

        <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />

        <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>

        <filter class="solr.LowerCaseFilterFactory"/>

      </analyzer>

    </fieldType>

 

     <fieldType name="date" class="solr.TrieDateField" omitNorms="true" precisionStep="0" positionIncrementGap="0"/> 

</types>

 

 

 <fields>

   <field name="id" type="string" indexed="true" stored="true" required="true" />

   <field name="title" type="string" indexed="true" stored="true"/>

   <field name="contents" type="text" indexed="true" stored="true"/>

 </fields>

 

 <uniqueKey>id</uniqueKey>

 <defaultSearchField>contents</defaultSearchField>

 <solrQueryParser defaultOperator="OR"/>

<copyField source="title" dest="contents"/>

 

</schema>


Important fields in schema.xml

SOLR can only search across several fields if copyField is set up [the settings below make one search hit the values of title, name, and contents]:
<defaultSearchField>contents</defaultSearchField>
copyField copies the value of one field into another. For example, copying name into the default field means a search will also find what is stored in name:
<copyField source="name" dest="contents"/>
<copyField source="title" dest="contents"/>

4. Build the index:

http://192.168.171.129:8983/solr/dataimport?command=full-import

Note: make sure the database connection details are correct.
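Triggering and checking the import can be done from any HTTP client. The sketch below builds the DataImportHandler command URL and parses the status response; the sample response is hand-written, so treat the exact element names as an assumption to verify against your Solr version:

```python
import urllib.parse
import xml.etree.ElementTree as ET

def dataimport_url(base, command):
    """Build a DataImportHandler command URL (full-import, delta-import, status)."""
    return f"{base}/dataimport?{urllib.parse.urlencode({'command': command})}"

def docs_processed(status_xml):
    """Pull 'Total Documents Processed' out of a DIH status response."""
    root = ET.fromstring(status_xml)
    for msg in root.iter("str"):
        if msg.get("name") == "Total Documents Processed":
            return int(msg.text)
    return None

# Hand-written sample of a status response (shape assumed from Solr 3.x):
SAMPLE = """<response>
  <lst name="statusMessages">
    <str name="Total Documents Processed">42</str>
  </lst>
</response>"""

url = dataimport_url("http://192.168.171.129:8983/solr", "full-import")
```

Fetching `command=status` after a full-import and parsing it this way is a simple way to confirm the database connection worked.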

 

 

 

2.4. Multiple indexes in one SOLR (multiple cores)

Reference: http://wiki.apache.org/solr/CoreAdmin

1. Configure the cores in solr.xml:

<solr persistent="true" sharedLib="lib">

 <cores adminPath="/admin/cores">

  <core name="core0" instanceDir="core0" dataDir="D:\solr\home\core0\data"/>

  <core name="core1" instanceDir="core1" dataDir="D:\solr\home\core1\data" />

 </cores>

</solr>
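Once the cores exist, every admin, search, and dataimport URL simply gains the core name as an extra path segment. A tiny helper makes the pattern explicit (host and port are placeholders):

```python
def core_url(base, core, handler):
    """Multi-core URLs: /solr/<core>/<handler> instead of /solr/<handler>."""
    return f"{base}/{core}/{handler}"

base = "http://localhost:8080/solr"   # placeholder
admin0  = core_url(base, "core0", "admin/")
select1 = core_url(base, "core1", "select")
import1 = core_url(base, "core1", "dataimport?command=full-import")
```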

2. Copy the core0 and core1 directories from D:\solr\apache-solr-3.3.0\example\multicore into D:\solr\home, leaving the directories and files already in D:\solr\home unchanged.

Note: D:\solr\home is the copy of D:\solr\apache-solr-3.3.0\example\solr made earlier.

3. Create the two index data directories:
D:\solr\home\core0\data
D:\solr\home\core1\data

4. Modify one of the cores, e.g. CORE1: change its solrconfig.xml to the code below.
[Note: the lib tags must be added because DataImportHandler otherwise reports errors; this may be a bug upstream.]

 

<?xml version="1.0" encoding="UTF-8" ?>

<config>

  <luceneMatchVersion>LUCENE_33</luceneMatchVersion>

  <directoryFactory name="DirectoryFactory" class="${solr.directoryFactory:solr.StandardDirectoryFactory}"/>

 

  <lib dir="D:/solr/apache-solr-3.3.0/contrib/extraction/lib" />

  <lib dir="D:/solr/apache-solr-3.3.0/dist/" regex="apache-solr-cell-\d.*\.jar" />

  <lib dir="D:/solr/apache-solr-3.3.0/dist/" regex="apache-solr-clustering-\d.*\.jar" />

  <lib dir="D:/solr/apache-solr-3.3.0/dist/" regex="apache-solr-dataimporthandler-\d.*\.jar" />

  <lib dir="D:/solr/apache-solr-3.3.0/contrib/clustering/lib/" />

  <lib dir="/total/crap/dir/ignored" />

  <updateHandler class="solr.DirectUpdateHandler2" />

 

  <requestDispatcher handleSelect="true" >

    <requestParsers enableRemoteStreaming="false" multipartUploadLimitInKB="2048" />

  </requestDispatcher>

 

  <requestHandler name="standard" class="solr.StandardRequestHandler" default="true" />

  <requestHandler name="/update" class="solr.XmlUpdateRequestHandler" />

  <requestHandler name="/admin/" class="org.apache.solr.handler.admin.AdminHandlers" />

  <admin>

    <defaultQuery>solr</defaultQuery>

  </admin>

<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">  

                 <lst name="defaults">  

                          <str name="config">data-config.xml</str>  

                 </lst>  

</requestHandler>

</config>


Finally, open http://localhost:8080/solr/core1/admin/.

 

 

2.5. Near-real-time full-text search (delta indexing)

Every search runs against the index as of its last build, so newly added rows cannot be found until they are indexed. The configuration below gets close to real-time search.

The idea: use two data sources and two indexes, a main index for data that rarely or never changes, and a delta index for newly added documents.

The main change is to the data-config.xml data source:

 

<dataConfig>

    <dataSource driver="com.mysql.jdbc.Driver" url="jdbc:mysql://localhost/demo" user="root" password=""/>

    <document name="products">

       <entity name="item" pk="id"

          query="SELECT id,title,contents,last_index_time FROM solr_articles"

          deltaImportQuery="SELECT id,title,contents,last_index_time FROM solr_articles

            WHERE id = '${dataimporter.delta.id}'"

          deltaQuery="SELECT id FROM solr_articles

            WHERE last_index_time > '${dataimporter.last_index_time}'">

        </entity>

    </document>

</dataConfig>
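The two-step contract of deltaQuery/deltaImportQuery can be simulated in plain Python: deltaQuery returns only the ids of rows changed since the last run, and deltaImportQuery then fetches the full row for each id. The in-memory list stands in for the MySQL table; the real work is done by DataImportHandler:

```python
from datetime import datetime

# In-memory stand-in for the solr_articles table
rows = [
    {"id": 1, "title": "old", "contents": "...", "last_index_time": datetime(2011, 1, 1)},
    {"id": 2, "title": "new", "contents": "...", "last_index_time": datetime(2011, 6, 1)},
]

def delta_query(table, last_index_time):
    """deltaQuery: ids of rows changed since the last import."""
    return [r["id"] for r in table if r["last_index_time"] > last_index_time]

def delta_import_query(table, row_id):
    """deltaImportQuery: the full row for one changed id."""
    return next(r for r in table if r["id"] == row_id)

# Last import ran in March, so only row 2 is re-indexed:
changed = delta_query(rows, datetime(2011, 3, 1))
docs = [delta_import_query(rows, i) for i in changed]
```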

Mind how the tables are created: in this example the solr_articles table has a last_index_time (timestamp) column, and every insert or update must also set last_index_time so the delta import can pick the row up.

If something goes wrong, check Tomcat's log files immediately.

Run: http://192.168.171.129:8983/solr/dataimport?command=delta-import

If the result is not what you expected, check the date recorded in the dataimport.properties file and run the resulting SQL by hand to track the problem down.

 

 

 

With the main and delta indexes in place, set up two scheduled jobs (linux crontab):

A delta job every five minutes: refresh the delta index and merge it with the main index, so that everything older than five minutes is searchable.

A main-index rebuild at 2 a.m. every day that also clears the delta index, keeping the main index efficient and avoiding duplicated data.
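Under those assumptions the two crontab entries could look like the fragment below (host, core path, and the use of curl are placeholders; clean=false keeps the delta import from wiping the index, while the nightly full-import rebuilds it from scratch):

```shell
# /etc/crontab fragment -- host and paths are placeholders
# Every 5 minutes: delta import (only rows whose last_index_time advanced)
*/5 * * * * root curl -s 'http://localhost:8983/solr/dataimport?command=delta-import&clean=false&commit=true' > /dev/null
# Every day at 02:00: full rebuild of the main index
0 2 * * * root curl -s 'http://localhost:8983/solr/dataimport?command=full-import&clean=true&commit=true' > /dev/null
```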

 

 

 

2.6. Distributed indexing

Solr "distribution" is really replication, much like MySQL replication: all index changes happen on the master and all queries run on the slaves, which periodically pull from the master to stay consistent.

Reference: http://chenlb.blogjava.net/archive/2008/07/04/212398.html
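For reference, master/slave replication in Solr 1.4/3.x is configured through the ReplicationHandler in solrconfig.xml. A minimal sketch; the master host name and poll interval here are examples:

```xml
<!-- On the master: publish the index after every commit -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">commit</str>
    <str name="confFiles">schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>

<!-- On each slave: poll the master every five minutes -->
<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://master-host:8983/solr/replication</str>
    <str name="pollInterval">00:05:00</str>
  </lst>
</requestHandler>
```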

 

 

2.7. Improving result accuracy

To make the search results more accurate you can:

1. Build your own dictionary for the word segmenter.

2. Update the index through documents whenever data is inserted, updated, or deleted.

3. Use delta indexing with scheduled updates.

 

2.8. Configuring SOLR word segmentation

See the Linux installation section of this document.

 

2.9. The SOLR PHP client

Use PHP to access the index data held in SOLR.

Reference: http://code.google.com/p/solr-php-client/

A simple example: http://code.google.com/p/solr-php-client/wiki/ExampleUsage

Note: this is comparable to the Sphinx search engine, which is written in C.

 

3. Other references
