Indexing MySQL Data with Solr

Date: 2019-12-07


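MySQL is used for testing here.

1. First, create a table named solr_test in MySQL. (The import query in step 4 reads its id, subject, content, and last_update_time columns.)

2. Insert a few rows of test data.

3. Open solrconfig.xml in a text editor such as Notepad. It is in the solrhome folder: E:\solrhome\mycore\conf\solrconfig.xml

(For what the solrhome folder is, see http://www.cnblogs.com/HD/p/3977799.html)

Add this node: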

    <requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
        <lst name="defaults">
            <str name="config">data-config.xml</str>
        </lst>
    </requestHandler>

4. Create a data-config.xml file in the same directory as solrconfig.xml, with the following content:

<dataConfig>
    <dataSource type="JdbcDataSource"
              driver="com.mysql.jdbc.Driver"
              url="jdbc:mysql://localhost:3306/test"
              user="root"
              password="root" />
    <document>
        <entity name="solr_test" transformer="DateFormatTransformer"
            query="SELECT id, subject, content, last_update_time FROM solr_test WHERE id >= ${dataimporter.request.id}">
            <field column='last_update_time' dateTimeFormat='yyyy-MM-dd HH:mm:ss' />
        </entity>
    </document>
</dataConfig>

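Note: the query uses ${dataimporter.request.id}. This is a request parameter that is supplied later, when the import is run; only rows whose id is greater than or equal to its value are read.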
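If a request omits id, ${dataimporter.request.id} has no value and the WHERE clause in the query is incomplete. One possible way to guard against that, sketched here as an assumption and not taken from the original setup, is to give the parameter a default in the handler's defaults list in solrconfig.xml. The value 0 is only an illustrative choice.

```xml
<!-- solrconfig.xml: the /dataimport handler from step 3, plus an assumed default for "id" -->
<requestHandler name="/dataimport" class="org.apache.solr.handler.dataimport.DataImportHandler">
    <lst name="defaults">
        <str name="config">data-config.xml</str>
        <!-- Assumption: handler defaults are merged into the request parameters, -->
        <!-- so this value is used when the request URL does not pass id -->
        <str name="id">0</str>
    </lst>
</requestHandler>
```

A request that does pass id (for example id=1) still overrides the default.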

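5. Copy solr-dataimporthandler-4.10.0.jar and solr-dataimporthandler-extras-4.10.0.jar from the extracted Solr distribution into the WEB-INF\lib directory of the Solr webapp in Tomcat.

The MySQL JDBC driver jar, mysql-connector-java-5.1.7-bin.jar, has to be copied there as well.

(Alternatively, add lib nodes to solrconfig.xml and put the jars under solrhome; the jars then do not need to go into WEB-INF\lib.)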
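As a rough sketch of the lib-node alternative, the fragment below loads the jars by file name pattern. The two directories are assumptions chosen for illustration; adjust them to wherever the jars actually sit. A relative dir is resolved against the core's instance directory.

```xml
<!-- solrconfig.xml, inside <config>: load jars from disk instead of WEB-INF\lib -->
<!-- Both dir values are assumed locations, not paths from the original setup -->
<lib dir="../../dist/" regex="solr-dataimporthandler-.*\.jar" />
<lib dir="../lib/" regex="mysql-connector-java-.*\.jar" />
```

The regex form avoids hard-coding the jar version, so an upgraded jar with the same name prefix is picked up without editing solrconfig.xml.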

6. Open schema.xml in a text editor. It is in the same solrhome folder as in step 3. Its content is:

<?xml version="1.0" ?>
<schema name="my core" version="1.1">

    <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true"/>
    <fieldType name="long" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="date" class="solr.TrieDateField" precisionStep="0" positionIncrementGap="0"/>
    <fieldType name="text_cn" class="solr.TextField">
        <analyzer type="index" class="org.wltea.analyzer.lucene.IKAnalyzer" />
        <analyzer type="query" class="org.wltea.analyzer.lucene.IKAnalyzer" />
    </fieldType>
    
    <!-- general -->
    <field name="id" type="string" indexed="true" stored="true" multiValued="false" required="true"/>
    <field name="subject" type="text_cn" indexed="true" stored="true" />
    <field name="content" type="text_cn" indexed="true" stored="true" />
    <field name="last_update_time" type="date" indexed="true" stored="true" />
    <field name="_version_" type="long" indexed="true" stored="true"/>

     <!-- field to use to determine and enforce document uniqueness. -->
     <uniqueKey>id</uniqueKey>

     <!-- field for the QueryParser to use when an explicit fieldname is absent -->
     <defaultSearchField>subject</defaultSearchField>

     <!-- SolrQueryParser configuration: defaultOperator="AND|OR" -->
     <solrQueryParser defaultOperator="OR"/>
</schema>

7. Update the configuration stored in the ZooKeeper ensemble:

solrctl instancedir --update collection1 /opt/cloudera/parcels/CDH/lib/solr/solr_configs

8. Reload collection1 so that it picks up the new configuration:

http://localhost:8983/solr/admin/collections?action=RELOAD&name=collection1

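9. Open the Solr web admin UI and go to the core's Dataimport page.

Notes:

Enter id=1 in Custom Parameters; this is the parameter defined in step 4.

The Clean option controls whether the existing index is cleared before the import starts. The effect is that documents which exist in the Solr index but are absent from the database SELECT result are deleted.

You can also call this URL directly: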

http://localhost:8899/solr/mycore/dataimport?command=full-import&clean=true&commit=true&wt=json&indent=true&entity=solr_test&verbose=false&optimize=false&debug=false&id=1

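It returns the import result as JSON.

Once this is configured, requesting this URL is all it takes to import data and rebuild the index whenever needed. (It really is that simple.)

10. Test a query.

The dataimport handler also accepts a command that makes it reload data-config.xml: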

http://localhost:8899/solr/mycore/dataimport?command=reload-config

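If a row is added to the database but has not been indexed in Solr, it cannot be found. So when Solr is used to search database content, the usual flow is to insert the row into the database first and then index it in Solr; Solr's fuzzy queries or tokenized search are then used to search that content.

Incremental (delta) import from MySQL with DIH
The steps above cover a full import of MySQL data. A full import is very expensive when the data volume is large, so data is usually imported incrementally. The following describes how to run a delta import from MySQL and how to schedule it.

1) Changes to the database table

A User table was created earlier. To support delta import, add a column updateTime of type timestamp with the default value CURRENT_TIMESTAMP.

With this column, Solr can tell which rows are new when it runs a delta import.

This works because Solr keeps a value called last_index_time, which records the time of the last full import or delta import. The value is stored in the dataimport.properties file in the conf directory.

2) Required attributes in data-config.xml

transformer: format conversion. For example, HTMLStripTransformer strips HTML tags before indexing.

query: the SQL query that selects the rows for a full import.

deltaQuery: the delta query for primary key IDs. Note that it should return only the ID column.

deltaImportQuery: the query that fetches the data to import during a delta import.

deletedPkQuery: the query for the primary key IDs of deleted rows. Note that it should return only the ID column.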
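To illustrate the transformer attribute, here is a minimal sketch of an entity that strips HTML from one column. It reuses the solr_test table from the first part of this article; that the content column may contain HTML is an assumption made for the example.

```xml
<!-- data-config.xml: entity using HTMLStripTransformer (sketch) -->
<entity name="solr_test" transformer="HTMLStripTransformer"
        query="SELECT id, subject, content FROM solr_test">
    <!-- stripHTML="true" tells the transformer to process this column; -->
    <!-- columns without the attribute are left untouched -->
    <field column="content" stripHTML="true" />
</entity>
```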

For "query", "deltaImportQuery", and "deltaQuery", the official documentation gives the following explanation:
The query gives the data needed to populate fields of the Solr document in full-import
The deltaImportQuery gives the data needed to populate fields when running a delta-import
The deltaQuery gives the primary keys of the current entity which have changes since the last index time

If a child table has to be joined in, parentDeltaQuery may be needed:

The parentDeltaQuery uses the changed rows of the current table (fetched with deltaQuery) to give the changed rows in the parent table. This is necessary because whenever a row in the child table changes, we need to re-generate the document which has that field.
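A sketch of how this could look for the User table with a child table. The child table user_order and its columns user_id, item, and updateTime are hypothetical names introduced only for this example; the structure follows the pattern in the DataImportHandler documentation.

```xml
<!-- data-config.xml: parent entity with a nested child entity (sketch) -->
<entity name="user" pk="id"
        query="select * from user"
        deltaImportQuery="select * from user where id='${dih.delta.id}'"
        deltaQuery="select id from user where updateTime > '${dataimporter.last_index_time}'">
    <!-- Hypothetical child table: user_order(user_id, item, updateTime) -->
    <entity name="user_order" pk="user_id"
            query="select item from user_order where user_id='${user.id}'"
            deltaQuery="select user_id from user_order where updateTime > '${dataimporter.last_index_time}'"
            parentDeltaQuery="select id from user where id='${user_order.user_id}'" />
</entity>
```

When a user_order row changes, its deltaQuery reports the user_id, and parentDeltaQuery maps that back to the id of the user document that has to be rebuilt. Because one user can have several orders, the item field would likely need to be multiValued in schema.xml.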

See the DataImportHandler documentation for more details.

For the User table, data-config.xml looks like this:

<?xml version="1.0" encoding="UTF-8"?>
<dataConfig>
    <dataSource type="JdbcDataSource" driver="com.mysql.jdbc.Driver" url="jdbc:mysql://127.0.0.1:3306/mybatis" user="root" password="luxx" batchSize="-1" />
    <document name="testDoc">
        <entity name="user" pk="id"
                query="select * from user"
                deltaImportQuery="select * from user where id='${dih.delta.id}'"
                deltaQuery="select id from user where updateTime > '${dataimporter.last_index_time}'">
            <field column="id" name="id"/>
            <field column="userName" name="userName"/>
            <field column="userAge" name="userAge"/>
            <field column="userAddress" name="userAddress"/>
            <field column="updateTime" name="updateTime"/>
        </entity>
    </document>
</dataConfig>

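A delta import works by first running the SQL in deltaQuery to fetch the IDs of all rows that need to be imported.

The SQL in deltaImportQuery then returns the data for each of those IDs, which is the data this delta import processes.

The key idea is that the built-in variables ${dih.delta.id} and ${dataimporter.last_index_time} carry the id currently being indexed and the time of the most recent index run.

Note: the newly added updateTime column must also be listed as a field in the entity, and declared in schema.xml: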

<field name="updateTime" type="date" indexed="true" stored="true" />

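If the application also deletes rows, add an isDeleted column to the table to mark whether a row has been deleted. When Solr updates the index, it can use this column to remove the index entries of deleted rows.

In that case, set these attributes in data-config.xml: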

query="select * from user where isDeleted=0"
deltaImportQuery="select * from user where id='${dih.delta.id}'"
deltaQuery="select id from user where updateTime> '${dataimporter.last_index_time}' and isDeleted=0"
deletedPkQuery="select id from user where isDeleted=1"

With this in place, a delta import also removes from the index the documents whose rows have isDeleted=1 in the database.
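Put together, the user entity with soft-delete handling might look like the sketch below. It only combines the attributes listed above with the entity shown earlier; nothing else is changed.

```xml
<!-- data-config.xml: user entity with soft-delete support (sketch) -->
<entity name="user" pk="id"
        query="select * from user where isDeleted=0"
        deltaImportQuery="select * from user where id='${dih.delta.id}'"
        deltaQuery="select id from user where updateTime > '${dataimporter.last_index_time}' and isDeleted=0"
        deletedPkQuery="select id from user where isDeleted=1">
    <field column="id" name="id"/>
    <field column="userName" name="userName"/>
    <field column="userAge" name="userAge"/>
    <field column="userAddress" name="userAddress"/>
    <field column="updateTime" name="updateTime"/>
</entity>
```

As written, deletedPkQuery returns every soft-deleted id on every run. If updateTime is also refreshed when a row is marked deleted (an assumption about the table definition), the query could additionally filter on updateTime so that each run reports only recent deletions.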

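Testing the delta import

If the User table already contains data, clear the old test data first, because the newly added updateTime column has no value for those rows. Then add a User with my MyBatis test program; the database assigns the current time to the column. A match-all query in Solr does not return the new row yet. Run dataimport?command=delta-import, query again, and the row just inserted into MySQL is returned.

Scheduling the delta import

You can use Windows Task Scheduler or cron on Linux to request the delta-import URL periodically. That works and should cause no problems.

However, a more convenient approach, and one that integrates more closely with Solr, is to schedule the delta import inside the Solr webapp itself.

1. Download apache-solr-dataimportscheduler-1.0.jar and put it in the \solr-webapp\webapp\WEB-INF\lib directory:
Download: http://code.google.com/p/solr-dataimport-scheduler/downloads/list
It is also available from Baidu Pan: http://pan.baidu.com/s/1dDw0MRn

Note: apache-solr-dataimportscheduler-1.0.jar has a bug; see http://www.denghuafeng.com/post-242.html

2. Edit the web.xml file in Solr's WEB-INF directory:
Add a child element to the <web-app> element: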

    <listener>
        <listener-class>org.apache.solr.handler.dataimport.scheduler.ApplicationListener</listener-class>
    </listener>

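3. Create the configuration file dataimport.properties:

Create a directory named conf under SOLR_HOME\solr (note: this is not the conf directory under SOLR_HOME\solr\collection1). Open apache-solr-dataimportscheduler-1.0.jar with an archive tool, copy the dataimport.properties file inside it into the new directory, and edit it. My final scheduler configuration is: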

#################################################
#                                               #
#       dataimport scheduler properties         #
#                                               #
#################################################

#  to sync or not to sync
#  1 - active; anything else - inactive
syncEnabled=1

#  which cores to schedule
#  in a multi-core environment you can decide which cores you want synchronized
#  leave empty or comment it out if using single-core deployment
#  syncCores=game,resource
syncCores=collection1

#  solr server name or IP address
#  [defaults to localhost if empty]
server=localhost

#  solr server port
#  [defaults to 80 if empty]
port=8983

#  application name/context
#  [defaults to current ServletContextListener's context (app) name]
webapp=solr

#  URLparams [mandatory]
#  remainder of URL
#http://localhost:8983/solr/collection1/dataimport?command=delta-import&clean=false&commit=true
params=/dataimport?command=delta-import&clean=false&commit=true

#  schedule interval
#  number of minutes between two runs
#  [defaults to 30 if empty]
interval=1

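#  interval between full index rebuilds, in minutes; default 7200, i.e. 5 days
#  empty, 0, or commented out: never rebuild the index
# reBuildIndexInterval=2

#  parameters for the index rebuild
reBuildIndexParams=/dataimport?command=full-import&clean=true&commit=true

#  start time for the rebuild interval timer; first actual run = reBuildIndexBeginTime + reBuildIndexInterval*60*1000
#  two formats: 2012-04-11 03:10:00 or 03:10:00; the latter is completed with the date on which the service started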
reBuildIndexBeginTime=03:10:00

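For testing, a delta import runs every minute here, and the scheduled full-import rebuild is disabled.

4. Test

Insert a row into the database and query for it in Solr. At first it is not found; once Solr has run a delta import, it is.

In general, consider the following points before introducing Solr into a project:
1. Data update frequency: how much data is added per day, and whether updates are applied immediately or on a schedule
2. Total data volume: how long the data has to be kept
3. Consistency requirements: how soon updated data is expected to be visible, and the maximum acceptable delay
4. Data characteristics: which data sources are involved, and the average size of a record
5. Business characteristics: what sorting requirements and search conditions exist
6. Resource reuse: what the existing hardware looks like, and whether an upgrade is planned

References: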

http://wiki.apache.org/solr/DataImportHandler

http://wiki.apache.org/solr/Solrj

http://www.denghuafeng.com/post-242.html