Building a Basic Search Engine with Nutch and Solr (Single-Machine Edition)

Date: 2019-12-09

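Nutch[1] is an open-source search engine implemented in Java. It provides all the tools needed to run your own search engine, including full-text search and a web crawler.
Solr[2] is a full-text search server built on Lucene. It exposes a web-service-like query API and is an excellent full-text search engine.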

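Why integrate Nutch and Solr?

Put simply, Nutch concentrates on data collection (web crawling) and puts less weight on full-text search (Lucene). Solr extends Lucene and also serves as the full-text search extension for Nutch: its job is to take Nutch's crawl results and expose them through a search service.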

Part 1: Version selection

1. nutch-1.13

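It supports Hadoop, which gives it distributed crawling. This article focuses on Nutch's core capabilities; distributed crawling will be covered in a later installment. The nutch-2.x series also supports HBase, so you can choose according to your own needs. Note that the two lines are used differently and nutch-2.x is the more complex one, so it is best to have a grounding in nutch-1.x before using nutch-2.x.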

2. Solr-6.6.0

This is the latest version at the time of writing. See the official site for details; there is nothing more to add here.

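Part 2: Preparing the installation environment

1. System environment

Ubuntu 14.04 x64 or CentOS 6.5 x64. The applications are installed from binary packages, so no build environment is required.

2. Java environment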

vim /etc/profile
# set for java
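export JAVA_HOME=/opt/jdk1.8.0_111  # the binary package is already unpacked to this path
export PATH=$PATH:$JAVA_HOME/bin
export CLASSPATH=.:$CLASSPATH:$JAVA_HOME/lib
# JDK 8 ignores -XX:MaxPermSize (with a warning); it only has an effect on JDK 7 and earlier
export _JAVA_OPTIONS="-Xmx2048m -XX:MaxPermSize=512m -Djava.awt.headless=true"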

Note: installing Java from the binary package is sufficient, so this article does not go into it.

3. Nutch + Solr environment

vim /etc/profile
export NUTCH_RUNTIME_HOME='/opt/apache-nutch-1.13'
export PATH=$NUTCH_RUNTIME_HOME/bin:$PATH
export APACHE_SOLR_HOME='/opt/solr-6.6.0'    # the single quotes are required
export PATH=$APACHE_SOLR_HOME/bin:$PATH
export CLASSPATH=.:$CLASSPATH:$APACHE_SOLR_HOME/server/lib

source /etc/profile  # load the settings into the current environment

Part 3: Installing and configuring Solr

Installing Solr (binary package)

wget http://mirror.bit.edu.cn/apache/lucene/solr/6.6.0/solr-6.6.0.tgz
cat solr-6.6.0.tgz |(cd /opt; tar xzfp -)
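solr status  # note: if the result looks wrong, run `source /etc/profile` and check that file's contents
No Solr nodes are running.
# start the Solr service
solr start -force   # -force: forces running as root; do not use this flag in production
# stop the Solr service
solr stop

Installation complete.

Configuring Solr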

cd ${APACHE_SOLR_HOME}
cp -r server/solr/configsets/basic_configs server/solr/configsets/nutch
cp ${NUTCH_RUNTIME_HOME}/conf/schema.xml server/solr/configsets/nutch/conf
mv server/solr/configsets/nutch/conf/managed-schema server/solr/configsets/nutch/conf/managed-schema.backup
# start the Solr service
solr start
# create the nutch core
solr create -c nutch -d server/solr/configsets/nutch/conf/ -force  # -force: forces running as root; do not use this flag in production

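Creating the core is not smooth sailing: the process runs into a string of errors. From that angle, a production environment should switch to a stable Solr release. Fortunately, these pitfalls have already been worked through:

Problem 1: Caused by: Unknown parameters: {enablePositionIncrements=true}
Details: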
Copying configuration to new core instance directory:
/opt/solr-6.6.0/server/solr/nutch
Creating new core 'nutch' using command:
http://localhost:8983/solr/admin/cores?action=CREATE&name=nutch&instanceDir=nutch
ERROR: Error CREATEing SolrCore 'nutch': Unable to create core [nutch] Caused by: Unknown parameters: {enablePositionIncrements=true}
Fix:
vim server/solr/configsets/nutch/conf/schema.xml, then find and remove enablePositionIncrements=true

Problem 2: ERROR: Error CREATEing SolrCore 'nutch': Unable to create core [nutch] Caused by: defaultSearchField has been deprecated and is incompatible with configs with luceneMatchVersion >= 6.6.0.  Use 'df' on requests instead.
Fix:
vim server/solr/configsets/nutch/conf/solrconfig.xml, then change luceneMatchVersion to 6.2.0

Problem 3: org.apache.solr.common.SolrException: fieldType 'booleans' not found in the schema
Fix:
vim /opt/solr-6.6.0/server/solr/configsets/nutch/conf/solrconfig.xml
Find booleans and replace it with boolean, as follows:
<lst name="typeMapping">
    <str name="valueClass">java.lang.Boolean</str>
    <str name="fieldType">boolean</str>
</lst>
After that, core creation gets past this error.

After Problem 3, several similar errors follow, for example:
ERROR: Error CREATEing SolrCore 'nutch': Unable to create core [nutch] Caused by: fieldType 'tdates' not found in the schema
ERROR: Error CREATEing SolrCore 'nutch': Unable to create core [nutch] Caused by: fieldType 'tlongs' not found in the schema
ERROR: Error CREATEing SolrCore 'nutch': Unable to create core [nutch] Caused by: fieldType 'tdoubles' not found in the schema
Following the fix for Problem 3, change each quoted type name from its plural to its singular form in one pass.
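
The one-pass rename can be sketched as a small stream filter. This is only an illustration under assumptions: the function name `singularize_field_types` is hypothetical, and it assumes the four type names from the errors above appear inside `<str name="fieldType">` elements, as in the Problem 3 snippet.

```shell
#!/usr/bin/env bash
# Hypothetical helper: reads solrconfig.xml text on stdin and writes a copy
# in which the plural fieldType names from the errors above are singular.
singularize_field_types() {
  sed -E 's#(<str name="fieldType">)(boolean|tdate|tlong|tdouble)s(</str>)#\1\2\3#g'
}

# Example on a single line of input:
printf '%s\n' '<str name="fieldType">tdates</str>' | singularize_field_types
```

To apply it to the real file, one option is to write the result to a new file (for example `singularize_field_types < solrconfig.xml > solrconfig.xml.new`), review the difference, and only then replace the original.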

Problem 4: ERROR:
Core 'nutch' already exists!
Checked core existence using Core API command:
http://localhost:8983/solr/admin/cores?action=STATUS&core=nutch
Fix:
solr delete -c nutch       # delete the core 'nutch'
If the error persists after the deletion, it is because Solr needs a restart after each configuration change to refresh its state.

Final result:
solr create -c nutch -d server/solr/configsets/nutch/conf/ -force
Copying configuration to new core instance directory:
/opt/solr-6.6.0/server/solr/nutch

Creating new core 'nutch' using command:
http://localhost:8983/solr/admin/cores?action=CREATE&name=nutch&instanceDir=nutch

{
  "responseHeader":{
    "status":0,
    "QTime":3408},
  "core":"nutch"}
Success!

Open the following address in a browser:
http://localhost:8983/solr/#/
A core named nutch is now listed.

Solr security settings

1. realm.properties

cd /opt/solr-6.6.0/server
cat etc/realm.properties
#
# This file defines usernames, passwords and roles
#
# The format is as follows
#  <username>: <password>[,<rolename> ...]
#
#userName: password,role
yourname: yourpass,admin

2. solr-jetty-context.xml

cat contexts/solr-jetty-context.xml
<?xml version="1.0"?>
<!DOCTYPE Configure PUBLIC "-//Jetty//Configure//EN" "http://www.eclipse.org/jetty/configure_9_0.dtd">
<Configure class="org.eclipse.jetty.webapp.WebAppContext">
  <Set name="contextPath"><Property name="hostContext" default="/solr"/></Set>
  <Set name="war"><Property name="jetty.base"/>/solr-webapp/webapp</Set>
  <Set name="defaultsDescriptor"><Property name="jetty.base"/>/etc/webdefault.xml</Set>
  <Set name="extractWAR">false</Set>
  <Get name="securityHandler">
    <Set name="loginService">
        <New class="org.eclipse.jetty.security.HashLoginService">
            <Set name="name">Solr Admin Access</Set>
            <Set name="config"><SystemProperty name="jetty.home" default="."/>/etc/realm.properties</Set>
        </New>
    </Set>
  </Get>
</Configure>

3. WEB-INF/web.xml

vim solr-webapp/webapp/WEB-INF/web.xml
  <!-- Get rid of error message -->
  <security-constraint>
    ...
  </security-constraint>

  <security-constraint>
    <web-resource-collection>
      <web-resource-name>Solr authenticated application</web-resource-name>  <!-- description -->
      <url-pattern>/</url-pattern>  <!-- location of the pages that require authentication -->
    </web-resource-collection>
    <auth-constraint>
      <role-name>admin</role-name>  <!-- the role to authenticate, not a username; for several roles, add several role-name tags -->
    </auth-constraint>
  </security-constraint>
  <login-config>
    <auth-method>BASIC</auth-method>  <!-- the key setting -->
    <realm-name>Solr Admin Access</realm-name>
  </login-config>

</web-app>

Restart the Solr service
solr stop && solr start -force

Open Solr in a browser:
http://localhost:8983/solr/nutch/
A login prompt now appears.

Part 4: Installing and configuring Nutch

Installing Nutch (binary package)

wget http://mirrors.hust.edu.cn/apache/nutch/1.13/apache-nutch-1.13-bin.tar.gz
cat apache-nutch-1.13-bin.tar.gz |(cd /opt; tar xzfp -)
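nutch -help   # note: if the result looks wrong, run `source /etc/profile` and check that file's contents

Installation complete.

Configuring Nutch

Taking a crawl of the http://nutch.apache.org site as the example, configure as follows:

1. Configure nutch-site.xml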

vim $NUTCH_RUNTIME_HOME/conf/nutch-site.xml
<property>
 <name>http.agent.name</name>
 <value>My Nutch Spider</value>
</property>

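# configure the indexer-solr plugin
# my approach is to replace the indexer-elastic plugin with indexer-solr
sed -i 's/indexer-elastic/indexer-solr/g' $NUTCH_RUNTIME_HOME/conf/nutch-site.xml
# Note: the official documentation does not do it this way. Follow my approach, or comment out indexer-elastic; otherwise you will suffer for it. The pitfalls are described later.

2. Create the URL list

# to make the configuration files easy to edit, the conf folder is used as the data storage path
cd $NUTCH_RUNTIME_HOME/conf
mkdir -p urls       # holds the list of URLs to crawl, one URL per line; multiple lines are allowed
cd urls
echo 'http://nutch.apache.org/' > seed.txt  # the address can be a static link or a dynamic one

3. Set the URL regex matching rules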

vim regex-urlfilter.txt
Move the cursor to the end of the file and replace the following:
# accept anything else
+.
with:
# accept anything else
# this also matches URLs with a subdomain prefix, for example http://3w.nutch.apache.org
+^http://([a-z0-9]*\.)*nutch\.apache\.org/
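
Before starting a crawl, it can help to check which URLs an accept rule would let through. The sketch below approximates the rule with `grep -E`; Nutch itself evaluates the rule as a Java regular expression, so treat this as a rough check rather than the real filter. The names `RULE` and `url_allowed` are hypothetical, and the pattern is the rule above with the dots in the host name escaped.

```shell
#!/usr/bin/env bash
# Hypothetical helper: approximates the accept rule with grep -E.
# The pattern is the rule without its leading "+" (the "+" means "accept").
RULE='^http://([a-z0-9]*\.)*nutch\.apache\.org/'

url_allowed() {
  printf '%s\n' "$1" | grep -Eq "$RULE"
}

# Try a few sample URLs against the rule.
for url in 'http://nutch.apache.org/' 'http://3w.nutch.apache.org/docs' 'http://example.com/'; do
  if url_allowed "$url"; then
    echo "accept $url"
  else
    echo "reject $url"
  fi
done
```

The first two sample URLs are accepted and the third is rejected, which matches the intent of restricting the crawl to nutch.apache.org and its subdomains.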

4. Allow crawling of dynamic content

vim crawl-urlfilter.txt regex-urlfilter.txt

Replace:
  # skip URLs containing certain characters as probable queries, etc.
  -[?*!@=]
with:
  # accept URLs containing certain characters as probable queries, etc.
  +[?=&]

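Note: the conf directory holds various configuration files covering crawl rules and regex filters. They will be explained in detail in later articles.

Part 5: The crawl and search process

1. Components of a Nutch crawl

The crawler automatically creates a crawl directory under the directory the user specifies. Inside it you will find the crawldb, segments and linkdb subdirectories.

1. crawldb (crawl database)

The crawldb directory stores the crawled URLs together with their fetch dates and expiry times.

2. linkdb (link database)

The linkdb directory stores the link relationships between URLs. It is built during the analysis that follows downloading, and these relationships make it possible to implement something like Google's PageRank.

3. segments (a set of shards)

The segments directory stores the fetched pages, sharded by crawl level. That is, the number of subdirectories under segments corresponds to the crawl depth: if "-depth" is set to 10, this directory contains 10 subdirectories, which keeps the structure clear and prevents files from growing too large.
Each segment directory contains 6 subdirectories: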

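"crawl_generate": names a set of URLs to be fetched, i.e. the set of URLs awaiting download
"crawl_fetch": contains the fetch status of each URL
"content": contains the raw content retrieved from each URL
"parse_text": contains the parsed text of each URL
"parse_data": contains the outlinks and metadata parsed from each URL
"crawl_parse": contains the outlink URLs used to update crawldb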

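2. The Nutch crawl workflow

The crawl process consists of:

injector -> generator -> fetcher -> parseSegment -> updateCrawlDB -> Invert links -> Index -> DeleteDuplicates -> IndexMerger

1. Inject the URL set from the previously prepared URL list file into the crawldb database --- inject
2. Create a fetch list from the crawldb database --- generate
3. Run the fetch to retrieve the pages --- fetch
4. Run the parse to parse the page content --- parse
5. Update the database with the information from the fetched pages --- updatedb
6. Repeat steps 2 to 5 until the preset crawl depth is reached --- this loop is called the "generate/fetch/update" cycle
7. Update the linkdb database from the contents of the segments --- invertlinks
8. Build the index --- index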
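
The generate/fetch/parse/updatedb cycle in the list above can be sketched as a loop. This is a minimal illustration, not the author's script: the names `run`, `crawl_rounds` and `DRY_RUN` are hypothetical, and by default the sketch only prints the `nutch` commands instead of executing them.

```shell
#!/usr/bin/env bash
# Hypothetical wrapper: prints the command when DRY_RUN=1 (the default here),
# executes it when DRY_RUN=0.
run() {
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# crawl_rounds <crawl_dir> <rounds> <top_n>
crawl_rounds() {
  local crawl_dir="$1"
  local rounds="$2"
  local top_n="$3"
  local i=1
  local segment
  while [ "$i" -le "$rounds" ]; do
    run nutch generate "$crawl_dir/crawldb" "$crawl_dir/segments" -topN "$top_n"
    # The newest segment is the one generate just created. In dry-run mode
    # nothing exists yet, so a placeholder name is printed instead.
    segment=$(ls -d "$crawl_dir"/segments/2* 2>/dev/null | tail -n 1)
    segment=${segment:-"$crawl_dir/segments/<newest>"}
    run nutch fetch "$segment"
    run nutch parse "$segment"
    run nutch updatedb "$crawl_dir/crawldb" "$segment"
    i=$((i + 1))
  done
}

# Print the commands for two rounds limited to 1000 pages each.
crawl_rounds crawl 2 1000
```

Running it with `DRY_RUN=0` after the inject step would execute the same commands for real; invertlinks and index still have to be run once afterwards.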

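3. The Solr query flow

The user runs a query through the user interface
The user query is converted into a Solr query
The result set that satisfies the user's search is retrieved from the index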
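
From the command line, the same flow amounts to sending a request to the core's `select` handler. The sketch below only builds the endpoint URL; the helper name `solr_select_url` is hypothetical, and the `content` field in the usage comment is an assumption about the schema copied from Nutch.

```shell
#!/usr/bin/env bash
# Hypothetical helper: builds the select endpoint for a core.
# solr_select_url <base_url> <core>
solr_select_url() {
  local base="$1"
  local core="$2"
  # Strip one trailing slash from the base URL so the result has no "//".
  printf '%s/%s/select\n' "${base%/}" "$core"
}

solr_select_url 'http://localhost:8983/solr' 'nutch'

# Usage against a running, password-protected Solr (not executed here):
#   curl -G -u yourname:yourpass "$(solr_select_url http://localhost:8983/solr nutch)" \
#        --data-urlencode 'q=content:nutch' --data-urlencode 'rows=5' --data-urlencode 'wt=json'
```

Letting curl encode the parameters with `--data-urlencode` avoids hand-escaping the query string.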

Part 6: A crawl and search example

1. Inject

nutch inject crawl/crawldb urls

2. Generate

nutch generate crawl/crawldb crawl/segments

3. Fetching

s1=`ls -d crawl/segments/2* | tail -1`
echo $s1
nutch fetch $s1
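
The `ls -d ... | tail -1` idiom works because segment names are timestamps (for example 20170816191100), so lexical order is also chronological order. A small sketch of the same idea as a reusable function follows; the name `latest_segment` is hypothetical, and the demo runs against a throwaway directory so no crawl data is needed.

```shell
#!/usr/bin/env bash
# Hypothetical helper: prints the newest segment directory under a segments dir.
# latest_segment <segments_dir>
latest_segment() {
  ls -d "$1"/2* 2>/dev/null | sort | tail -n 1
}

# Demo with three fake segment directories in a temporary location.
demo=$(mktemp -d)
mkdir "$demo/20170816191100" "$demo/20170816191415" "$demo/20170816192100"
latest_segment "$demo"   # prints the path ending in 20170816192100
rm -rf "$demo"
```

In the crawl itself this would be used as `s1=$(latest_segment crawl/segments)`.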

4. Parse

nutch parse $s1

5. updatedb

nutch updatedb crawl/crawldb $s1

6. Crawl in a loop

Repeat steps 2-4 to fetch the next level of pages
To save time in this demo, we add a parameter that limits the fetch to the top 1000 pages

nutch generate crawl/crawldb crawl/segments -topN 1000
s2=`ls -d crawl/segments/2* | tail -1`
echo $s2
nutch fetch $s2
nutch parse $s2

updatedb:
nutch updatedb crawl/crawldb $s2

Repeat steps 2-4 to fetch the level after that
Again, only the top 1000 pages are fetched

nutch generate crawl/crawldb crawl/segments -topN 1000
s3=`ls -d crawl/segments/2* | tail -1`
echo $s3
nutch fetch $s3
nutch parse $s3

updatedb:
nutch updatedb crawl/crawldb $s3

In total we have now crawled three levels of depth

ls crawl/segments/
20170816191100  20170816191415  20170816192100

7. Invertlinks

nutch invertlinks crawl/linkdb -dir crawl/segments

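8. Indexing into Apache Solr (pushing to the Solr index)

nutch index -Dsolr.server.url=http://username:password@localhost:8983/solr/nutch crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/20170816191100/ -filter -normalize -deleteGone
# "username:password" here are the credentials configured in Solr's Jetty
It is worth noting that, following the official standard syntax, the command above would become: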
nutch index -Dsolr.auth.username="yourname" -Dsolr.auth.password="yourpassword" -Dsolr.server.url=http://localhost:8983/solr/nutch crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/20170816191100/ -filter -normalize -deleteGone
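This form reports a syntax error, and neither the official site nor Google offers a better solution yet. I have posted the method above to
http://lucene.472066.n3.nabble.com/Nutch-authentication-problem-to-solr-td4251336.html#a4351038
Other versions may not have this problem.
The final result of the push is: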
nutch index -Dsolr.server.url=http://xxx:xxx@localhost:8983/solr/nutch crawl/crawldb/ -linkdb crawl/linkdb/ crawl/segments/20170816191100/ -filter -normalize -deleteGone
Segment dir is complete: crawl/segments/20170816191100.
Indexer: starting at 2017-08-17 18:22:26
Indexer: deleting gone documents: true
Indexer: URL filtering: true
Indexer: URL normalizing: true
Active IndexWriters :
SOLRIndexWriter
    solr.server.url : URL of the SOLR instance
    solr.zookeeper.hosts : URL of the Zookeeper quorum
    solr.commit.size : buffer size when sending to SOLR (default 1000)
    solr.mapping.file : name of the mapping file for fields (default solrindex-mapping.xml)
    solr.auth : use authentication (default false)
    solr.auth.username : username for authentication
    solr.auth.password : password for authentication


Indexing 1/1 documents
Deleting 0 documents
Indexer: number of documents indexed, deleted, or skipped:
Indexer:      1  indexed (add/update)
Indexer: finished at 2017-08-17 18:22:32, elapsed: 00:00:05