1. Download and extract the required software
The versions are as follows:
(1) apache-nutch-2.3
(2) hadoop-1.2.1
(3) hbase-0.92.1
(4) solr-4.9.0
Extract them all to /opt/jediael.
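A minimal sketch of this step, assuming the stock release archives sit in the current directory (the exact archive file names may differ):
mkdir -p /opt/jediael
tar -zxvf apache-nutch-2.3-src.tar.gz -C /opt/jediael
tar -zxvf hadoop-1.2.1.tar.gz -C /opt/jediael
tar -zxvf hbase-0.92.1.tar.gz -C /opt/jediael
tar -zxvf solr-4.9.0.tgz -C /opt/jediael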
To get the latest development version of Nutch instead, check it out from SVN:
svn co https://svn.apache.org/repos/asf/nutch/branches/2.x
2. Install the Hadoop 1.2.1 cluster
See http://blog.csdn.net/jediael_lu/article/details/38926477
3. Install the HBase 0.92.1 cluster
See http://blog.csdn.net/jediael_lu/article/details/43086641
4. Configure Nutch
(1) vi /usr/search/apache-nutch-2.3/conf/nutch-site.xml
Add the following properties:
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.hbase.store.HBaseStore</value>
<description>Default class for storing data</description>
</property>
<property>
<name>http.agent.name</name>
<value>My Nutch Spider</value>
</property>
(2) vi /usr/search/apache-nutch-2.3/ivy/ivy.xml
Uncomment the following line (it is commented out by default):
<dependency org="org.apache.gora" name="gora-hbase" rev="0.5" conf="*->default" />
(gora-hbase 0.5 corresponds to HBase 0.94.12.)
Change the Hadoop version to match your cluster as needed:
<dependency org="org.apache.hadoop" name="hadoop-core" rev="1.2.1" conf="*->default」> <dependency org="org.apache.hadoop" name="hadoop-test" rev="1.2.1" conf="test->default」>
(3) vi /usr/search/apache-nutch-2.3/conf/gora.properties
Add the following line:
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
The three steps above configure Nutch to store its data in HBase.
(4) Adjust the URL filter as needed
vi /usr/search/apache-nutch-2.3/conf/regex-urlfilter.txt
Change:
# accept anything else
+.
to:
# accept anything else
+^http://([a-z0-9]*\.)*nutch.apache.org/
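To sanity-check the filter, you can pipe candidate URLs into the regex filter class via the nutch plugin command (the usage listing in step (7) documents "plugin load a plugin and run one of its classes main()"). A sketch, run after building the runtime in step (6) and assuming the plugin's usual RegexURLFilter main class, which echoes each input URL back prefixed with + (accepted) or - (rejected):
cd /usr/search/apache-nutch-2.3/runtime/local
echo "http://nutch.apache.org/" | bin/nutch plugin urlfilter-regex org.apache.nutch.urlfilter.regex.RegexURLFilter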
(5) Index more content
By default, only the fields belonging to the core and index-basic sections of schema.xml are indexed. To index more fields, enable additional indexing plugins as follows.
Edit nutch-default.xml and extend the plugin.includes value as shown below (or just add index-more):
<property>
<name>plugin.includes</name>
<value>protocol-http|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|urlnormalizer-(pass|regex|basic)|scoring-opic|index-anchor|index-more|languageidentifier|subcollection|feed|creativecommons|tld</value>
<description>Regular expression naming plugin directory names to
include. Any plugin not matching this expression is excluded.
In any case you need at least include the nutch-extensionpoints plugin. By
default Nutch includes crawling just HTML and plain text via HTTP,
and basic indexing and search plugins. In order to use HTTPS please enable
protocol-httpclient, but be aware of possible intermittent problems with the
underlying commons-httpclient library.
</description>
</property>
Alternatively, add a plugin.includes property to nutch-site.xml and copy the value above into it. Note that properties in nutch-site.xml override those in nutch-default.xml, so the original value must be copied over as well.
(6) Build the runtime
cd /usr/search/apache-nutch-2.3/
ant runtime
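If the build succeeds, ant generates two runtimes: runtime/local for standalone runs and runtime/deploy, which holds the job jar for running on Hadoop.
ls /usr/search/apache-nutch-2.3/runtime/
This should list deploy and local.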
(7) Verify the Nutch installation
# cd /usr/search/apache-nutch-2.3/runtime/local/bin/
# ./nutch
Usage: nutch COMMAND
where COMMAND is one of:
inject inject new urls into the database
hostinject creates or updates an existing host table from a text file
generate generate new batches to fetch from crawl db
fetch fetch URLs marked during generate
parse parse URLs marked during fetch
updatedb update web table after parsing
updatehostdb update host table after parsing
readdb read/dump records from page database
readhostdb display entries from the hostDB
elasticindex run the elasticsearch indexer
solrindex run the solr indexer on parsed batches
solrdedup remove duplicates from solr
parsechecker check the parser for a given url
indexchecker check the indexing filters for a given url
plugin load a plugin and run one of its classes main()
nutchserver run a (local) Nutch server on a user defined port
junit runs the given JUnit test
or
CLASSNAME run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
(8) Create the seed file
cd /usr/search/apache-nutch-2.3/runtime/deploy/bin/
vi seed.txt
http://nutch.apache.org/
hadoop fs -copyFromLocal seed.txt /
This puts seed.txt in the root directory of HDFS.
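To confirm the upload:
hadoop fs -ls /seed.txt
hadoop fs -cat /seed.txt
The second command should print http://nutch.apache.org/.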
(9) The following exception may appear while the crawl runs:
java.lang.RuntimeException: java.lang.ClassNotFoundException: org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat
The cause is unclear. To let the crawl continue, comment out the following lines in the crawl script:
#echo "SOLR dedup -> $SOLRURL"
#__bin_nutch solrdedup $commonOptions $SOLRURL
Adding export CLASSPATH=$CLASSPATH:..... to the script has no effect. Running in local mode, however, does not produce this error.
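As a workaround, the whole crawl can be run in local mode, where the seed list is read from the local filesystem instead of HDFS. A sketch, assuming urls/ is a local directory holding seed.txt:
cd /usr/search/apache-nutch-2.3/runtime/local
mkdir -p urls
cp /path/to/seed.txt urls/
bin/crawl urls/ TestCrawl http://localhost:8983/solr 2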
5. Configure Solr
(1) Overwrite Solr's schema.xml with the one shipped with Nutch. (For Solr 4, use schema-solr4.xml instead.)
cp /usr/search/apache-nutch-2.3/conf/schema.xml /usr/search/solr-4.9.0/example/solr/collection1/conf/
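For Solr 4, the parenthetical above means copying the Solr 4 variant into place instead, e.g.:
cp /usr/search/apache-nutch-2.3/conf/schema-solr4.xml /usr/search/solr-4.9.0/example/solr/collection1/conf/schema.xml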
(2) With Solr 3.6 the configuration would now be complete, but Solr 4.9 requires the following changes. [Recent versions no longer need this step.]
Edit the schema.xml file copied above:
Delete: <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />
Add: <field name="_version_" type="long" indexed="true" stored="true"/>
Alternatively, run Solr under Tomcat; see http://blog.csdn.net/jediael_lu/article/details/37908885.
6. Start the crawl job
(1) Start Hadoop
#start-all.sh
(2) Start HBase
# ./start-hbase.sh
(3) Start Solr
# cd /usr/search/solr-4.9.0/example/
# java -jar start.jar
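Before launching the crawl, it is worth confirming that Solr answers queries. A quick check, assuming the default collection1 core:
curl "http://localhost:8983/solr/collection1/select?q=*:*&rows=0"
An XML response with numFound="0" means the core is up but still empty.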
(4) Start Nutch and kick off the crawl
Copy seed.txt to the root directory of HDFS (see step (8) above).
# cd /usr/search/apache-nutch-2.3/runtime/deploy
# bin/crawl /seed.txt TestCrawl http://localhost:8983/solr 2
Done; the crawl job starts running.
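Once a round or two completes, you can verify that documents actually arrived. A sketch, assuming the crawl ID TestCrawl (Nutch 2.x names its HBase table <crawlId>_webpage) and the hbase binary on the PATH:
curl "http://localhost:8983/solr/collection1/select?q=*:*&rows=0"
echo "list" | hbase shell
The first command should now report numFound greater than 0; the second should show a TestCrawl_webpage table.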
7. Exceptions that may occur during installation
Exception 1: No active index writer.
Edit nutch-default.xml and add indexer-solr to plugin.includes.
Exception 2: ClassNotFoundException: org.apache.nutch.indexer.solr.SolrDeleteDuplicates$SolrInputFormat
In SolrDeleteDuplicates, after the line
Job job = new Job(getConf(), "solrdedup");
add the following code:
job.setJarByClass(SolrDeleteDuplicates.class);
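Because this is a change to the Nutch sources, rebuild the runtime so the patched class ends up in the job jar:
cd /usr/search/apache-nutch-2.3
ant runtime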
For some analysis of the process above, see:
Building a Search Engine with Nutch/HBase/Solr, Part 2: Content Analysis
http://blog.csdn.net/jediael_lu/article/details/37738569
Finally, to run the crawl on a schedule, wrap it in a script and register the script with cron:
$ vi /opt/jediael/apache-nutch-2.3/runtime/deploy/bin/myCrawl.sh
#!/bin/bash
export JAVA_HOME=/usr/java/jdk1.7.0_51
export PATH=$PATH:/opt/jediael/hadoop-1.2.1/bin/
/opt/jediael/apache-nutch-2.3/runtime/deploy/bin/crawl /seed.txt `date +%h%d%H` http://master:8983/solr/ 2
0 0,9,12,15,19,21 * * * bash /opt/jediael/apache-nutch-2.3/runtime/deploy/bin/myCrawl.sh >> ~/nutch.log
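Make the script executable and open the crontab editor:
chmod +x /opt/jediael/apache-nutch-2.3/runtime/deploy/bin/myCrawl.sh
crontab -e
Then add the crontab line shown above.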