1. Install Ant
Download and extract the Ant package:
tar -zxvf apache-ant-1.9.4-bin.tar.gz
mv apache-ant-1.9.4/ ant
Configure the environment variables:
vim .bash_profile
#apache-ant-1.9.4
export ANT_HOME=/nutch/ant
export ANT_BIN=/nutch/ant/bin
export PATH=$PATH:$ANT_HOME/bin
source .bash_profile
ant -version
Apache Ant(TM) version 1.9.4 compiled on April 29 2014
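If the profile is edited more than once, the export lines can end up duplicated. The following is a minimal sketch of an idempotent way to add them; `add_ant_env` is a hypothetical helper name, and the profile path and Ant directory are passed in as arguments.

```shell
#!/usr/bin/env bash
# Sketch: append the Ant variables to a profile file only if they are missing.
# add_ant_env is a hypothetical helper, not part of Ant.
add_ant_env() {
  local profile="$1" ant_home="$2"
  touch "$profile"
  # Do nothing if ANT_HOME is already exported in this file.
  if grep -q '^export ANT_HOME=' "$profile"; then
    return 0
  fi
  {
    echo "export ANT_HOME=$ant_home"
    # Single quotes keep $PATH and $ANT_HOME unexpanded in the profile.
    echo 'export PATH=$PATH:$ANT_HOME/bin'
  } >> "$profile"
}
```

For example, `add_ant_env "$HOME/.bash_profile" /nutch/ant` followed by `source ~/.bash_profile` should leave the file with a single pair of export lines, no matter how often it is run.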
Download and extract the Nutch package:
tar -zxvf apache-nutch-2.2.1-src.tar.gz
vim conf/nutch-site.xml
<property>
<name>storage.data.store.class</name>
<value>org.apache.gora.hbase.store.HBaseStore</value>
<description>Default class for storing data</description>
</property>
<property>
<name>http.agent.name</name>
<value>Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.95 Safari/537.36</value>
</property>
Edit ivy/ivy.xml, find the following dependency, and remove the comment markers around it:
<dependency org="org.apache.gora" name="gora-hbase" rev="0.5"
conf="*->default" />
<dependency org="org.apache.hadoop" name="hadoop-common"
rev="2.5.0" conf="*->default">
vim conf/gora.properties
gora.datastore.default=org.apache.gora.hbase.store.HBaseStore
cp /usr/local/hbase/conf/hbase-site.xml /usr/local/nutch/conf/
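The gora.properties edit above can also be scripted. This is a minimal sketch, assuming a simple `key=value` properties file; `set_prop` is a hypothetical helper that replaces any existing assignment of the key instead of appending a second one.

```shell
#!/usr/bin/env bash
# Sketch: set key=value in a properties file, replacing an existing assignment.
# set_prop is a hypothetical helper name.
set_prop() {
  local file="$1" key="$2" value="$3"
  touch "$file"
  # Keep every line that does not start with "key=", then append the new value.
  awk -v k="$key" 'index($0, k "=") != 1' "$file" > "$file.tmp"
  mv "$file.tmp" "$file"
  echo "$key=$value" >> "$file"
}
```

For example: `set_prop conf/gora.properties gora.datastore.default org.apache.gora.hbase.store.HBaseStore`. Commented-out lines are left untouched.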
ant runtime
## Nutch installation complete!
#####################################################
Integrate Solr
[root@localhost solr]# tar -zxvf solr-4.7.0.tgz
2.2 Start Jetty
Here we use the Jetty server bundled with Solr.
[root@localhost solr]# cd solr-4.7.0/example
[root@localhost example]# java -jar start.jar
2.3 Verify
In a browser, open http://192.168.4.150:8983/solr#/collection1/query and run a query, for example 中國人民解放軍
3. Configure the IK tokenizer for Solr
3.1 Download IK-Analyzer-2012
After extracting it, upload the three files IKAnalyzer.cfg.xml, IKAnalyzer2012_FF.jar, and stopword.dic
to the /usr/solr/solr-4.7.0/example/solr-webapp/webapp/WEB-INF/lib/ directory.
3.2 Edit the /usr/solr/solr-4.7.0/example/solr/collection1/conf/schema.xml configuration file
[root@localhost solr]# cd /usr/solr/solr-4.7.0/example/solr/collection1/conf/
[root@localhost solr]# vi schema.xml
Add the following inside <types>…</types>:
<fieldType name="text_ik" class="solr.TextField">
<analyzer type="index" isMaxWordLength="false" class="org.wltea.analyzer.lucene.IKAnalyzer"/>
<analyzer type="query" isMaxWordLength="true" class="org.wltea.analyzer.lucene.IKAnalyzer"/>
</fieldType>
3.3 Verify
Restart Solr, then open http://10.192.87.198:8983/solr/#/collection1/analysis and run a test analysis.
4.6 Integrate Solr
Edit /usr/solr/solr-4.7.0/example/solr/collection1/conf/schema.xml and add the following fields inside <fields>…</fields>:
<field name="host" type="string" stored="false" indexed="true"/>
<field name="digest" type="string" stored="true" indexed="false"/>
<field name="segment" type="string" stored="true" indexed="false"/>
<field name="boost" type="float" stored="true" indexed="false"/>
<field name="tstamp" type="date" stored="true" indexed="false"/>
<field name="anchor" type="string" stored="true" indexed="true" multiValued="true"/>
<field name="cache" type="string" stored="true" indexed="false"/>
Restart Solr and crawl again:
[root@localhost apache-nutch-1.7]# bin/nutch crawl urls -dir crawl -depth 2 -topN 5 -solr http://10.192.86.156:8983/solr
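4.7 View the results
In a browser, open http://10.192.86.156:8983/solr#/collection1/query and run a query.
Solr can also be deployed inside a Tomcat container; that is not covered here.
Build Nutch with Ant after changing to the Nutch directory.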
http://news.163.com/
http://www.gov.cn/
http://www.sbsm.gov.cn/
http://news.stnn.cc/china/
http://www.zaobao.com/wencui/social
http://www.xinhuanet.com/politics/1.htm
http://news.china.com.cn/shehui/node_7185045.htm
bin/nutch inject xinwen -crawlId ee
bin/nutch generate -topN 3 -crawlId ee
bin/nutch fetch -all -crawlId ee
bin/nutch parse -all -crawlId ee
bin/nutch updatedb -all -crawlId ee
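The generate/fetch/parse/updatedb steps above are normally repeated once per crawl round. The following is a minimal sketch of such a loop, with the commands copied from the list above. `crawl_rounds` and the `NUTCH_CMD` variable are hypothetical names; by default the function only prints the commands (a dry run), which makes it safe to try without a Nutch installation.

```shell
#!/usr/bin/env bash
# Sketch: inject once, then repeat generate/fetch/parse/updatedb for N rounds.
# NUTCH_CMD defaults to "echo bin/nutch", so the commands are only printed;
# set NUTCH_CMD=bin/nutch to actually run them.
crawl_rounds() {
  local seed_dir="$1" crawl_id="$2" rounds="$3" top_n="$4"
  local nutch="${NUTCH_CMD:-echo bin/nutch}"
  local i=1
  # $nutch is intentionally unquoted so "echo bin/nutch" splits into two words.
  $nutch inject "$seed_dir" -crawlId "$crawl_id" || return 1
  while [ "$i" -le "$rounds" ]; do
    $nutch generate -topN "$top_n" -crawlId "$crawl_id" || return 1
    $nutch fetch -all -crawlId "$crawl_id" || return 1
    $nutch parse -all -crawlId "$crawl_id" || return 1
    $nutch updatedb -all -crawlId "$crawl_id" || return 1
    i=$((i + 1))
  done
}
```

For example, `crawl_rounds xinwen ee 2 3` prints one inject command followed by two rounds of four commands each. Stopping at the first failing step avoids updating the database from a half-finished round.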
./crawl seed.txt fanti http://localhost:8983/solr 2
http://172.16.2.29:8983/solr#/collection2/query
bin/nutch readdb -dump /home/hadoop/get/20151210 -crawlId qq
+^http://([a-z0-9]*\.)*sohu.com
+^http://([a-z0-9]*\.)*.*
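The two lines above look like URL filter rules, where a leading `+` means "accept" and the rest is a regular expression. A quick way to check what a pattern accepts is to try it with `grep -E`. This is only a sketch: Nutch evaluates these rules with Java regular expressions, and `grep -E` is assumed to behave the same for patterns as simple as these. `matches_filter` is a hypothetical helper.

```shell
#!/usr/bin/env bash
# Sketch: test a URL against a filter pattern (without the leading "+").
# Returns 0 when the pattern matches, 1 when it does not.
matches_filter() {
  local pattern="$1" url="$2"
  printf '%s\n' "$url" | grep -Eq -- "$pattern"
}

pattern='^http://([a-z0-9]*\.)*sohu.com'
for url in 'http://news.sohu.com/' 'http://www.sina.com.cn/' 'http://sohuxcom.example/'; do
  if matches_filter "$pattern" "$url"; then
    echo "accept $url"
  else
    echo "reject $url"
  fi
done
```

Note that the unescaped dot in `sohu.com` matches any character, so `http://sohuxcom.example/` is accepted as well. Writing `sohu\.com` restricts the rule to the literal domain.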
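################################ Crawl updates ###############################
Edit nutch-site.xml and add the following configuration:
<!-- How long to wait before re-fetching a previously fetched page, in seconds. Default is 30 days -->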
<property>
<name>db.fetch.interval.default</name>
<value>420480000</value>
<description>The number of seconds between re-fetches of a page. The value above is
about 4867 days; the Nutch default is 2592000 (30 days).
</description>
</property>
<!-- After how long a refresh of the entire CrawlDB is forced -->
<property>
<name>db.fetch.interval.max</name>
<value>630720000</value>
<description>The maximum number of seconds between re-fetches of a page.
The value above is 7300 days; the Nutch default is 7776000 (90 days).
After this period every page in the db will be re-tried, no
matter what its status is.
</description>
</property>
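Both intervals are given in seconds, which is easy to get wrong by a few orders of magnitude. A small conversion helper makes the values easier to check; `days_to_seconds` and `seconds_to_days` are hypothetical names, and the division rounds down to whole days.

```shell
#!/usr/bin/env bash
# Sketch: convert between days and seconds for the fetch-interval settings.
days_to_seconds() { echo $(( $1 * 86400 )); }
seconds_to_days() { echo $(( $1 / 86400 )); }   # integer division, rounds down

days_to_seconds 30          # 2592000
days_to_seconds 90          # 7776000
seconds_to_days 420480000   # 4866
seconds_to_days 630720000   # 7300
```

So 30 days is 2592000 seconds and 90 days is 7776000 seconds, while the values configured above correspond to roughly 13 and 20 years, which effectively disables automatic re-fetching.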
##################### Emulating a browser #######################################
To emulate a particular browser, add just one of the following configurations to nutch-site.xml:
1. Emulate Firefox:
<property>
<name>http.agent.name</name>
<value>Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko</value>
</property>
<property>
<name>http.agent.version</name>
<value>20100101 Firefox/27.0</value>
</property>
2. Emulate IE:
<property>
<name>http.agent.name</name>
<value>Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident</value>
</property>
<property>
<name>http.agent.version</name>
<value>6.0)</value>
</property>
3. Emulate Chrome:
<property>
<name>http.agent.name</name>
<value>Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/33.0.1750.117 Safari</value>
</property>
<property>
<name>http.agent.version</name>
<value>537.36</value>
</property>
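The odd split points in the three configurations above make sense if Nutch builds the User-Agent header as `http.agent.name`, a slash, then `http.agent.version`. The sketch below reproduces that join under this assumption; `agent_string` is a hypothetical helper, and other agent properties (description, URL, email), which Nutch may also append, are ignored here.

```shell
#!/usr/bin/env bash
# Sketch: join agent name and version as "name/version" (assumed Nutch behaviour).
agent_string() {
  local name="$1" version="$2"
  if [ -n "$version" ]; then
    printf '%s/%s\n' "$name" "$version"
  else
    printf '%s\n' "$name"
  fi
}

agent_string 'Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko' '20100101 Firefox/27.0'
agent_string 'Mozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident' '6.0)'
```

The first call yields `Mozilla/5.0 (Windows NT 6.1; WOW64; rv:27.0) Gecko/20100101 Firefox/27.0`, and the second ends in `Trident/6.0)`, which is the shape of a real browser User-Agent string.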