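Part 1 of these notes on integrating nutch with 起點R3 covered how to add the index fields that nutch needs to 起點R3. Once those fields are in place, nutch can crawl one or more websites and send the content to the 起點R3 index with bin/nutch solrindex.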
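3. Installing and configuring nutch
1. Installing nutch
First download nutch 1.3 from http://www.apache.org/dist//nutch/apache-nutch-1.3-bin.zip and unpack it. nutch can run on linux, on windows, or inside eclipse.
Installing on linux is the simplest option. Upload the contents of the unpacked runtime/local directory to a directory on the linux machine, for example /opt/nutch1.3. Copy nutch-1.3.jar from /opt/nutch1.3/lib into /opt/nutch1.3 and rename it to nutch-1.3.job, then run chmod +x /opt/nutch1.3/bin/nutch so the launcher script is executable. A JDK is also required: set JAVA_HOME in profile and make sure the JDK's bin directory is on PATH. In /opt/nutch1.3, type bin/nutch; the following output appears: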
[root@test nutch-1.3]# bin/nutch
Usage: nutch [-core] COMMAND
where COMMAND is one of:
  crawl             one-step crawler for intranets
  readdb            read / dump crawl db
  convdb            convert crawl db from pre-0.9 format
  mergedb           merge crawldb-s, with optional filtering
  readlinkdb        read / dump link db
  inject            inject new urls into the database
  generate          generate new segments to fetch from crawl db
  freegen           generate new segments to fetch from text files
  fetch             fetch a segment's pages
  parse             parse a segment's pages
  readseg           read / dump segment data
  mergesegs         merge several segments, with optional filtering and slicing
  updatedb          update crawl db from segments after fetching
  invertlinks       create a linkdb from parsed segments
  mergelinkdb       merge linkdb-s, with optional filtering
  index             run the indexer on parsed segments and linkdb
  solrindex         run the solr indexer on parsed segments and linkdb
  merge             merge several segment indexes
  dedup             remove duplicates from a set of segment indexes
  solrdedup         remove duplicates from solr
  plugin            load a plugin and run one of its classes main()
  server            run a search server
 or
  CLASSNAME         run the class named CLASSNAME
Most commands print help when invoked w/o parameters.
This output means the installation succeeded. To run in hadoop mode, you also need to copy some hadoop run scripts from the web into the bin directory.
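The linux steps above can be collected into a small shell function. This is only a minimal sketch under stated assumptions: the function name install_nutch_local and its two arguments are hypothetical, the zip is assumed to be unpacked already, and the function makes the bin/nutch launcher itself executable.

```shell
#!/bin/sh
# Sketch of the linux install steps described above.
# install_nutch_local is a hypothetical helper name, not part of nutch.
#   $1: directory that contains the unpacked runtime/local tree
#   $2: target directory, e.g. /opt/nutch1.3
install_nutch_local() {
    src="$1/runtime/local"
    dest="$2"
    if [ ! -d "$src" ]; then
        echo "runtime/local not found under $1" >&2
        return 1
    fi
    mkdir -p "$dest" || return 1
    # Copy the contents of runtime/local into the target directory.
    cp -R "$src/." "$dest/" || return 1
    # Put a copy of the jar next to bin/ under the .job name.
    cp "$dest/lib/nutch-1.3.jar" "$dest/nutch-1.3.job" || return 1
    # Make the launcher script executable.
    chmod +x "$dest/bin/nutch"
}

# Example call (both paths are illustrative):
#   install_nutch_local /tmp/nutch-unpacked /opt/nutch1.3
```

Running bin/nutch from the target directory afterwards should print the usage text if the copy worked.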
On windows you must install cygwin, which emulates a linux runtime environment; download it from http://www.cygwin.org/cygwin/. Running nutch under cygwin needs the same configuration as on linux, including JAVA_HOME and PATH.
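Before running nutch on linux or under cygwin, it can help to confirm that the Java settings are visible to the shell. The following is a minimal sketch; check_java_env is a hypothetical helper name, and the checks only confirm that a java executable can be found, not which JDK version it is.

```shell
#!/bin/sh
# check_java_env is a hypothetical helper, not part of nutch or cygwin.
# It reports the first problem it finds with JAVA_HOME or PATH.
check_java_env() {
    if [ -z "${JAVA_HOME:-}" ]; then
        echo "JAVA_HOME is not set"
        return 1
    fi
    if [ ! -x "$JAVA_HOME/bin/java" ]; then
        echo "no java executable under $JAVA_HOME/bin"
        return 1
    fi
    if ! command -v java >/dev/null 2>&1; then
        echo "java is not on PATH"
        return 1
    fi
    echo "JAVA_HOME looks usable: $JAVA_HOME"
}

# Example call:
#   check_java_env
```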
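There are many guides online for importing nutch 1.3 into eclipse, but many of them are wrong. One important step is to put conf first in the build path order, as shown in the figure below:
Then create a Java Application run configuration whose main class is org.apache.nutch.crawl.Crawl, as shown in the figure below:
Set the corresponding arguments to: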
2. Configuring nutch-site.xml
Whether you run on linux, under cygwin, or inside eclipse, you first need to edit the nutch-site.xml file in conf and add the following:
<property>
  <name>http.agent.name</name>
  <value>nutch-1.3</value>
</property>
<property>
  <name>http.robots.agents</name>
  <value>nutch-1.3,*</value>
</property>
<property>
  <name>plugin.includes</name>
  <value>protocol-http|urlfilter-regex|parse-(html|tika|js|zip|swf|rss)|index-(basic|anchor|more)|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
</property>
In eclipse, you also need to add the following to nutch-site.xml:
<property>
  <name>plugin.folders</name>
  <value>./src/plugin</value>
  <description>Directories where nutch plugins are located. Each
  element may be a relative or absolute path. If absolute, it is used
  as is. If relative, it is searched for on the classpath.</description>
</property>
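After editing the file, a quick text search can confirm that the expected property names are present. This is a rough sketch only: check_site_props is a hypothetical helper name, and it does a plain string match rather than parsing XML, so a property that is commented out would still count as present.

```shell
#!/bin/sh
# check_site_props is a hypothetical helper, not a nutch command.
#   $1: path to nutch-site.xml
#   remaining arguments: property names that should appear in the file
check_site_props() {
    file="$1"
    shift
    missing=0
    for name in "$@"; do
        # -F treats the dots in the property name literally.
        if ! grep -qF "<name>$name</name>" "$file"; then
            echo "missing property: $name"
            missing=1
        fi
    done
    return "$missing"
}

# Example call:
#   check_site_props conf/nutch-site.xml http.agent.name http.robots.agents plugin.includes
```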
3. Configuring solrindex-mapping.xml
Next, edit the solrindex-mapping.xml file in nutch 1.3's conf directory to map nutch's index fields to the index fields defined in 起點R3. The content is as follows:
<?xml version="1.0" encoding="UTF-8"?>
<!--
 Licensed to the Apache Software Foundation (ASF) under one or more
 contributor license agreements.  See the NOTICE file distributed with
 this work for additional information regarding copyright ownership.
 The ASF licenses this file to You under the Apache License, Version 2.0
 (the "License"); you may not use this file except in compliance with
 the License.  You may obtain a copy of the License at

     http://www.apache.org/licenses/LICENSE-2.0

 Unless required by applicable law or agreed to in writing, software
 distributed under the License is distributed on an "AS IS" BASIS,
 WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
 See the License for the specific language governing permissions and
 limitations under the License.
-->
<mapping>
  <!-- Simple mapping of fields created by Nutch IndexingFilters
       to fields defined (and expected) in Solr schema.xml.

       Any fields in NutchDocument that match a name defined
       in field/@source will be renamed to the corresponding
       field/@dest.
       Additionally, if a field name (before mapping) matches
       a copyField/@source then its values will be copied to
       the corresponding copyField/@dest.

       uniqueKey has the same meaning as in Solr schema.xml
       and defaults to "id" if not defined.
  -->
  <fields>
    <field dest="title" source="title"/>
    <field dest="text" source="content"/>
    <field dest="lastModified" source="lastModified"/>
    <field dest="type" source="type"/>
    <field dest="site" source="site"/>
    <field dest="anchor" source="anchor"/>
    <field dest="host" source="host"/>
    <field dest="segment" source="segment"/>
    <field dest="boost" source="boost"/>
    <field dest="tstamp" source="tstamp"/>
    <field dest="url" source="url"/>
    <field dest="id" source="digest"/>
    <copyField source="digest" dest="digest"/>
  </fields>
  <uniqueKey>id</uniqueKey>
</mapping>
4. Configuring regex-urlfilter.txt
Edit the URL filter so that the site you want to crawl is not filtered out. For example, to crawl the Sina website, add one rule to regex-urlfilter.txt in nutch's conf directory:
+^http://www.sina
The full content of regex-urlfilter.txt is as follows:
# Licensed to the Apache Software Foundation (ASF) under one or more
# contributor license agreements.  See the NOTICE file distributed with
# this work for additional information regarding copyright ownership.
# The ASF licenses this file to You under the Apache License, Version 2.0
# (the "License"); you may not use this file except in compliance with
# the License.  You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

# The default url filter.
# Better for whole-internet crawling.

# Each non-comment, non-blank line contains a regular expression
# prefixed by '+' or '-'.  The first matching pattern in the file
# determines whether a URL is included or ignored.  If no pattern
# matches, the URL is ignored.

# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):

# skip image and other suffixes we can't yet parse
#-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|ppt|mpg|xls|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$
-\.(gif|GIF|jpg|JPG|png|PNG|ico|ICO|css|sit|eps|wmf|zip|mpg|gz|rpm|tgz|mov|MOV|exe|jpeg|JPEG|bmp|BMP)$

# skip URLs containing certain characters as probable queries, etc.
#-[?*!@=]

# skip URLs with slash-delimited segment that repeats 3+ times, to break loops
-.*(/[^/]+)/[^/]+\1/[^/]+\1/

# accept anything else
+^http://www.sina.
-.
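The comments in the filter file say that the first matching pattern decides whether a URL is kept, and that a URL matching no pattern is ignored. The sketch below imitates that rule with grep so you can try a URL against a rules file before crawling. It is an approximation, not nutch's own code: url_filter_decision is a hypothetical helper name, and grep -E is not the Java regex engine nutch uses, so patterns that rely on Java-only syntax or on backreferences may behave differently.

```shell
#!/bin/sh
# url_filter_decision is a hypothetical helper, not a nutch command.
#   $1: rules file, one '+' or '-' prefixed regex per line
#   $2: URL to test
# Prints "accept" or "reject" according to the first matching rule.
url_filter_decision() {
    rules="$1"
    url="$2"
    while IFS= read -r line || [ -n "$line" ]; do
        case "$line" in
            ''|'#'*) continue ;;      # skip blank lines and comments
        esac
        sign=${line%"${line#?}"}      # first character: + or -
        pattern=${line#?}             # rest of the line: the regex
        if printf '%s\n' "$url" | grep -Eq -- "$pattern"; then
            if [ "$sign" = "+" ]; then
                echo accept
            else
                echo reject
            fi
            return 0
        fi
    done < "$rules"
    echo reject                       # no rule matched: URL is ignored
}

# Example call (the rules path is illustrative):
#   url_filter_decision conf/rules-sample.txt http://www.sina.com.cn/
```

With the rules shown above, a Sina page URL reaches the + rule and is kept, while a URL on any other host falls through to the final -. rule and is dropped.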
5. In the nutch1.3 directory, create a url directory (at the same level as conf), then create a url.txt file inside it whose content is http://www.sina.com.cn .
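Creating the seed list can also be scripted. This is a minimal sketch; make_seed_list is a hypothetical helper name, and it overwrites any existing url/url.txt under the given directory.

```shell
#!/bin/sh
# make_seed_list is a hypothetical helper, not part of nutch.
#   $1: nutch install directory, e.g. /opt/nutch1.3
#   remaining arguments: seed URLs, written one per line to url/url.txt
make_seed_list() {
    nutch_home="$1"
    shift
    if [ "$#" -lt 1 ]; then
        echo "at least one seed URL is required" >&2
        return 1
    fi
    mkdir -p "$nutch_home/url" || return 1
    printf '%s\n' "$@" > "$nutch_home/url/url.txt"
}

# Example call:
#   make_seed_list /opt/nutch1.3 http://www.sina.com.cn
```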