The company recently needed to build a search engine, which is how I found Nutch under the Apache umbrella. After reading quite a few articles I set up a local instance for testing. Crawling an intranet works fairly well, but crawling the open Internet still has problems: pages from sites like Baidu and Google can barely be fetched at all. I am not sure whether this is a configuration issue or something else; if you know, please contact me. Thanks.
vmware 6.0
redhat 5.1
apache-tomcat-6.0.29.tar.gz
nutch-1.0.tar.gz
jdk-6u21-linux-i586.bin
Nutch overview
Nutch's crawler fetches pages in two ways. One is Intranet Crawling, aimed at corporate intranets or a small number of sites, driven by the crawl command. The other is Whole-web Crawling, aimed at the entire Internet, driven by the lower-level commands inject, generate, fetch and updatedb. This document covers the basic use of Intranet Crawling.
# cp jdk-6u21-linux-i586.bin /usr/java
# cd /usr/java
# chmod +x jdk-6u21-linux-i586.bin
# ./jdk-6u21-linux-i586.bin
# vi /etc/profile //add the following Java environment variables
JAVA_HOME=/usr/java/jdk1.6.0_21
export JAVA_HOME
PATH=$JAVA_HOME/bin:$PATH
export PATH
CLASSPATH=$JAVA_HOME/lib/tools.jar:$JAVA_HOME/lib/dt.jar:$CLASSPATH
export CLASSPATH
# source /etc/profile //make the Java environment variables take effect immediately
# java -version //verify the Java environment; if version information is printed, the JDK was installed correctly
# tar zxvf apache-tomcat-6.0.29.tar.gz -C /usr/local
# cd /usr/local/
# mv apache-tomcat-6.0.29 tomcat
# tar zxvf nutch-1.0.tar.gz -C /usr/local
# cd /usr/local
# mv nutch-1.0 nutch
# cd nutch
Add a NUTCH_JAVA_HOME variable and set it to the JDK installation directory:
NUTCH_JAVA_HOME=/usr/java/jdk1.6.0_21
export NUTCH_JAVA_HOME
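A small guard like the following can save time before running any nutch command. This is just a sketch; the JDK path is the one used in this walkthrough, so substitute your own:

```shell
# Fall back to this tutorial's JDK path if the variable is not already exported.
NUTCH_JAVA_HOME=${NUTCH_JAVA_HOME:-/usr/java/jdk1.6.0_21}
# Warn early instead of letting bin/nutch fail later with a vague error.
[ -d "$NUTCH_JAVA_HOME" ] || echo "warning: $NUTCH_JAVA_HOME not found"
echo "NUTCH_JAVA_HOME=$NUTCH_JAVA_HOME"
```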
Preparation before Nutch crawls a site
In the Nutch installation directory, create a text file named url.txt and put in it the top-level URLs of the sites to crawl, i.e. the start pages. Here we use some well-known domestic sites.
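For example (run from the Nutch install directory, /usr/local/nutch in this setup; the two sites below are placeholders, so list whichever start pages you actually want, one URL per line):

```shell
# Each line of url.txt is one seed URL for the crawl.
cat > url.txt <<'EOF'
http://www.163.com/
http://www.sina.com.cn/
EOF
```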
Edit the conf/crawl-urlfilter.txt file and modify the MY.DOMAIN.NAME part:
# accept hosts in MY.DOMAIN.NAME
+^http://([a-z0-9]*\.)*com/
+^http://([a-z0-9]*\.)*cn/
+^http://([a-z0-9]*\.)*net/
Getting dynamic pages crawled
Two files under conf need attention: regex-urlfilter.txt and crawl-urlfilter.txt. Both contain the rule:
# skip URLs containing certain characters as probable queries, etc.
-[?*!@=]
This rule skips any page whose URL contains one of the characters ? * ! @ =. Because dynamic pages almost always carry a ? in the URL, they cannot be crawled with the default settings. In both files, comment the rule out:
# skip URLs containing certain characters as probable queries, etc.
# -[?*!@=]
and add a rule that accepts such URLs:
# accept URLs containing certain characters as probable queries, etc.
+[?=&]
This allows URLs containing the characters ?, = and & to be fetched.
Note: both files must be modified, because Nutch loads the rules in the order crawl-urlfilter.txt -> regex-urlfilter.txt.
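The effect of the change can be sanity-checked outside Nutch with grep, whose bracket-expression syntax matches these rules closely (the URL below is a made-up example):

```shell
url='http://www.example.com/news.php?id=3&page=2'

# Default rule -[?*!@=]: the URL contains '?', so Nutch would drop it.
echo "$url" | grep -q '[?*!@=]' && echo "default rule: URL would be skipped"

# Modified rule +[?=&]: the same characters now mark the URL as accepted.
echo "$url" | grep -q '[?=&]' && echo "modified rule: URL is accepted"
```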
Edit the conf/nutch-site.xml file and add the following between the configuration tags:
<property>
<name>http.agent.name</name>
<value>sxit nutch agent</value>
</property>
<property>
<name>http.agent.version</name>
<value>1.0</value>
</property>
/usr/local/nutch/bin/nutch crawl /usr/local/nutch/url.txt -dir /usr/local/nutch/sxit -depth 3 -threads 4 >& /usr/local/nutch/crawl.log
After a while the program finishes. You will find a folder named sxit has been created under the nutch directory, along with a log file named crawl.log that can be used to analyse any errors encountered. In the command above, -dir specifies the directory in which the crawled content is stored, -depth is the crawl depth starting from the top-level URLs, and -threads is the number of concurrent fetcher threads.
Searching with Tomcat
Copy nutch-1.0.war from the nutch directory to tomcat/webapps and start Tomcat once, so that a nutch-1.0 folder is generated under webapps. Then open the nutch-site.xml file under nutch-1.0/WEB-INF/classes.
Since this is the latest version, delete the original contents of this configuration file and add the following:
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>
<!-- Put site-specific property overrides in this file. -->
<nutch-conf>
<property>
<name>searcher.dir</name>
<value>/usr/local/nutch/sxit</value> <!-- the directory where the crawled content was stored -->
</property>
</nutch-conf>
Type a keyword into the text box and you can search. In practice, searching for English words works fine, but searching for Chinese terms produces garbled characters. This is a Tomcat configuration issue; the fix is to edit the server.xml file under tomcat/conf and change its Connector section to the following:
<Connector port="8080" maxThreads="150" minSpareThreads="25" maxSpareThreads="75"
enableLookups="false" redirectPort="8443" acceptCount="100"
connectionTimeout="20000" disableUploadTimeout="true"
URIEncoding="UTF-8" useBodyEncodingForURI="true" />
# cd /usr/local/tomcat/webapps/nutch-1.0
# vi search.jsp
Find int hitsPerSite and change the value after the = to 0.
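The same edit as a one-liner, shown here against a sample line; in the deployed webapp you would run it with sed -i on search.jsp:

```shell
# Replace whatever number follows '=' with 0.
echo 'int hitsPerSite = 2;' \
  | sed 's/int hitsPerSite *= *[0-9][0-9]*/int hitsPerSite = 0/'
```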
Then append the following code to the end of this jsp file:
<table align="center">
<tr>
<td>
<%
if (start >= hitsPerPage) // there is a previous page to show
{
%>
<form name="pre" action="../search.jsp" method="get">
<input type="hidden" name="query" value="<%=htmlQueryString%>">
<input type="hidden" name="lang" value="<%=queryLang%>">
<input type="hidden" name="start" value="<%=start - hitsPerPage%>">
<input type="hidden" name="hitsPerPage" value="<%=hitsPerPage%>">
<input type="hidden" name="hitsPerSite" value="<%=hitsPerSite%>">
<input type="hidden" name="clustering" value="<%=clustering%>">
<input type="submit" value="上一頁">
</form>
<%} %>
<%
int startnum=1; // first page number shown; assumes a window of up to 10 page links, sliding once the current page passes 6
if((int)(start/hitsPerPage)>=5)
startnum=(int)(start/hitsPerPage)-4;
for(int i=hitsPerPage*(startnum-1),j=0;i<=hits.getTotal()&&j<=10;)
{
%>
<td>
<form name="next" action="../search.jsp" method="get">
<input type="hidden" name="query" value="<%=htmlQueryString%>">
<input type="hidden" name="lang" value="<%=queryLang%>">
<input type="hidden" name="start" value="<%=i%>">
<input type="hidden" name="hitsPerPage" value="<%=hitsPerPage%>">
<input type="hidden" name="hitsPerSite" value="<%=hitsPerSite%>">
<input type="hidden" name="clustering" value="<%=clustering%>">
<input type="submit" value="<%=i/hitsPerPage+1 %>">
</form>
</td>
<%
i=i+hitsPerPage; // advance one page of results (the original hard-coded 10, the default page size)
j++;
}
%>
<td>
<%
if ((hits.totalIsExact() && end < hits.getTotal()) // more hits to show
|| (!hits.totalIsExact() && (hits.getLength() > start
+ hitsPerPage))) {
%>
<form name="next" action="../search.jsp" method="get">
<input type="hidden" name="query" value="<%=htmlQueryString%>">
<input type="hidden" name="lang" value="<%=queryLang%>">
<input type="hidden" name="start" value="<%=end%>">
<input type="hidden" name="hitsPerPage" value="<%=hitsPerPage%>">
<input type="hidden" name="hitsPerSite" value="<%=hitsPerSite%>">
<input type="hidden" name="clustering" value="<%=clustering%>">
<input type="submit" value="<i18n:message key="next"/>"> <!-- next page -->
</form>
<%} %>
</td>
</tr>
</table>
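The window logic in the snippet above is easiest to see with concrete numbers; this shell sketch reproduces the startnum arithmetic for a page size of 10:

```shell
hitsPerPage=10
for start in 0 50 80; do                # offsets for pages 1, 6 and 9
  page=$(( start / hitsPerPage + 1 ))
  if [ $(( start / hitsPerPage )) -ge 5 ]; then
    startnum=$(( start / hitsPerPage - 4 ))   # slide the window past page 6
  else
    startnum=1                                # first block of page links
  fi
  echo "page=$page startnum=$startnum"
done
```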
Finally, the whole crawl-index-deploy cycle can be automated with a script (saved, for example, as /usr/local/nutch/runbot.sh):
#!/bin/bash
depth=5
threads=5
RMARGS="-rf"
MVARGS="--verbose"
safe=yes
NUTCH_HOME=/usr/local/nutch
CATALINA_HOME=/usr/local/tomcat
if [ -z "$NUTCH_HOME" ]
then
  echo "runbot: $0 could not find environment variable NUTCH_HOME"
  echo "runbot: NUTCH_HOME=$NUTCH_HOME has been set by the script"
else
  echo "runbot: $0 found environment variable NUTCH_HOME=$NUTCH_HOME"
fi
if [ -z "$CATALINA_HOME" ]
then
  echo "runbot: $0 could not find environment variable CATALINA_HOME"
  echo "runbot: CATALINA_HOME=$CATALINA_HOME has been set by the script"
else
  echo "runbot: $0 found environment variable CATALINA_HOME=$CATALINA_HOME"
fi
if [ -n "$topN" ]
then
  topN="-topN $topN"
else
  topN=""
fi
steps=8
echo "----- Inject (Step 1 of $steps) -----"
$NUTCH_HOME/bin/nutch inject $NUTCH_HOME/sxit/crawldb $NUTCH_HOME/url.txt
echo "----- Generate, Fetch, Parse, Update (Step 2 of $steps) -----"
for((i=0; i < $depth; i++))
do
  echo "--- Beginning crawl at depth `expr $i + 1` of $depth ---"
  $NUTCH_HOME/bin/nutch generate $NUTCH_HOME/sxit/crawldb $NUTCH_HOME/sxit/segments
  if [ $? -ne 0 ]
  then
    echo "runbot: Stopping at depth $depth. No more URLs to fetch."
    break
  fi
  segment=`ls -d $NUTCH_HOME/sxit/segments/* | tail -1`
  $NUTCH_HOME/bin/nutch fetch $segment -threads $threads
  if [ $? -ne 0 ]
  then
    echo "runbot: fetch $segment at depth `expr $i + 1` failed."
    echo "runbot: Deleting segment $segment."
    rm $RMARGS $segment
    continue
  fi
  $NUTCH_HOME/bin/nutch updatedb $NUTCH_HOME/sxit/crawldb $segment
done
echo "----- Merge Segments (Step 3 of $steps) -----"
$NUTCH_HOME/bin/nutch mergesegs $NUTCH_HOME/sxit/MERGEDsegments $NUTCH_HOME/sxit/segments/*
mv $MVARGS $NUTCH_HOME/sxit/segments $NUTCH_HOME/sxit/BACKUPsegments
mkdir $NUTCH_HOME/sxit/segments
mv $MVARGS $NUTCH_HOME/sxit/MERGEDsegments/* $NUTCH_HOME/sxit/segments
rm $RMARGS $NUTCH_HOME/sxit/MERGEDsegments
echo "----- Invert Links (Step 4 of $steps) -----"
$NUTCH_HOME/bin/nutch invertlinks $NUTCH_HOME/sxit/linkdb $NUTCH_HOME/sxit/segments/*
echo "----- Index (Step 5 of $steps) -----"
$NUTCH_HOME/bin/nutch index $NUTCH_HOME/sxit/NEWindexes $NUTCH_HOME/sxit/crawldb $NUTCH_HOME/sxit/linkdb $NUTCH_HOME/sxit/segments/*
echo "----- Dedup (Step 6 of $steps) -----"
$NUTCH_HOME/bin/nutch dedup $NUTCH_HOME/sxit/NEWindexes
echo "----- Merge Indexes (Step 7 of $steps) -----"
$NUTCH_HOME/bin/nutch merge $NUTCH_HOME/sxit/NEWindex $NUTCH_HOME/sxit/NEWindexes
echo "----- Loading New Index (Step 8 of $steps) -----"
tom_pid=`ps aux | awk '/usr\/local\/tomcat/ {print $2}'`
kill -9 $tom_pid
if [ "$safe" != "yes" ]
then
  rm $RMARGS $NUTCH_HOME/sxit/NEWindexes
  rm $RMARGS $NUTCH_HOME/sxit/index
else
  mv $MVARGS $NUTCH_HOME/sxit/NEWindexes $NUTCH_HOME/sxit/indexes
  mv $MVARGS $NUTCH_HOME/sxit/NEWindex $NUTCH_HOME/sxit/index
fi
${CATALINA_HOME}/bin/startup.sh
echo "runbot: FINISHED: Crawl completed!"
echo ""
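To keep the index fresh, the script can be driven from cron once it is executable. This crontab line is a sketch assuming the path above and a nightly run at 02:00:

```
0 2 * * * /usr/local/nutch/runbot.sh >> /usr/local/nutch/runbot.log 2>&1
```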