Deploying Nutch 1.4 on Ubuntu

Date: 2019-11-20


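I had been studying the Heritrix web crawler and Lucene, and was determined to build my graduation project on Heritrix + Lucene. Teaching myself was tiring, and I had no clear direction. I kept hoping to intern for a while at a company that works on search, but graduation is close and that hope is fading, so for now I just want exposure to as many new things as I can get.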

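I have now started learning Nutch 1.4. Few of the articles online cover 1.4, so I wrote this one in the hope that it helps people who want to learn about web crawlers and spares them the many detours I took. Enough preamble; on to the topic.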

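The Nutch site has a dedicated tutorial at http://wiki.apache.org/nutch/NutchTutorial, which I have translated here in the hope that it is useful to people who want to learn. The first step is installing Nutch. I won't cover that, since you can download it directly from the official site.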

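I also won't go into installing the JDK or configuring environment variables; there are plenty of examples online. After downloading Nutch 1.4, extract it, for example into /home/chenyanting/nutch, with the command: tar zxvf apache-nutch-1.4-bin.tar.gz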

Once it is extracted, go straight into /home/chenyanting/nutch/apache-nutch-1.4-bin/runtime/local
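The extraction step can be wrapped in a small helper. This is only a sketch: the function name extract_nutch and its two arguments (tarball path, destination directory) are my own naming, not something Nutch provides, and it assumes a tar that supports the z and -C options, as the one on Ubuntu does.

```shell
# Hypothetical helper: unpack a Nutch tarball into a destination directory.
# $1 = path to the .tar.gz, $2 = directory to extract into (created if missing).
extract_nutch() {
  tarball=$1
  dest=$2
  # Fail early with a readable message instead of a tar error.
  [ -f "$tarball" ] || { echo "tarball not found: $tarball" >&2; return 1; }
  mkdir -p "$dest" && tar zxf "$tarball" -C "$dest"
}
```

With the paths used in this article it would be called as: extract_nutch apache-nutch-1.4-bin.tar.gz /home/chenyanting/nutch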

In this directory, run ./bin/nutch. If output like the following does not appear:
Usage: nutch [-core] COMMAND
where COMMAND is one of:
crawl             one-step crawler for intranets
readdb            read / dump crawl db
mergedb           merge crawldb-s, with optional filtering
readlinkdb        read / dump link db
inject            inject new urls into the database
generate          generate new segments to fetch from crawl db
freegen           generate new segments to fetch from text files
fetch             fetch a segment's pages
parse             parse a segment's pages
readseg           read / dump segment data
mergesegs         merge several segments, with optional filtering and slicing
updatedb          update crawl db from segments after fetching
invertlinks       create a linkdb from parsed segments
mergelinkdb       merge linkdb-s, with optional filtering
solrindex         run the solr indexer on parsed segments and linkdb
solrdedup         remove duplicates from solr
solrclean         remove HTTP 301 and 404 documents from solr
parsechecker      check the parser for a given url
indexchecker      check the indexing filters for a given url
domainstats       calculate domain statistics from crawldb
webgraph          generate a web graph from existing segments
linkrank          run a link analysis program on the generated web graph
scoreupdater      updates the crawldb with linkrank scores
nodedumper        dumps the web graph's node scores
plugin            load a plugin and run one of its classes main()
junit             runs the given JUnit test
or
CLASSNAME         run the class named CLASSNAME
Most commands print help when invoked w/o parame

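then the execute permission on the runtime/local/bin/nutch script in the extracted Nutch directory needs to be fixed. From runtime/local, run: chmod 755 bin/nutch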
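A minimal sketch of the same permission fix as a check-then-fix step. The function name ensure_executable is hypothetical; it only uses the standard test and chmod commands.

```shell
# Hypothetical helper: make a script executable only if it is not already.
# $1 = path to the script, e.g. bin/nutch when run from runtime/local.
ensure_executable() {
  script=$1
  [ -f "$script" ] || { echo "no such file: $script" >&2; return 1; }
  # 755 = owner may read/write/execute, everyone else may read/execute.
  [ -x "$script" ] || chmod 755 "$script"
}
```

From runtime/local this would be used as: ensure_executable bin/nutch && ./bin/nutch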

Then set JAVA_HOME:

export JAVA_HOME='/path/to/java'
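If you are not sure where Java lives, one possible way to find a candidate is to resolve the java binary on the PATH and strip the trailing /bin/java. This is a sketch, not the tutorial's method: java_home_from_bin is a made-up helper name, and readlink -f is assumed to be the GNU coreutils version shipped with Ubuntu.

```shell
# Hypothetical helper: given the full path of a java binary,
# print the directory two levels up (i.e. without /bin/java).
java_home_from_bin() {
  dirname "$(dirname "$1")"
}

# Only attempt detection when a java binary is actually on the PATH.
if command -v java >/dev/null 2>&1; then
  # readlink -f follows symlinks such as /usr/bin/java to the real binary.
  JAVA_HOME=$(java_home_from_bin "$(readlink -f "$(command -v java)")")
  export JAVA_HOME
  echo "JAVA_HOME=$JAVA_HOME"
fi
```

Check the printed value before relying on it; depending on how Java was installed, the result may point at a jre subdirectory rather than the JDK root.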

Next, edit conf/nutch-site.xml under this directory and add the following property inside the <configuration> element:
<property>
 <name>http.agent.name</name>
 <value>My Nutch Spider</value>
</property>
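For a fresh install where conf/nutch-site.xml has no other custom settings, the whole file can be written in one step. This is a sketch under that assumption: write_nutch_site is a hypothetical helper, and it overwrites the target file, so do not use it if you have already added other properties.

```shell
# Hypothetical helper: write a minimal nutch-site.xml to the given path.
# WARNING: overwrites the file at $1.
write_nutch_site() {
  cat > "$1" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>My Nutch Spider</value>
  </property>
</configuration>
EOF
}
```

From runtime/local it would be called as: write_nutch_site conf/nutch-site.xml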

Create a directory to hold the URLs:

mkdir -p urls
cd urls
Inside it, create a file named seeds.txt
and add the addresses you want to crawl to it, for example:
```
http://nutch.apache.org/
```
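The directory and seed file can also be created non-interactively. A small sketch, run from runtime/local; the SEED_DIR variable is my own addition so the location can be changed, and it defaults to the urls directory used above.

```shell
# Directory that holds the seed list; defaults to ./urls.
SEED_DIR="${SEED_DIR:-urls}"
mkdir -p "$SEED_DIR"

# One URL per line; this overwrites any existing seeds.txt.
printf '%s\n' 'http://nutch.apache.org/' > "$SEED_DIR/seeds.txt"

# Show what was written.
cat "$SEED_DIR/seeds.txt"
```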
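Edit conf/regex-urlfilter.txt and replace the last line of the file with the following rule, which restricts the crawl to the nutch.apache.org domain:

+^http://([a-z0-9]*\.)*nutch.apache.org/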
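Before crawling, it can help to check which URLs the rule would let through. The sketch below tests the pattern (without the leading +, which is Nutch's marker for an accept rule) using grep -E. Nutch itself evaluates the rule with Java regular expressions, so grep is only an approximation, though for a pattern this simple the two should agree. The url_accepted function and the sample URLs are hypothetical.

```shell
# The regular expression from the filter rule, without the leading "+".
FILTER='^http://([a-z0-9]*\.)*nutch.apache.org/'

# Hypothetical helper: succeed if the URL in $1 matches the filter.
url_accepted() {
  printf '%s\n' "$1" | grep -Eq "$FILTER"
}

# Try a few sample URLs and report how each would be treated.
for url in 'http://nutch.apache.org/' \
           'http://www.nutch.apache.org/about.html' \
           'http://wiki.apache.org/nutch/'; do
  if url_accepted "$url"; then
    echo "accept $url"
  else
    echo "reject $url"
  fi
done
```

Note that the dots in nutch.apache.org are unescaped, so each one matches any single character; writing them as \. would make the rule stricter.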


Then go back to the local directory and run:
bin/nutch crawl urls -dir crawl -depth 3 -topN 5
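In that command, -dir names the directory the crawl data is written to, -depth is the link depth to follow from the seed pages, and -topN is the maximum number of pages fetched at each level. The sketch below puts those values into variables (names of my choosing), prints the resulting command, and runs it only when bin/nutch is present and executable. It then prints crawl database statistics with the readdb command listed in the usage output above.

```shell
CRAWL_DIR=crawl   # -dir: where the crawl data goes
DEPTH=3           # -depth: link depth from the seed pages
TOP_N=5           # -topN: max pages fetched per level

CRAWL_CMD="bin/nutch crawl urls -dir $CRAWL_DIR -depth $DEPTH -topN $TOP_N"
echo "$CRAWL_CMD"

# Run only from a directory that actually contains bin/nutch (runtime/local).
if [ -x bin/nutch ]; then
  # Word splitting of $CRAWL_CMD is intentional: it holds a full command line.
  $CRAWL_CMD && bin/nutch readdb "$CRAWL_DIR/crawldb" -stats
fi
```

Small values such as depth 3 and topN 5 keep a first test crawl short; raise them once the setup is known to work.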

This article is from the "Chen Yanxi" (陳硯羲) blog. Please contact the author before reposting.
