Nutch Study Notes (1): Environment Setup

Date: 2019-12-05

Tags: nutch, study notes, environment setup


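Environment: Ubuntu

Overview:

Nutch is an open-source search engine implemented in Java. It provides all the tools needed to run your own search engine, including full-text search and a web crawler.

The Nutch project gave rise to Hadoop, Tika, and Gora.

First install SVN and Ant. (Nutch is used here by building it from source.)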

apt-get install ant
apt-get install subversion
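
Before checking out the source, it can help to confirm that both tools are actually on the PATH. The snippet below is a minimal sketch; check_tools is a hypothetical helper name of my own, not something Nutch or Ubuntu provides.

```shell
# check_tools (hypothetical helper): report which commands are on PATH.
# Returns non-zero if at least one of them is missing.
check_tools() {
  missing=0
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      echo "found: $tool"
    else
      echo "missing: $tool"
      missing=1
    fi
  done
  return "$missing"
}

# The Subversion client binary is called svn.
check_tools ant svn || echo "install the missing tools before building"
```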

hu@hu-VirtualBox:~/data/nutch$ svn co https://svn.apache.org/repos/asf/nutch/tags/release-1.6/
hu@hu-VirtualBox:~/data/nutch$ cd release-1.6/
hu@hu-VirtualBox:~/data/nutch/release-1.6$ ant
hu@hu-VirtualBox:~/data/nutch/release-1.6$ cd runtime/

Note: the runtime directory contains two subdirectories, one for each of Nutch's two run modes. The deploy mode depends on Hadoop.
hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime$ ls
deploy  local

So how are Nutch and Hadoop connected?
hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime$ ls deploy/
apache-nutch-1.6.job  bin

Through the nutch script, which uses the hadoop command to submit apache-nutch-1.6.job to Hadoop's JobTracker.
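
The idea can be sketched roughly as follows. This is a simplified illustration, not the actual bin/nutch source: it assumes the script switches to deploy mode when it finds a .job file in its home directory and otherwise runs in a local JVM. The function name nutch_exec_call and the temporary demo directories are hypothetical; read bin/nutch itself for the real logic.

```shell
# nutch_exec_call (hypothetical): print the kind of command a launcher
# would use for the given Nutch home directory.
nutch_exec_call() {
  nutch_home=$1
  for job in "$nutch_home"/*nutch*.job; do
    if [ -f "$job" ]; then
      # deploy mode: hand the job file to Hadoop
      echo "hadoop jar $job"
      return 0
    fi
  done
  # local mode: run the class in a local JVM
  echo "java -classpath <local classpath>"
}

# Demo with throwaway directories that mimic runtime/deploy and runtime/local.
demo=$(mktemp -d)
mkdir -p "$demo/deploy" "$demo/local"
touch "$demo/deploy/apache-nutch-1.6.job"
nutch_exec_call "$demo/deploy"
nutch_exec_call "$demo/local"
```

With this layout, only the deploy directory produces a hadoop jar command, which is why deploy mode needs a working Hadoop installation and local mode does not.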

hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime$ cd local/
hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ mkdir urls
hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ touch urls/url.txt
hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ vi urls/url.txt
Note: enter the URL to crawl in urls/url.txt: http://blog.tianya.cn
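
The same seed file can also be created without opening an editor. A minimal sketch, run from the runtime/local directory and assuming one URL per line:

```shell
# Create the seed directory and write the start URL into the seed file.
mkdir -p urls
printf '%s\n' 'http://blog.tianya.cn' > urls/url.txt

# Show what the crawler will be given.
cat urls/url.txt
```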

hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ ./bin/nutch crawl
Usage: Crawl <urlDir> -solr <solrURL> [-dir d] [-threads n] [-depth i] [-topN N]
hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ nohup ./bin/nutch crawl urls -dir data -threads 100 -depth 3 &

Note: to see a summary of the run:
hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ cat nohup.out
For the details of the run, see logs/hadoop.log.

hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ ls logs/
hadoop.log

Checking nohup.out shows that an exception occurred:
hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ cat nohup.out
solrUrl is not set, indexing will be skipped...
crawl started in: data
rootUrlDir = urls
threads = 100
depth = 3
solrUrl=null
Injector: starting at 2013-12-08 21:10:30
Injector: crawlDb: data/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
solrUrl is not set, indexing will be skipped...
crawl started in: data
rootUrlDir = urls
threads = 100
depth = 3
solrUrl=null
Injector: starting at 2013-12-08 21:10:38
Injector: crawlDb: data/crawldb
Injector: urlDir: urls
Injector: Converting injected urls to crawl db entries.
Injector: total number of urls rejected by filters: 0
Injector: total number of urls injected after normalization and filtering: 1
Injector: Merging injected urls into crawl db.
Injector: finished at 2013-12-08 21:10:53, elapsed: 00:00:14
Generator: starting at 2013-12-08 21:10:53
Generator: Selecting best-scoring urls due for fetch.
Generator: filtering: true
Generator: normalizing: true
Generator: jobtracker is 'local', generating exactly one partition.
Generator: Partitioning selected urls for politeness.
Generator: segment: data/segments/20131208211101
Generator: finished at 2013-12-08 21:11:08, elapsed: 00:00:15
Fetcher: No agents listed in 'http.agent.name' property.
Exception in thread "main" java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property.
    at org.apache.nutch.fetcher.Fetcher.checkConfiguration(Fetcher.java:1389)
    at org.apache.nutch.fetcher.Fetcher.fetch(Fetcher.java:1274)
    at org.apache.nutch.crawl.Crawl.run(Crawl.java:136)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:65)
    at org.apache.nutch.crawl.Crawl.main(Crawl.java:55)
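
In a longer log the failure is easier to find with grep than by reading the whole file. A small sketch: first_exception is a hypothetical helper name, and the sample file just reuses three lines from the output above in place of the real nohup.out.

```shell
# first_exception (hypothetical): print the first line that mentions an
# exception, prefixed with its line number.
first_exception() {
  grep -n 'Exception' "$1" | head -n 1
}

# Sample log built from lines of the output shown above.
log=$(mktemp)
cat > "$log" <<'EOF'
Generator: finished at 2013-12-08 21:11:08, elapsed: 00:00:15
Fetcher: No agents listed in 'http.agent.name' property.
Exception in thread "main" java.lang.IllegalArgumentException: Fetcher: No agents listed in 'http.agent.name' property.
EOF

first_exception "$log"
```

Against the real files this would be first_exception nohup.out or first_exception logs/hadoop.log.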

[Solution]
hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ vi conf/nutch-site.xml
Open conf/nutch-site.xml and add the "http.agent.name" property to it. (conf/nutch-default.xml holds the default configuration.)
<configuration>
    <property>
        <name>http.agent.name</name>
        <value>Mozilla/5.0 (Windows NT 6.1; WOW64; rv:20.0; WUID=11ec69f3ac129124d5a2480d127648e0; WTB=2938) Gecko/20100101 Firefox/20.0</value>
        <description>HTTP 'User-Agent' request header. MUST NOT be empty -
        please set this to a single word uniquely related to your organization.

        NOTE: You should also check other related properties:

            http.robots.agents
            http.agent.description
            http.agent.url
            http.agent.email
            http.agent.version

        and set their values appropriately.
        </description>
    </property>
</configuration>
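
The description text asks for a single word, while the value above is a full browser User-Agent string. A shorter variant could look like the sketch below. MyNutchSpider is a placeholder name I made up, so pick one that identifies your own crawler; the temporary file stands in for conf/nutch-site.xml so that nothing real gets overwritten.

```shell
# Assumption: MyNutchSpider is a placeholder agent name, not a required value.
conf_file=$(mktemp)   # stand-in for conf/nutch-site.xml

cat > "$conf_file" <<'EOF'
<?xml version="1.0"?>
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>MyNutchSpider</value>
  </property>
</configuration>
EOF

# Confirm the property made it into the file.
grep -c '<name>http.agent.name</name>' "$conf_file"
```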

(If you edit the configuration file in the source tree instead, i.e. release-1.6/conf/nutch-site.xml, you need to rebuild with ant after changing the Nutch configuration.)

hu@hu-VirtualBox:~/data/nutch/release-1.6/runtime/local$ ls data/
crawldb linkdb segments

Next time: how to inspect the details of the crawled data.

Summary: the key to getting started with Nutch is to read through the nutch script.

References:

http://yangshangchuan.iteye.com/category/275433

http://www.oschina.net/translate/nutch-tutorial (Nutch tutorial)