Nutch 2: breaking down the crawl command, and the detailed process of crawling a web page

First, how can we see that crawl is just inject, generate, fetch, parse, and updatedb combined? (The exact meaning and function of each command will be explained in later articles.) Let's open NUTCH_HOME/runtime/local/bin/crawl.

The main parts of the script are pasted below:

# initial injection
echo "Injecting seed URLs"
__bin_nutch inject "$SEEDDIR" -crawlId "$CRAWL_ID"

# main loop : rounds of generate - fetch - parse - update
for ((a=1; a <= LIMIT ; a++))
do
...
echo "Generating a new fetchlist"
  generate_args=($commonOptions -topN $sizeFetchlist -noNorm -noFilter -adddays $addDays -crawlId "$CRAWL_ID" -batchId $batchId)
$bin/nutch generate "${generate_args[@]}"
...
echo "Fetching : "
  __bin_nutch fetch $commonOptions -D fetcher.timelimit.mins=$timeLimitFetch $batchId -crawlId "$CRAWL_ID" -threads 50
...
__bin_nutch parse $commonOptions $skipRecordsOptions $batchId -crawlId "$CRAWL_ID"
...
 __bin_nutch updatedb $commonOptions $batchId -crawlId "$CRAWL_ID"
...
echo "Indexing $CRAWL_ID on SOLR index -> $SOLRURL"
__bin_nutch index $commonOptions -D solr.server.url=$SOLRURL -all -crawlId "$CRAWL_ID"
...
echo "SOLR dedup -> $SOLRURL"
__bin_nutch solrdedup $commonOptions $SOLRURL
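
For reference, SEEDDIR, CRAWL_ID, SOLRURL, and LIMIT above are filled in from the script's positional arguments, so the whole pipeline can also be run in one shot. A typical invocation might look like this (the Solr URL and round count are illustrative; check the usage message of your bin/crawl version):

./bin/crawl urls/ 6vhao http://localhost:8983/solr/ 2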

Next, let's run the steps above by hand.

We will stay in the runtime/local/ directory the whole time.

1. inject

Of course, the seed file has to be prepared first: write the sites you want to crawl into the urls/url file. I will use http://www.6vhao.com as the example.

During the crawl I don't want it to fetch any site other than 6vhao.com; this can be done by adding the site to conf/regex-urlfilter.txt:


# accept anything else
+^http://www.6vhao.com/
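
For context, the stock regex-urlfilter.txt ends with a catch-all "+." rule that accepts everything, so restricting the crawl to one host means replacing that rule. A minimal sketch of the relevant tail of the file (the skip rule is abbreviated from the default file):

# skip file:, ftp:, and mailto: urls
-^(file|ftp|mailto):
# accept only pages under www.6vhao.com
+^http://www.6vhao.com/
# reject everything else (replaces the default catch-all "+.")
-.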

Start the crawl with the following command:

 ./bin/nutch inject urls/url -crawlId 6vhao

Using the list command in the HBase shell, you can see that a new table, 6vhao_webpage, has been created.

Run scan '6vhao_webpage' to view its contents:

ROW                                 COLUMN+CELL                                                                                            
 com.6vhao.www:http/                column=f:fi, timestamp=1446135434505, value=\x00'\x8D\x00                                              
 com.6vhao.www:http/                column=f:ts, timestamp=1446135434505, value=\x00\x00\x01P\xB4c\x86\xAA                                 
 com.6vhao.www:http/                column=mk:_injmrk_, timestamp=1446135434505, value=y                                                   
 com.6vhao.www:http/                column=mk:dist, timestamp=1446135434505, value=0                                                       
 com.6vhao.www:http/                column=mtdt:_csh_, timestamp=1446135434505, value=?\x80\x00\x00                                        
 com.6vhao.www:http/                column=s:s, timestamp=1446135434505, value=?\x80\x00\x00


You can see that one HBase row was created, with data across four column families (f, mk, mtdt, s); the exact meaning of each column will be covered later.
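
If a full scan is too noisy, the HBase shell can narrow the view. A few examples (the table is always named <crawlId>_webpage, and the row key is the reversed URL seen above):

hbase shell
list                                         # should include 6vhao_webpage
scan '6vhao_webpage', {COLUMNS => 'mk'}      # only the marker column family
get '6vhao_webpage', 'com.6vhao.www:http/'   # one row, by reversed-URL key
count '6vhao_webpage'                        # number of rows, i.e. known URLs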

2. generate

The usage of ./bin/nutch generate is:

    -topN <N>      - number of top URLs to be selected, default is Long.MAX_VALUE 
    -crawlId <id>  - the id to prefix the schemas to operate on, 
                    (default: storage.crawl.id)
    -noFilter      - do not activate the filter plugin to filter the url, default is true 
    -noNorm        - do not activate the normalizer plugin to normalize the url, default is true 
    -adddays       - Adds numDays to the current time to facilitate crawling urls already
                     fetched sooner than db.fetch.interval.default. Default value is 0.
    -batchId       - the batch id

We specify -crawlId as 6vhao, then scan the table again:

./bin/nutch generate -crawlId 6vhao

 com.6vhao.www:http/                column=f:bid, timestamp=1446135900858, value=1446135898-215760616                                      
 com.6vhao.www:http/                column=f:fi, timestamp=1446135434505, value=\x00'\x8D\x00                                              
 com.6vhao.www:http/                column=f:ts, timestamp=1446135434505, value=\x00\x00\x01P\xB4c\x86\xAA                                 
 com.6vhao.www:http/                column=mk:_gnmrk_, timestamp=1446135900858, value=1446135898-215760616                                 
 com.6vhao.www:http/                column=mk:_injmrk_, timestamp=1446135900858, value=y                                                   
 com.6vhao.www:http/                column=mk:dist, timestamp=1446135900858, value=0                                                       
 com.6vhao.www:http/                column=mtdt:_csh_, timestamp=1446135434505, value=?\x80\x00\x00                                        
 com.6vhao.www:http/                column=s:s, timestamp=1446135434505, value=?\x80\x00\x00

Comparing with the earlier scan, two new columns have appeared: f:bid (the batch id assigned by generate) and mk:_gnmrk_ (the generate marker).
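
The value of f:bid can also be passed to the later stages in place of -all, so that each round only touches the pages generated in that round. A sketch using the batch id from the scan above:

./bin/nutch fetch 1446135898-215760616 -crawlId 6vhao -threads 8
./bin/nutch parse 1446135898-215760616 -crawlId 6vhao
./bin/nutch updatedb 1446135898-215760616 -crawlId 6vhao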

3. fetch

Usage: FetcherJob (<batchId> | -all) [-crawlId <id>] [-threads N] 
                  [-resume] [-numTasks N]
    <batchId>     - crawl identifier returned by Generator, or -all for all 
                    generated batchId-s
    -crawlId <id> - the id to prefix the schemas to operate on, 
                    (default: storage.crawl.id)
    -threads N    - number of fetching threads per task
    -resume       - resume interrupted job
    -numTasks N   - if N > 0 then use this many reduce tasks for fetching 
                    (default: mapred.map.tasks)

./bin/nutch fetch -all -crawlId 6vhao -threads 8

This produces a lot of data; essentially the entire content of each fetched page is stored. Have a look in HBase yourself.
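
As a quick sanity check you can scan just a couple of columns instead of whole rows. The qualifier names below follow the default gora-hbase-mapping.xml in Nutch 2.x (f:st should be the fetch status and f:typ the content type); verify them against your mapping file:

scan '6vhao_webpage', {COLUMNS => ['f:st', 'f:typ'], LIMIT => 5}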

4. parse

Usage: ParserJob (<batchId> | -all) [-crawlId <id>] [-resume] [-force]
    <batchId>     - symbolic batch ID created by Generator
    -crawlId <id> - the id to prefix the schemas to operate on, 
                    (default: storage.crawl.id)
    -all          - consider pages from all crawl jobs
    -resume       - resume a previous incomplete job
    -force        - force re-parsing even if a page is already parsed

 ./bin/nutch parse -crawlId 6vhao -all 

The parse results can be viewed in HBase.
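
A hedged sketch for inspecting them: in the default gora-hbase-mapping.xml the parsed fields live in the p column family (p:t should hold the page title) and discovered outlinks in the ol family; verify the qualifiers against your mapping file:

scan '6vhao_webpage', {COLUMNS => 'p:t', LIMIT => 5}   # parsed page titles
scan '6vhao_webpage', {COLUMNS => 'ol', LIMIT => 1}    # outlinks found by parse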

5. updatedb

Usage: DbUpdaterJob (<batchId> | -all) [-crawlId <id>]
    <batchId>     - crawl identifier returned by Generator, or -all for all 
                    generated batchId-s
    -crawlId <id> - the id to prefix the schemas to operate on, 
                    (default: storage.crawl.id)

./bin/nutch updatedb -all -crawlId 6vhao

The results can be viewed in HBase: updatedb takes the outlinks discovered during parsing and inserts them as new rows, which become candidates for the next generate round.
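
One way to see the effect is to scan the mk:dist column (the distance-from-seed marker that was 0 in the inject output); if my reading of DbUpdaterJob is right, rows added for outlinks should show 1:

scan '6vhao_webpage', {COLUMNS => 'mk:dist', LIMIT => 10}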

6. Repeat steps 2 through 5; each repetition crawls one level deeper, so one more round covers the site to a depth of 2.
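
Putting it all together, a minimal manual loop equivalent to two rounds of the crawl script (no error handling, and -all used for simplicity):

for i in 1 2; do
  ./bin/nutch generate -crawlId 6vhao
  ./bin/nutch fetch -all -crawlId 6vhao -threads 8
  ./bin/nutch parse -all -crawlId 6vhao
  ./bin/nutch updatedb -all -crawlId 6vhao
done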

solrindex will be covered in the next section...
