First, how do we know that crawl is an integration of inject, generate, fetch, parse, and update? (The exact meaning and function of each command will be explained in later articles.) Let's open NUTCH_HOME/runtime/local/bin/crawl.
The main code is pasted below:
```shell
# initial injection
echo "Injecting seed URLs"
__bin_nutch inject "$SEEDDIR" -crawlId "$CRAWL_ID"

# main loop : rounds of generate - fetch - parse - update
for ((a=1; a <= LIMIT ; a++))
do
  ...
  echo "Generating a new fetchlist"
  generate_args=($commonOptions -topN $sizeFetchlist -noNorm -noFilter -adddays $addDays -crawlId "$CRAWL_ID" -batchId $batchId)
  $bin/nutch generate "${generate_args[@]}"
  ...
  echo "Fetching : "
  __bin_nutch fetch $commonOptions -D fetcher.timelimit.mins=$timeLimitFetch $batchId -crawlId "$CRAWL_ID" -threads 50
  ...
  __bin_nutch parse $commonOptions $skipRecordsOptions $batchId -crawlId "$CRAWL_ID"
  ...
  __bin_nutch updatedb $commonOptions $batchId -crawlId "$CRAWL_ID"
  ...
  echo "Indexing $CRAWL_ID on SOLR index -> $SOLRURL"
  __bin_nutch index $commonOptions -D solr.server.url=$SOLRURL -all -crawlId "$CRAWL_ID"
  ...
  echo "SOLR dedup -> $SOLRURL"
  __bin_nutch solrdedup $commonOptions $SOLRURL
done
```
Next, let's execute the above steps manually.
We will stay in the runtime/local/ directory throughout.
1. inject
Of course, the seed file has to be written first: put the sites you want to crawl into the urls/url file. I will use http://www.6vhao.com as an example.
During the crawl I don't want it to fetch any site other than 6vhao.com; this can be configured in the conf/regex-urlfilter.txt file:
```
# accept anything else
+^http://www.6vhao.com/
```
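The seed-file and filter setup above can be scripted. Here is a minimal sketch, run from runtime/local/; it recreates a tiny conf/regex-urlfilter.txt just to demonstrate the edit (in a real Nutch install the file already exists with many more rules, so only the sed line applies):

```shell
# Create the seed directory and seed file with the site to crawl
mkdir -p urls conf
echo "http://www.6vhao.com/" > urls/url

# A stock regex-urlfilter.txt ends with a catch-all "accept anything else" rule.
# We fabricate a minimal one here for illustration:
cat > conf/regex-urlfilter.txt <<'EOF'
# skip file: ftp: and mailto: urls
-^(file|ftp|mailto):
# accept anything else
+.
EOF

# Replace the catch-all "+." with a rule that only accepts 6vhao.com:
sed -i 's|^+\.$|+^http://www.6vhao.com/|' conf/regex-urlfilter.txt
```

After this, every URL the crawler discovers is tested against the rules top to bottom, and only URLs starting with http://www.6vhao.com/ are accepted.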
Run the inject step with the following command:
```shell
./bin/nutch inject urls/url -crawlId 6vhao
```
Using the list command in the hbase shell, we can see that a new table, 6vhao_webpage, has been generated.
scan '6vhao_webpage' shows its contents:
```
ROW                  COLUMN+CELL
com.6vhao.www:http/  column=f:fi,        timestamp=1446135434505, value=\x00'\x8D\x00
com.6vhao.www:http/  column=f:ts,        timestamp=1446135434505, value=\x00\x00\x01P\xB4c\x86\xAA
com.6vhao.www:http/  column=mk:_injmrk_, timestamp=1446135434505, value=y
com.6vhao.www:http/  column=mk:dist,     timestamp=1446135434505, value=0
com.6vhao.www:http/  column=mtdt:_csh_,  timestamp=1446135434505, value=?\x80\x00\x00
com.6vhao.www:http/  column=s:s,         timestamp=1446135434505, value=?\x80\x00\x00
```
You can see that one HBase row was generated, with data in four column families (f, mk, mtdt, s); their exact meaning will be covered later.
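Notice the row key com.6vhao.www:http/: it is the seed URL with its host name reversed and the scheme appended, so pages from the same domain sort next to each other in HBase. A quick sketch of that transformation (my own illustration in awk, not Nutch's actual code):

```shell
# Turn http://www.6vhao.com/ into the reversed-host row key seen above.
# Split out the host, reverse its dot-separated parts, append ":<scheme>/".
rowkey=$(echo "http://www.6vhao.com/" | awk -F/ '{
  n = split($3, h, ".")          # h[1..n] = host parts (www, 6vhao, com)
  key = h[n]
  for (i = n - 1; i >= 1; i--) key = key "." h[i]
  sub(/:$/, "", $1)              # $1 is "http:" -> "http"
  print key ":" $1 "/"
}')
echo "$rowkey"   # com.6vhao.www:http/
```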
2. generate
The command is ./bin/nutch generate, with the following options:
```
-topN <N>     - number of top URLs to be selected, default is Long.MAX_VALUE
-crawlId <id> - the id to prefix the schemas to operate on (default: storage.crawl.id)
-noFilter     - do not activate the filter plugin to filter the url, default is true
-noNorm       - do not activate the normalizer plugin to normalize the url, default is true
-adddays      - Adds numDays to the current time to facilitate crawling urls already
                fetched sooner than db.fetch.interval.default. Default value is 0.
-batchId      - the batch id
```
We specify -crawlId as 6vhao:
```shell
./bin/nutch generate -crawlId 6vhao
```
```
com.6vhao.www:http/  column=f:bid,       timestamp=1446135900858, value=1446135898-215760616
com.6vhao.www:http/  column=f:fi,        timestamp=1446135434505, value=\x00'\x8D\x00
com.6vhao.www:http/  column=f:ts,        timestamp=1446135434505, value=\x00\x00\x01P\xB4c\x86\xAA
com.6vhao.www:http/  column=mk:_gnmrk_,  timestamp=1446135900858, value=1446135898-215760616
com.6vhao.www:http/  column=mk:_injmrk_, timestamp=1446135900858, value=y
com.6vhao.www:http/  column=mk:dist,     timestamp=1446135900858, value=0
com.6vhao.www:http/  column=mtdt:_csh_,  timestamp=1446135434505, value=?\x80\x00\x00
com.6vhao.www:http/  column=s:s,         timestamp=1446135434505, value=?\x80\x00\x00
```
Comparing with the previous scan, two new columns have appeared: f:bid and mk:_gnmrk_.
3. fetch
```
Usage: FetcherJob (<batchId> | -all) [-crawlId <id>] [-threads N] [-resume] [-numTasks N]
  <batchId>     - crawl identifier returned by Generator, or -all for all generated batchId-s
  -crawlId <id> - the id to prefix the schemas to operate on (default: storage.crawl.id)
  -threads N    - number of fetching threads per task
  -resume       - resume interrupted job
  -numTasks N   - if N > 0 then use this many reduce tasks for fetching (default: mapred.map.tasks)
```
```shell
./bin/nutch fetch -all -crawlId 6vhao -threads 8
```
There is quite a lot of data now; essentially the full content of each page is stored. Check it yourself in HBase.
4. parse
```
Usage: ParserJob (<batchId> | -all) [-crawlId <id>] [-resume] [-force]
  <batchId>     - symbolic batch ID created by Generator
  -crawlId <id> - the id to prefix the schemas to operate on (default: storage.crawl.id)
  -all          - consider pages from all crawl jobs
  -resume       - resume a previous incomplete job
  -force        - force re-parsing even if a page is already parsed
```
```shell
./bin/nutch parse -crawlId 6vhao -all
```
The parse results can be viewed in HBase.
5. update
```
Usage: DbUpdaterJob (<batchId> | -all) [-crawlId <id>]
  <batchId>     - crawl identifier returned by Generator, or -all for all generated batchId-s
  -crawlId <id> - the id to prefix the schemas to operate on (default: storage.crawl.id)
```
```shell
./bin/nutch updatedb -all -crawlId 6vhao
```
The results can again be viewed in HBase.
6. Repeat steps 2-5; each round goes one level deeper, so two rounds crawl the site to a depth of 2.
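Step 6 can be scripted just like the crawl script does at the top of this article. Here is a dry-run sketch of the round loop: NUTCH is set to echo the command lines instead of running them, so you can see the rounds without a live Nutch/HBase setup (set NUTCH=./bin/nutch for a real run):

```shell
# Two rounds of generate -> fetch -> parse -> updatedb, as in steps 2-5.
NUTCH="echo ./bin/nutch"   # dry run; use NUTCH=./bin/nutch for real
CRAWL_ID=6vhao
DEPTH=2

for ((round = 1; round <= DEPTH; round++)); do
  echo "--- round $round ---"
  $NUTCH generate -crawlId "$CRAWL_ID"
  $NUTCH fetch -all -crawlId "$CRAWL_ID" -threads 8
  $NUTCH parse -crawlId "$CRAWL_ID" -all
  $NUTCH updatedb -all -crawlId "$CRAWL_ID"
done
```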
solrindex will be covered in the next article....