The previous article covered installing Nutch.
This one walks through a simple crawl of the site http://www.6vhao.com.
1. Change into the directory nutch-2.3/runtime/local
2. mkdir urls
nano urls/url and add the link
http://www.6vhao.com
then save and exit.
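The same seed file can also be created without an editor. A minimal sketch, assuming it is run from nutch-2.3/runtime/local and that the file is named url as in step 2:

```shell
# Create the seed directory (no error if it already exists).
mkdir -p urls
# Write the single seed URL, one URL per line.
printf '%s\n' 'http://www.6vhao.com' > urls/url
# Show what will be injected.
cat urls/url
```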
3. In the local directory, run
./bin/nutch with no arguments; it lists all available commands:
  inject          inject new urls into the database
  hostinject      creates or updates an existing host table from a text file
  generate        generate new batches to fetch from crawl db
  fetch           fetch URLs marked during generate
  parse           parse URLs marked during fetch
  updatedb        update web table after parsing
  updatehostdb    update host table after parsing
  readdb          read/dump records from page database
  readhostdb      display entries from the hostDB
  index           run the plugin-based indexer on parsed batches
  elasticindex    run the elasticsearch indexer - DEPRECATED use the index command instead
  solrindex       run the solr indexer on parsed batches - DEPRECATED use the index command instead
  solrdedup       remove duplicates from solr
  solrclean       remove HTTP 301 and 404 documents from solr - DEPRECATED use the clean command instead
  clean           remove HTTP 301 and 404 documents and duplicates from indexing backends configured via plugins
  parsechecker    check the parser for a given url
  indexchecker    check the indexing filters for a given url
  plugin          load a plugin and run one of its classes main()
  nutchserver     run a (local) Nutch server on a user defined port
  webapp          run a local Nutch web application
  junit           runs the given JUnit test
 or
  CLASSNAME       run the class named CLASSNAME
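Read top to bottom, the first few commands make up one crawl round: inject the seeds, generate a batch, fetch it, parse it, then update the web table. A dry-run sketch of that order follows. The argument forms are assumptions based on Nutch 2.x usage (urls as the seed directory, -topN 50 as an arbitrary batch size, -all to select every batch), and one_round and RUN are hypothetical names introduced here:

```shell
# Dry run by default: each step is printed, not executed.
# Set RUN to an empty string to execute for real from nutch-2.3/runtime/local.
RUN=${RUN-echo}

one_round() {
  $RUN ./bin/nutch inject urls          # seed URLs from the urls directory
  $RUN ./bin/nutch generate -topN 50    # assumed batch size of 50
  $RUN ./bin/nutch fetch -all           # fetch the generated batch
  $RUN ./bin/nutch parse -all           # parse what was fetched
  $RUN ./bin/nutch updatedb -all        # write results back to the web table
}

one_round
```

Keeping the dry run as the default makes it safe to check the sequence before anything touches the database.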
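4. First, use the ./bin/crawl command, which crawls the pages in one step.
5. When the crawl finishes, change into the HBase directory and run
./bin/hbase shell to enter the HBase shell. The list command shows the current table, data_webpage (Nutch appends the _webpage suffix).
6. In the HBase shell, run scan 'data_webpage' to view its contents. Some sample rows copied from the output: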
 tv.66ys.www:http/zy/   column=f:ts, timestamp=1446050113914, value=\x00\x00\x01P\xAFM\xA9s
 tv.66ys.www:http/zy/   column=il:http://www.66ys.tv/, timestamp=1446050113914, value=\xE7\xBB\xBC\xE8\x89\xBA
 tv.66ys.www:http/zy/   column=mk:dist, timestamp=1446050113914, value=2
 tv.66ys.www:http/zy/   column=mtdt:_csh_, timestamp=1446050113914, value=\x00\x00\x00\x00
 tv.66ys.www:http/zy/   column=s:s, timestamp=1446050113914, value=\x00\x00\x00\x00
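The row keys and values in this output are encoded. Nutch 2.x stores the host part of each URL reversed in the row key. The f:ts value appears to be an 8-byte big-endian timestamp in milliseconds. That is an assumption from the byte layout, although it decodes to within a few milliseconds of the cell timestamp. A bash sketch for reading both follows; unreverse_key is a hypothetical helper name and it assumes the key carries no explicit port:

```shell
# Turn a row key such as tv.66ys.www:http/zy/ back into a URL.
unreverse_key() {
  key=$1
  rhost=${key%%:*}      # reversed host, e.g. tv.66ys.www
  rest=${key#*:}        # protocol and path, e.g. http/zy/
  proto=${rest%%/*}     # e.g. http
  path=/${rest#*/}      # e.g. /zy/
  host=
  old_ifs=$IFS; IFS=.
  # Prepend each label so the dot-separated order is reversed.
  for label in $rhost; do
    host=$label${host:+.$host}
  done
  IFS=$old_ifs
  printf '%s://%s%s\n' "$proto" "$host" "$path"
}

unreverse_key 'tv.66ys.www:http/zy/'    # prints http://www.66ys.tv/zy/

# f:ts bytes \x00\x00\x01P\xAFM\xA9s are 00 00 01 50 AF 4D A9 73 in hex
# ('P' = 0x50, 'M' = 0x4D, 's' = 0x73); read them as one integer.
printf '%d\n' 0x0150AF4DA973            # prints 1446050113907
```

The il: value \xE7\xBB\xBC\xE8\x89\xBA is UTF-8 text; it decodes to 综艺.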
More on this next time~