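1. Configure my.ini
Find [mysqld] and add skip-grant-tables and character-set-server=utf8 beneath it; then find [mysql] and [client] and add default-character-set=utf8 beneath each.
Restart the MySQL service.
(Note: settings that are already present do not need to be added again.)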
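The note above suggests checking what is already configured before editing. A minimal sketch of such a check follows; `has_setting` and the throwaway sample file are illustrative assumptions, not part of MySQL, and the real my.ini path depends on the installation.

```shell
#!/bin/sh
# Sketch: test whether a key is already set in a given section of a
# my.ini-style file. has_setting is an illustrative helper (an assumption).
has_setting() {
    # $1 = ini file, $2 = section name without brackets, $3 = key
    awk -v section="$2" -v key="$3" '
        /^[ \t]*\[.*\][ \t\r]*$/ {
            # Section header: remember its name without the brackets
            current = $0
            sub(/^[ \t]*\[/, "", current)
            sub(/\][ \t\r]*$/, "", current)
            next
        }
        current == section {
            # Strip surrounding whitespace, then compare the part before "="
            line = $0
            sub(/^[ \t]+/, "", line)
            sub(/[ \t\r]+$/, "", line)
            split(line, parts, /[ \t]*=[ \t]*/)
            if (parts[1] == key) found = 1
        }
        END { exit (found ? 0 : 1) }
    ' "$1"
}

# Demo on a throwaway file rather than the real my.ini
sample=$(mktemp)
cat > "$sample" <<'EOF'
[mysqld]
character-set-server=utf8
[client]
default-character-set=utf8
EOF

for entry in "mysqld skip-grant-tables" "mysqld character-set-server" \
             "mysql default-character-set" "client default-character-set"; do
    set -- $entry
    if has_setting "$sample" "$1" "$2"; then
        echo "[$1] $2: already present"
    else
        echo "[$1] $2: needs to be added"
    fi
done
rm -f "$sample"
```

The check is per section because default-character-set has to appear under both [mysql] and [client].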
2. Create the database and table
CREATE TABLE `webpage` (
  `id` varchar(767) CHARACTER SET latin1 NOT NULL,
  `headers` blob,
  `text` mediumtext,
  `status` int(11) DEFAULT NULL,
  `markers` blob,
  `parseStatus` blob,
  `modifiedTime` bigint(20) DEFAULT NULL,
  `score` float DEFAULT NULL,
  `typ` varchar(32) CHARACTER SET latin1 DEFAULT NULL,
  `baseUrl` varchar(512) CHARACTER SET latin1 DEFAULT NULL,
  `content` mediumblob,
  `title` varchar(2048) DEFAULT NULL,
  `reprUrl` varchar(512) CHARACTER SET latin1 DEFAULT NULL,
  `fetchInterval` int(11) DEFAULT NULL,
  `prevFetchTime` bigint(20) DEFAULT NULL,
  `inlinks` mediumblob,
  `prevSignature` blob,
  `outlinks` mediumblob,
  `fetchTime` bigint(20) DEFAULT NULL,
  `retriesSinceFetch` int(11) DEFAULT NULL,
  `protocolStatus` blob,
  `signature` blob,
  `metadata` blob,
  `batchId` varchar(500) DEFAULT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8;
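The statement above creates only the table, so the database itself has to exist first. A minimal sketch, assuming the database is called nutch (the name that appears in the JDBC URL later in this post) and that the mysql client is available; `create_db_sql` is an illustrative helper, not an existing tool.

```shell
#!/bin/sh
# Sketch: print the statement that creates the database holding the
# webpage table. The name "nutch" is assumed from the JDBC URL used later.
create_db_sql() {
    # $1 = database name
    printf 'CREATE DATABASE IF NOT EXISTS `%s` DEFAULT CHARACTER SET utf8;\n' "$1"
}

create_db_sql nutch
# To apply it, pipe the output into the mysql client, for example:
#   create_db_sql nutch | mysql -u root -p
```

Printing the statement rather than running it keeps the sketch safe to try on a machine without a MySQL server.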
Part 2: Installing, Configuring, and Using Nutch
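1. Download Nutch 2.2.x from http://apache.fayea.com/apache-mirror/nutch/ and extract it to a local installation directory; below, that root directory is referred to as ${NUTCH_HOME}.
2. Enable MySQL support in Nutch by editing ${NUTCH_HOME}/ivy/ivy.xml as follows:
1) Find the following line and uncomment it: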
<dependency org="mysql" name="mysql-connector-java" rev="5.1.18" conf="*->default"/>
2) Change the following line.
Default:
<dependency org="org.apache.gora" name="gora-core" rev="0.3" conf="*->default"/>
After the change:
<dependency org="org.apache.gora" name="gora-core" rev="0.2.1" conf="*->default"/>
3) Uncomment the following line:
<dependency org="org.apache.gora" name="gora-sql" rev="0.1.1-incubating" conf="*->default" />
Note: if steps 2) and 3) are skipped, the following exception is thrown:
Exception in thread "main" java.lang.ClassNotFoundException: org.apache.gora.sql.store.SqlStore
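The revision change in step 2) can also be scripted. The following is only a sketch: `pin_gora_core` is an illustrative helper, it assumes the line looks exactly like the default shown above, and the un-commenting in steps 1) and 3) is still done by hand.

```shell
#!/bin/sh
# Sketch: switch the gora-core revision from 0.3 to 0.2.1 in an ivy.xml file.
# pin_gora_core is an illustrative helper (an assumption, not a Nutch tool).
pin_gora_core() {
    # $1 = path to ivy.xml
    tmp=$(mktemp) || return 1
    sed 's/name="gora-core" rev="0\.3"/name="gora-core" rev="0.2.1"/' "$1" > "$tmp" || return 1
    # Write back through cat so the original file keeps its permissions
    cat "$tmp" > "$1"
    rm -f "$tmp"
}

# Demo on a throwaway file containing only the relevant line
sample=$(mktemp)
echo '<dependency org="org.apache.gora" name="gora-core" rev="0.3" conf="*->default"/>' > "$sample"
pin_gora_core "$sample"
cat "$sample"
rm -f "$sample"
```

Against a real checkout the call would be `pin_gora_core "$NUTCH_HOME/ivy/ivy.xml"`, ideally after making a backup copy.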
3. Configure the database connection
Edit ${NUTCH_HOME}/conf/gora.properties, comment out the default database connection settings, and add the following:
###############################
# MySQL properties
###############################
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://192.168.58.1:3306/nutch?createDatabaseIfNotExist=true
gora.sqlstore.jdbc.user=root
gora.sqlstore.jdbc.password=liuxun123
Substitute the address, user name, and password of the database you want to connect to.
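Since these values differ per environment, one option is to generate the block from parameters. A minimal sketch, where `gora_mysql_props` and the argument order are assumptions made for illustration; the property names are the ones listed above.

```shell
#!/bin/sh
# Sketch: print the MySQL section of gora.properties for a given server.
# gora_mysql_props is an illustrative helper (an assumption).
gora_mysql_props() {
    # $1 = host, $2 = port, $3 = database, $4 = user, $5 = password
    cat <<EOF
gora.sqlstore.jdbc.driver=com.mysql.jdbc.Driver
gora.sqlstore.jdbc.url=jdbc:mysql://${1}:${2}/${3}?createDatabaseIfNotExist=true
gora.sqlstore.jdbc.user=${4}
gora.sqlstore.jdbc.password=${5}
EOF
}

# Example values; replace them with your own server details
gora_mysql_props 127.0.0.1 3306 nutch root changeme
# To append to the real file:
#   gora_mysql_props ... >> "$NUTCH_HOME/conf/gora.properties"
```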
4. Edit the nutch-site configuration file
Add the following inside the configuration element of ${NUTCH_HOME}/conf/nutch-site.xml:
<property>
  <name>http.agent.name</name>
  <value>LiuXun Nutch Spider</value>
</property>
<property>
  <name>http.accept.language</name>
  <value>ja-jp, en-us,en-gb,en;q=0.7,*;q=0.3</value>
  <description>Value of the "Accept-Language" request header field. This allows selecting non-English language as default one to retrieve. It is a useful setting for search engines build for certain national group. </description>
</property>
<property>
  <name>parser.character.encoding.default</name>
  <value>utf-8</value>
  <description>The character encoding to fall back to when no other information is available</description>
</property>
<property>
  <name>storage.data.store.class</name>
  <value>org.apache.gora.sql.store.SqlStore</value>
  <description>The Gora DataStore class for storing and retrieving data. Currently the following stores are available: …. </description>
</property>
<property>
  <name>generate.batch.id</name>
  <value>*</value>
</property>
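After editing, it can help to confirm that the key properties actually made it into the file. The sketch below does a plain text search and does not parse or validate the XML; `has_property` and `report_properties` are illustrative helpers, and the choice of which two properties to check is an assumption.

```shell
#!/bin/sh
# Sketch: report whether selected property names appear in a nutch-site.xml.
# Both helpers are illustrative assumptions; this is a text search only.
has_property() {
    # $1 = path to nutch-site.xml, $2 = property name
    grep -qF "<name>$2</name>" "$1"
}

report_properties() {
    # $1 = path to nutch-site.xml
    for name in http.agent.name storage.data.store.class; do
        if has_property "$1" "$name"; then
            echo "$name: set"
        else
            echo "$name: missing"
        fi
    done
}

# Demo on a throwaway file that deliberately omits the store class
sample=$(mktemp)
cat > "$sample" <<'EOF'
<configuration>
  <property>
    <name>http.agent.name</name>
    <value>LiuXun Nutch Spider</value>
  </property>
</configuration>
EOF
report_properties "$sample"
rm -f "$sample"
```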
5. Build Nutch 2.2.*
1) Install Ant first.
2) Change to ${NUTCH_HOME} and run the ant command.
3) After a successful build, a runtime directory appears under ${NUTCH_HOME}.
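The three build steps can be sketched as a small script. `build_ok` and `build_nutch` are illustrative helpers (assumptions), and the check looks for runtime/local because that is the directory used in the next step.

```shell
#!/bin/sh
# Sketch: run the Ant build and confirm that the runtime directory exists.
# build_ok and build_nutch are illustrative helpers (assumptions).
build_ok() {
    # $1 = Nutch root directory
    [ -d "$1/runtime/local" ]
}

build_nutch() {
    # $1 = Nutch root directory
    command -v ant >/dev/null 2>&1 || { echo "ant not found on PATH" >&2; return 1; }
    (cd "$1" && ant) || return 1
    build_ok "$1"
}

# Only build when NUTCH_HOME is set, so the sketch is harmless otherwise
if [ -n "${NUTCH_HOME:-}" ]; then
    if build_nutch "$NUTCH_HOME"; then
        echo "build finished: $NUTCH_HOME/runtime/local is ready"
    else
        echo "build failed or runtime/local is missing" >&2
    fi
fi
```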
6. Configure and run the crawl
1) Change to the ${NUTCH_HOME}/runtime/local directory.
2) Set the site to crawl.
Run:
mkdir -p urls
echo 'http://www.oschina.net/' > urls/seed.txt
3) Run the crawl:
bin/nutch crawl urls -depth 3 -topN 5
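The nutch command was introduced in an earlier chapter.
When the crawl finishes, the crawled content can be viewed in MySQL, as shown in the figure below: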
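The same check can be made from the command line. A minimal sketch, assuming the mysql client is installed and the database is named nutch as in the JDBC URL above; `crawl_summary_sql` is an illustrative helper, and the selected columns come from the webpage table defined earlier.

```shell
#!/bin/sh
# Sketch: print a query that lists a few crawled rows from the webpage table.
# crawl_summary_sql is an illustrative helper (an assumption).
crawl_summary_sql() {
    # $1 = maximum number of rows to show
    printf 'SELECT id, status, title FROM webpage LIMIT %d;\n' "$1"
}

crawl_summary_sql 10
# To run it against the server configured in gora.properties, for example:
#   crawl_summary_sql 10 | mysql -h 192.168.58.1 -u root -p nutch
```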