Writing the Simplest Possible Nutch Plugin

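Nutch is highly extensible; its plugin system is based on the Eclipse 2.x plugin system. In this article I explain how to write a Nutch plugin, and the pitfalls I ran into along the way.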

First make sure you have Nutch running successfully in Eclipse; see "Running Nutch in Eclipse" for reference.

The plugin we are going to build takes over the fetch step: whatever URL is fetched, it returns hello world. Simple enough...

Plugin mechanism

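Nutch's plugin mechanism works roughly like this: Nutch itself exposes several extension points, each of which is an interface. We implement the interface to provide that extension point, and that implementation is an extension. One plugin can contain multiple extensions.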
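To make the relationship between extension points and extensions concrete, here is a minimal plain-Java sketch. It is not Nutch code: the Greeter interface, the two implementations and the Registry class are all hypothetical names invented for illustration. It only shows the idea that an extension point is an interface and that several extensions can be registered against it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class ExtensionPointSketch {

    /** Hypothetical extension point: just a Java interface. */
    interface Greeter {
        String greet(String name);
    }

    /** Two hypothetical extensions of the same extension point. */
    static class PlainGreeter implements Greeter {
        public String greet(String name) {
            return "hello " + name;
        }
    }

    static class LoudGreeter implements Greeter {
        public String greet(String name) {
            return "HELLO " + name.toUpperCase();
        }
    }

    /** Toy registry mapping an extension-point interface to its extensions. */
    static class Registry {
        private final Map<Class<?>, List<Object>> extensions = new LinkedHashMap<>();

        <T> void register(Class<T> point, T extension) {
            extensions.computeIfAbsent(point, k -> new ArrayList<>()).add(extension);
        }

        <T> List<T> lookup(Class<T> point) {
            List<T> result = new ArrayList<>();
            for (Object o : extensions.getOrDefault(point, new ArrayList<>())) {
                result.add(point.cast(o));
            }
            return result;
        }
    }

    /** Registers both extensions, then runs them in registration order. */
    static List<String> greetAll(String name) {
        Registry registry = new Registry();
        registry.register(Greeter.class, new PlainGreeter());
        registry.register(Greeter.class, new LoudGreeter());
        List<String> out = new ArrayList<>();
        for (Greeter g : registry.lookup(Greeter.class)) {
            out.add(g.greet(name));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(greetAll("world"));
    }
}
```

In Nutch the registration step is driven by each plugin's plugin.xml rather than by explicit register calls.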

These are the main Nutch extension points listed on the official Nutch site:

  • IndexWriter -- Writes crawled data to a specific indexing backends (Solr, ElasticSearch, a CSV file, etc.).

  • IndexingFilter -- Permits one to add metadata to the indexed fields. All plugins found which implement this extension point are run sequentially on the parse (from javadoc).

  • Parser -- Parser implementations read through fetched documents in order to extract data to be indexed. This is what you need to implement if you want Nutch to be able to parse a new type of content, or extract more data from currently parseable content.

  • HtmlParseFilter -- Permits one to add additional metadata to HTML parses (from javadoc).

  • Protocol -- Protocol implementations allow Nutch to use different protocols (ftp, http, etc.) to fetch documents.

  • URLFilter -- URLFilter implementations limit the URLs that Nutch attempts to fetch. The RegexURLFilter distributed with Nutch provides a great deal of control over what URLs Nutch crawls, however if you have very complicated rules about what URLs you want to crawl, you can write your own implementation.

  • URLNormalizer -- Interface used to convert URLs to normal form and optionally perform substitutions.

  • ScoringFilter -- A contract defining behavior of scoring plugins. A scoring filter will manipulate scoring variables in CrawlDatum and in resulting search indexes. Filters can be chained in a specific order, to provide multi-stage scoring adjustments.

  • SegmentMergeFilter -- Interface used to filter segments during segment merge. It allows filtering on more sophisticated criteria than just URLs. In particular it allows filtering based on metadata collected while parsing page.

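We want to take over the page-fetching part, so the Protocol extension point is our target.

Analyzing the protocol-http plugin

Nutch ships with many default plugins, whose source code lives in src/plugin. If the URL being fetched uses the http protocol, Nutch uses the protocol-http plugin. Reading existing code is the best way to learn, so let's see how protocol-http is implemented.

Directory structure

The directory structure of the protocol-http source: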

protocol-http:                                            
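│  build.xml    // the plugin's Ant build file; describes how to build the plugin
│  ivy.xml      // declares the third-party libraries the plugin depends on
│  plugin.xml   // plugin descriptor; Nutch reads it to learn which extension points the plugin implements, and therefore when to call it
│
└─src           // plugin source directory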
    ├─java                                      
    │  └─org                                    
    │      └─apache                             
    │          └─nutch                          
    │              └─protocol                   
    │                  └─http                   
    │                          Http.java        
    │                          HttpResponse.java
    │                          package.html     
    │                                           
    └─test                                      
        └─org                                   
            └─apache                            
                └─nutch                         
                    └─protocol                  
                        └─http

Analyzing the plugin files

Let's look at the contents of each file in turn.

build.xml:

<?xml version="1.0"?>
<project name="protocol-http" default="jar-core">  <!-- the name attribute defines the plugin's name -->

  <import file="../build-plugin.xml"/>

  <!-- Build compilation dependencies -->
  <target name="deps-jar">
    <ant target="jar" inheritall="false" dir="../lib-http"/>  <!-- protocol-http depends on another plugin, lib-http -->
  </target>

  <!-- Add compilation dependencies to classpath -->
  <path id="plugin.deps">
    <fileset dir="${nutch.root}/build">
      <include name="**/lib-http/*.jar" />
    </fileset>
  </path>

  <!-- Deploy Unit test dependencies -->
  <target name="deps-test">
    <ant target="deploy" inheritall="false" dir="../lib-http"/>
    <ant target="deploy" inheritall="false" dir="../nutch-extensionpoints"/>
  </target>

</project>

ivy.xml:

<?xml version="1.0" ?>
<ivy-module version="1.0">
  <info organisation="org.apache.nutch" module="${ant.project.name}">
    <license name="Apache 2.0"/>
    <ivyauthor name="Apache Nutch Team" url="http://nutch.apache.org"/>
    <description>
        Apache Nutch
    </description>
  </info>

  <configurations>
    <include file="../../..//ivy/ivy-configurations.xml"/>
  </configurations>

  <publications>
    <!--get the artifact from our module name-->
    <artifact conf="master"/>
  </publications>

  <dependencies>
  </dependencies>

</ivy-module>

plugin.xml:

<?xml version="1.0" encoding="UTF-8"?>
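<!-- id: plugin id; name: plugin name; version: plugin version; provider-name: plugin provider -->
<plugin
   id="protocol-http"
   name="Http Protocol Plug-in"
   version="1.0.0"
   provider-name="nutch.org">

   <runtime>
      <!-- name of the jar the plugin is built into -->
      <library name="protocol-http.jar">
         <export name="*"/>
      </library>
   </runtime>

   <!-- other plugins this plugin requires -->
   <requires>
      <import plugin="nutch-extensionpoints"/>
      <import plugin="lib-http"/>
   </requires>

   <!-- the extensions this plugin contains; id: extension id, name: extension name, point: extension point -->
   <extension id="org.apache.nutch.protocol.http"
              name="HttpProtocol"
              point="org.apache.nutch.protocol.Protocol">

      <!-- an extension can contain several implementations; id: implementation id, class: implementation class -->
      <implementation id="org.apache.nutch.protocol.http.Http"
                      class="org.apache.nutch.protocol.http.Http">
        <!-- this implementation is used when protocolName is http (I could not find this defined anywhere in the Nutch docs) -->
        <parameter name="protocolName" value="http"/>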
      </implementation>

      <implementation id="org.apache.nutch.protocol.http.Http"
                       class="org.apache.nutch.protocol.http.Http">
           <parameter name="protocolName" value="https"/>
      </implementation>

   </extension>

</plugin>
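To see what information the descriptor carries, the following sketch reads a trimmed copy of the plugin.xml above with the JDK's DOM parser and lists each implementation together with its protocolName parameter. This is only an illustration of the file's structure, not how Nutch itself loads descriptors; the class PluginXmlSketch and its describe method are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;

public class PluginXmlSketch {

    /** Trimmed copy of the protocol-http descriptor (runtime/requires omitted). */
    static final String SAMPLE = String.join("\n",
        "<plugin id=\"protocol-http\" name=\"Http Protocol Plug-in\"",
        "        version=\"1.0.0\" provider-name=\"nutch.org\">",
        "  <extension id=\"org.apache.nutch.protocol.http\" name=\"HttpProtocol\"",
        "             point=\"org.apache.nutch.protocol.Protocol\">",
        "    <implementation id=\"org.apache.nutch.protocol.http.Http\"",
        "                    class=\"org.apache.nutch.protocol.http.Http\">",
        "      <parameter name=\"protocolName\" value=\"http\"/>",
        "    </implementation>",
        "    <implementation id=\"org.apache.nutch.protocol.http.Http\"",
        "                    class=\"org.apache.nutch.protocol.http.Http\">",
        "      <parameter name=\"protocolName\" value=\"https\"/>",
        "    </implementation>",
        "  </extension>",
        "</plugin>");

    /** Returns one "point -> class (name=value)" line per implementation parameter. */
    static List<String> describe(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            List<String> out = new ArrayList<>();
            NodeList extensions = doc.getElementsByTagName("extension");
            for (int i = 0; i < extensions.getLength(); i++) {
                Element ext = (Element) extensions.item(i);
                NodeList impls = ext.getElementsByTagName("implementation");
                for (int j = 0; j < impls.getLength(); j++) {
                    Element impl = (Element) impls.item(j);
                    NodeList params = impl.getElementsByTagName("parameter");
                    for (int k = 0; k < params.getLength(); k++) {
                        Element p = (Element) params.item(k);
                        out.add(ext.getAttribute("point") + " -> " + impl.getAttribute("class")
                            + " (" + p.getAttribute("name") + "=" + p.getAttribute("value") + ")");
                    }
                }
            }
            return out;
        } catch (Exception e) {
            throw new IllegalStateException("could not parse descriptor", e);
        }
    }

    public static void main(String[] args) {
        describe(SAMPLE).forEach(System.out::println);
    }
}
```

The two lines it produces differ only in the protocolName value, which matches the reading above: the same Http class is registered once for http and once for https.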

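Minimal configuration files

By generalizing from the above, we can work out the minimal configuration file format (I could not find a detailed definition in the Nutch documentation, so this is inferred from the examples...)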

build.xml:

<?xml version="1.0"?>
<project name="PLUGIN_ID" default="jar-core">

  <import file="../build-plugin.xml"/>

</project>

If the plugin has no third-party dependencies, ivy.xml can simply be written like this:

<?xml version="1.0" ?>
<ivy-module version="1.0">
  <info organisation="org.apache.nutch" module="${ant.project.name}">
    <license name="Apache 2.0"/>
    <ivyauthor name="Apache Nutch Team" url="http://nutch.apache.org"/>
    <description>
        Apache Nutch
    </description>
  </info>

  <configurations>
    <include file="../../..//ivy/ivy-configurations.xml"/>
  </configurations>

  <publications>
    <!--get the artifact from our module name-->
    <artifact conf="master"/>
  </publications>

  <dependencies>
  </dependencies>
</ivy-module>

plugin.xml:

<?xml version="1.0" encoding="UTF-8"?>
<plugin
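   id="PLUGIN_ID"
   name="PLUGIN_NAME"
   version="PLUGIN_VERSION (x.x.x)"
   provider-name="PLUGIN_AUTHOR">

   <runtime>
      <library name="PLUGIN_ID.jar">
         <export name="*"/>
      </library>
   </runtime>

   <requires>
      <import plugin="nutch-extensionpoints"/>
   </requires>

   <extension id="EXTENSION_ID (usually the extension's package name)"
              name="EXTENSION_NAME"
              point="EXTENSION_POINT (fully qualified interface name)">

      <implementation id="IMPLEMENTATION_ID"
                      class="IMPLEMENTATION_CLASS (fully qualified name)">
                      <!-- parameters; these differ from one extension point to another -->
                      <parameter name="protocolName" value="http"/>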
      </implementation>

   </extension>

</plugin>

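Getting started on the plugin

Now that we know how a plugin is structured, we can follow the same pattern. We name our plugin protocol-test, and the extension point it implements is again org.apache.nutch.protocol.Protocol.

Create a new directory protocol-test under src/plugin.

Writing the descriptor files

Create build.xml: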

<?xml version="1.0"?>
<project name="protocol-test" default="jar-core">

  <import file="../build-plugin.xml"/>

</project>

Create ivy.xml:

<?xml version="1.0" ?>
<ivy-module version="1.0">
  <info organisation="org.apache.nutch" module="${ant.project.name}">
    <license name="Apache 2.0"/>
    <ivyauthor name="Apache Nutch Team" url="http://nutch.apache.org"/>
    <description>
        Apache Nutch
    </description>
  </info>

  <configurations>
    <include file="../../..//ivy/ivy-configurations.xml"/>
  </configurations>

  <publications>
    <!--get the artifact from our module name-->
    <artifact conf="master"/>
  </publications>

  <dependencies>
  </dependencies>
</ivy-module>

Create plugin.xml:

<?xml version="1.0" encoding="UTF-8"?>
<plugin
   id="protocol-test"
   name="Protocol Plug-in Test"
   version="1.0.0"
   provider-name="mushan">

   <runtime>
      <library name="protocol-test.jar">
         <export name="*"/>
      </library>
   </runtime>

   <requires>
      <import plugin="nutch-extensionpoints"/>
   </requires>

   <extension id="com.mushan.protocol"
              name="ProtocolTest"
              point="org.apache.nutch.protocol.Protocol">

      <implementation id="com.mushan.protocol.Test"
                      class="com.mushan.protocol.Test">
                      <parameter name="protocolName" value="http"/>
      </implementation>

   </extension>

</plugin>

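Importing the plugin directory in Eclipse

Inside protocol-test, create the directory src/java/com/mushan/protocol. Note that the directory structure must match the package structure of the class name given in the implementation element.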
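The mapping from the class attribute to the source path is mechanical, as this small sketch shows. The helper name sourcePath is hypothetical; the src/java layout is the one used by the protocol-http plugin analyzed earlier.

```java
public class SourcePathSketch {

    /**
     * Maps a fully qualified class name to the source file expected under
     * a plugin directory that follows the src/java layout.
     */
    static String sourcePath(String pluginDir, String className) {
        return pluginDir + "/src/java/" + className.replace('.', '/') + ".java";
    }

    public static void main(String[] args) {
        System.out.println(sourcePath("src/plugin/protocol-test", "com.mushan.protocol.Test"));
    }
}
```

If the package in the Java file and the directory do not line up, the class named in plugin.xml will not be where the build and the class loader expect it.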

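Refresh the project first, then open the Nutch project's properties: Java Build Path > Source > Add Folder...

[Screenshot: Source tab]

In the dialog, select the plugin's source directory and add it:

[Screenshot: adding the plugin folder]

Writing the plugin's core code

Create a new class Test, and remember to select org.apache.nutch.protocol.Protocol as its interface, i.e. the extension point we are implementing:

[Screenshot: new Test class dialog]

The code of the Test class is as follows: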

package com.mushan.protocol;

import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.io.Text;
import org.apache.nutch.crawl.CrawlDatum;
import org.apache.nutch.metadata.Metadata;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.protocol.Protocol;
import org.apache.nutch.protocol.ProtocolOutput;
import org.apache.nutch.protocol.RobotRulesParser;

import crawlercommons.robots.BaseRobotRules;

public class Test implements Protocol {
    private Configuration conf = null;

    @Override
    public Configuration getConf() {
        return this.conf;
    }

    @Override
    public void setConf(Configuration conf) {
        this.conf = conf;
    }

    @Override
    public ProtocolOutput getProtocolOutput(Text url, CrawlDatum datum) {
        // Whatever the URL is, the returned page content is "hello world".
        byte[] body = "hello world".getBytes(StandardCharsets.UTF_8);
        Content c = new Content(url.toString(), url.toString(), body,
                "text/html", new Metadata(), this.conf);
        return new ProtocolOutput(c);
    }

    @Override
    public BaseRobotRules getRobotRules(Text url, CrawlDatum datum) {
        return RobotRulesParser.EMPTY_RULES;    // no robots rules
    }
}

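That completes the plugin code itself. But for Nutch to build and load the plugin, some configuration is still needed.

Integrating the plugin into Nutch

Edit conf/nutch-site.xml to enable our plugin: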

<configuration>
  <property>
    <name>http.agent.name</name>
    <value>MySpider</value>
  </property>

  <property>
    <name>plugin.folders</name>
    <value>build/plugins</value>
  </property>

  <property>
    <name>plugin.includes</name>
    <!-- the original protocol-http has been replaced with protocol-test -->
    <value>protocol-test|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)</value>
  </property>
</configuration>
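The plugin.includes value is a regular expression over plugin ids. The sketch below assumes each plugin id is matched against the whole expression, which is my understanding of how Nutch filters plugins and not something this article verifies. It checks which ids from a hypothetical candidate list would be enabled. PluginIncludesSketch and enabled are illustrative names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

public class PluginIncludesSketch {

    /** The plugin.includes value from the nutch-site.xml above. */
    static final String INCLUDES =
        "protocol-test|urlfilter-regex|parse-(html|tika)|index-(basic|anchor)"
        + "|indexer-solr|scoring-opic|urlnormalizer-(pass|regex|basic)";

    /** Keeps the ids that match the includes expression as a whole. */
    static List<String> enabled(String includesRegex, List<String> pluginIds) {
        Pattern includes = Pattern.compile(includesRegex);
        List<String> out = new ArrayList<>();
        for (String id : pluginIds) {
            if (includes.matcher(id).matches()) {
                out.add(id);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Hypothetical candidate list; a real run would see every plugin under plugin.folders.
        List<String> candidates = Arrays.asList(
            "protocol-http", "protocol-test", "parse-html", "parse-tika", "lib-http");
        System.out.println(enabled(INCLUDES, candidates));
    }
}
```

Under that assumption protocol-http no longer matches, so only protocol-test is left to serve the Protocol extension point.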

In src/plugin/build.xml, add the following inside <target name="deploy">:

<ant dir="protocol-test" target="deploy"/>

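This way the plugin's jar is generated during the Ant build.

However, building everything takes a while, so we define a temporary Ant target that builds only our plugin. Add the following to src/plugin/build.xml: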

<target name="my-plugin">
    <ant dir="protocol-test" target="deploy"/>
</target>

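Run the Ant build in Eclipse to build the plugin:

[Screenshot: Ant build settings]

[Screenshot: selecting the Ant target]

Click Run, and plugin.xml and protocol-test.jar are generated in the plugins\protocol-test directory. If you hit an error, see the Ivy problem in the error log section.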

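Writing the main class

With the plugin in place, we define a main class to test it. The Main class runs a simulated crawl and dumps the fetched data to the data/readseg directory.

Create the class Main.java: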

package com.mushan;

import java.io.File;
import java.util.Arrays;

import org.apache.hadoop.util.ToolRunner;
import org.apache.nutch.crawl.Generator;
import org.apache.nutch.crawl.Injector;
import org.apache.nutch.fetcher.Fetcher;
import org.apache.nutch.segment.SegmentReader;
import org.apache.nutch.util.NutchConfiguration;

public class Main {
  public static void main(String[] args) {
    String[] injectArgs = {"data/crawldb", "urls/"};
    String[] generatorArgs = {"data/crawldb", "data/segments", "-noFilter"};
    String[] fetchArgs = {"data/segments/"};
    String[] readsegArgs = {"-dump", "data/segments/", "data/readseg", "-noparsetext", "-noparse", "-noparsedata"};

    // Start from a clean data directory on every run.
    File dataFile = new File("data");
    if (dataFile.exists()) {
      print("delete");
      deleteDir(dataFile);
    }
    try {
      ToolRunner.run(NutchConfiguration.create(), new Injector(), injectArgs);
      ToolRunner.run(NutchConfiguration.create(), new Generator(), generatorArgs);
      // Fetch the first segment found under data/segments.
      File segPath = new File("data/segments");
      String[] list = segPath.list();
      print(Arrays.asList(list));
      fetchArgs[0] = fetchArgs[0] + list[0];
      ToolRunner.run(NutchConfiguration.create(), new Fetcher(), fetchArgs);

      // Dump the fetched segment to data/readseg.
      readsegArgs[1] += list[0];
      SegmentReader.main(readsegArgs);
    } catch (Exception e) {
      e.printStackTrace();
    }
  }

  // Recursively deletes a directory and everything under it.
  private static boolean deleteDir(File dir) {
    if (dir.isDirectory()) {
      String[] children = dir.list();
      for (int i = 0; i < children.length; i++) {
        boolean success = deleteDir(new File(dir, children[i]));
        if (!success) {
          return false;
        }
      }
    }
    return dir.delete();
  }

  public static final void print(Object text) {
    System.out.println(text);
  }
}
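Main simply takes the first entry returned by list(), which works here because the data directory is wiped before every run. If you keep old segments around, a slightly more defensive lookup may help. The sketch below assumes segment directories are named with a sortable timestamp such as 20150212094811 (the format seen in the output later in this article). The class, the helper newestSegment and the second segment name are hypothetical.

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.Arrays;

public class SegmentPickerSketch {

    /**
     * Returns the name of the lexicographically last sub-directory, or null
     * when the directory is missing or contains no sub-directories.
     */
    static String newestSegment(File segmentsDir) {
        String[] names = segmentsDir.list((dir, name) -> new File(dir, name).isDirectory());
        if (names == null || names.length == 0) {
            return null;
        }
        Arrays.sort(names); // timestamp names sort chronologically
        return names[names.length - 1];
    }

    /** Self-contained demo on a temporary directory with two fake segments. */
    static String demo() {
        try {
            File root = Files.createTempDirectory("segments").toFile();
            new File(root, "20150212094811").mkdir();
            new File(root, "20150212101500").mkdir(); // hypothetical later segment
            return newestSegment(root);
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The null check also covers the case where the generator produced no segment at all, which would otherwise surface as a NullPointerException or an ArrayIndexOutOfBoundsException in Main.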

Run the Main class. Now comes the exciting part!! Because you are going to hit a lot of errors...

Error log

Ivy problem

BUILD FAILED
E:\project\java\nutch2015-2\apache-nutch-1.9\src\plugin\build.xml:81: The following error occurred while executing this line:
E:\project\java\nutch2015-2\apache-nutch-1.9\src\plugin\protocol-test\build.xml:4: The following error occurred while executing this line:
E:\project\java\nutch2015-2\apache-nutch-1.9\src\plugin\build-plugin.xml:47: Problem: failed to create task or type antlib:org.apache.ivy.ant:settings
Cause: The name is undefined.

This is because Eclipse's built-in Ant does not have Ivy installed. Go to Window > Preferences:

[Screenshot: adding ivy.jar]

Just select your ivy.jar file.

File permission error

Generator: java.io.IOException: Failed to set permissions of path: \tmp\hadoop-mzb\mapred\staging\mzb1466704581\.staging to 0700

See "Running Nutch in Eclipse".

Verification

If running Main prints:

...
Injector: starting at 2015-02-12 09:48:04
Injector: crawlDb: data/crawldb

...
Fetcher: finished at 2015-02-12 09:48:18, elapsed: 00:00:05
SegmentReader: dump segment: data/segments/20150212094811
SegmentReader: done

then the fetch succeeded!

Open data/readseg/dump; this file contains the dumped fetch data:

Recno:: 0
URL:: http://nutch.apache.org/index.html

CrawlDatum::
Version: 7
Status: 1 (db_unfetched)
Fetch time: Thu Feb 12 09:48:07 CST 2015
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: 
  _ngt_=1423705689815

Content::
Version: -1
url: http://nutch.apache.org/index.html
base: http://nutch.apache.org/index.html
contentType: text/html
metadata: nutch.segment.name=20150212094811 _fst_=33 nutch.crawl.score=1.0 
Content:
hello world
CrawlDatum::
Version: 7
Status: 33 (fetch_success)
Fetch time: Thu Feb 12 09:48:14 CST 2015
Modified time: Thu Jan 01 08:00:00 CST 1970
Retries since fetch: 0
Retry interval: 2592000 seconds (30 days)
Score: 1.0
Signature: null
Metadata: 
  _ngt_=1423705689815
  _pst_=success(1), lastModified=0
  Content-Type=text/html

As you can see, the content is the hello world we returned.
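If you would rather check this programmatically than by eye, a small text scan over the dump is enough. The sketch below assumes the layout shown above: a line reading Content: followed by the body, which runs until the next CrawlDatum:: marker. It also assumes Unix line endings. DumpCheckSketch and firstContent are hypothetical names, and the sample string is a shortened copy of the dump.

```java
public class DumpCheckSketch {

    /** Shortened copy of the dump shown above. */
    static final String SAMPLE = String.join("\n",
        "Content::",
        "Version: -1",
        "url: http://nutch.apache.org/index.html",
        "contentType: text/html",
        "Content:",
        "hello worldCrawlDatum::",
        "Version: 7");

    /** Returns the body after the first "Content:" line, or null if there is none. */
    static String firstContent(String dump) {
        String marker = "\nContent:\n"; // does not match the "Content::" header line
        int start = dump.indexOf(marker);
        if (start < 0) {
            return null;
        }
        start += marker.length();
        int end = dump.indexOf("CrawlDatum::", start);
        if (end < 0) {
            end = dump.length();
        }
        return dump.substring(start, end).trim();
    }

    public static void main(String[] args) {
        System.out.println(firstContent(SAMPLE));
    }
}
```

To run it against a real crawl, read data/readseg/dump into a string and pass it to firstContent.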
