A Brief Look at the Nutch Plugin Mechanism (with a Development Example)

Date: 2019-12-01


Plugins provide Nutch with a number of powerful components. For example, HtmlParser is a widely used plugin that parses the HTML files Nutch fetches.

Why does Nutch use a plugin system like this?

There are three reasons:

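1: Extensibility

Through plugins, Nutch lets anyone extend its functionality; all we have to do is provide a simple implementation of a given interface. For example, the MSWordParser plugin, which parses Word documents, is simply an implementation of the Parser interface.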

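2: Flexibility

Because anyone can write a plugin to suit their own needs, plugins add up to a rich library of components. Developers building on Nutch can install whichever plugins fit their search engine, and those plugins live in Nutch's plugins directory. This is a real benefit for anyone running Nutch: there are more content-extraction algorithms to choose from, and adding PDF parsing, for instance, becomes easy.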

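3: Maintainability

Each developer only needs to focus on their own problem. While extending the engine core, the core developers just provide a described interface (a plug) for each specific piece of functionality. A plugin developer only needs to focus on the functionality the plugin implements, without having to know how the whole system works; the only thing they need to know is what types of data the plug and the plugin exchange. This keeps the core simpler and easier to maintain.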

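About plugins: what a plugin is and how plugins work

Nutch's plugin system is modeled on the plugin system of Eclipse 2.x. Plugins are central to how Nutch works: all parsing, indexing and searching in Nutch is carried out by different plugins.

When you write a plugin, you add one or more extensions to an extension point. Nutch's extension points are themselves defined in a plugin, NutchExtensionPoints (all extension points are listed in the plugin.xml file of NutchExtensionPoints). Each extension point defines an interface that an extension must implement. The extension points are: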

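OnlineClusterer - the extension point interface for algorithms that cluster search results online

IndexingFilter - lets you add metadata to the fields being indexed. All plugins that implement this interface are run sequentially, one after another, on the parse

Ontology

Parser - a parser implementing this interface reads the fetched document and extracts the data to be indexed. Implement it if you want Nutch to parse a new content type, or to extract more data from content that can already be parsed

HtmlParseFilter - adds extra metadata to the output of the HTML parser

Protocol - plugins implementing Protocol let Nutch fetch data over more network protocols (FTP, HTTP)

QueryFilter - the extension point for query translation

URLFilter - plugins implementing this extension point restrict the URLs of the pages Nutch will fetch. RegexURLFilter controls the URLs Nutch crawls by means of regular expressions. If you need more sophisticated control over URLs, you can write your own URLFilter implementation

NutchAnalyzer - an extension point for language-specific analyzers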
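To make the extension-point idea concrete, here is a minimal, self-contained Java sketch. It does not use the real Nutch API: `UrlFilterPoint`, `RegexFilter` and `FilterChain` are hypothetical stand-ins that only illustrate the pattern of a core-defined interface (the extension point) with plugin-supplied implementations (the extensions) run in order. The convention that a filter returns null to reject a URL is an assumption of this sketch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

class ExtensionPointSketch {

    /** Hypothetical stand-in for an extension point: the contract the core defines. */
    interface UrlFilterPoint {
        /** Returns the URL if it is accepted, or null if it should be skipped. */
        String filter(String url);
    }

    /** A hypothetical extension: accepts only URLs that match a regular expression. */
    static class RegexFilter implements UrlFilterPoint {
        private final Pattern allowed;

        RegexFilter(String regex) {
            this.allowed = Pattern.compile(regex);
        }

        @Override
        public String filter(String url) {
            return allowed.matcher(url).find() ? url : null;
        }
    }

    /** Hypothetical core-side helper: runs every registered extension in order. */
    static class FilterChain {
        private final List<UrlFilterPoint> extensions = new ArrayList<>();

        void register(UrlFilterPoint extension) {
            extensions.add(extension);
        }

        String apply(String url) {
            String current = url;
            for (UrlFilterPoint extension : extensions) {
                if (current == null) {
                    break; // an earlier filter already rejected the URL
                }
                current = extension.filter(current);
            }
            return current;
        }
    }

    public static void main(String[] args) {
        FilterChain chain = new FilterChain();
        chain.register(new RegexFilter("^https?://"));       // web URLs only
        chain.register(new RegexFilter("\\.(html?|txt)$"));  // HTML and text files only
        System.out.println(chain.apply("http://example.org/index.html")); // accepted
        System.out.println(chain.apply("ftp://example.org/index.html"));  // rejected
    }
}
```

The core only knows the interface; each extension can be written, tested and replaced without touching the chain that calls it.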

Source files

In a plugin's source directory you will find the following files:

plugin.xml - describes the plugin to Nutch

build.xml - tells Ant how to build the plugin

the plugin's source code

Using a plugin in Nutch

To use a given plugin in Nutch, edit conf/nutch-site.xml and add the plugin's name to plugin.includes.

Some concepts in the Nutch plugin system

Writing a plugin: an example

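Consider writing a plugin like this: we want to recommend certain related pages for a given search term. For example, suppose that while indexing our pages we notice a group of pages about plugins. When someone searches for plugins, we want to recommend the pluginCentral page to them while still returning all the hits the normal search logic produces. So we split the search results into recommended results and ordinary results.

You go through your pages and add a meta tag to each page that should be recommended; the tag tells the plugin which term the page should be recommended for. The tag should look like this: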

<meta name="recommended" content="plugins" />

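To achieve this we need to write a plugin that extends three different extension points:

HtmlParseFilter: gets the recommended terms from the meta tags

IndexingFilter: adds a recommended field to the index

QueryFilter: adds the ability to query the new field in the index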

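Files to create

First, create a new directory under the plugin directory to hold your plugin. We will name the whole plugin recommended. Then create the following files in that directory:

a plugin.xml, which describes the new plugin to Nutch

a build.xml, which tells the build tool how to build the plugin

the plugin's source code, which goes in /src/java/org/apache/nutch/parse/recommended/[here]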

plugin.xml

Your plugin.xml should look like this:

<?xml version="1.0" encoding="UTF-8"?>
<plugin
  id="recommended"
  name="Recommended Parser/Filter"
  version="0.0.1"
  provider-name="nutch.org">

  <runtime>
    <!-- As defined in build.xml this plugin will end up bundled as recommended.jar -->
    <library name="recommended.jar">
      <export name="*"/>
    </library>
  </runtime>

  <!-- The RecommendedParser extends the HtmlParseFilter to grab the contents of
       any recommended meta tags -->
  <extension id="org.apache.nutch.parse.recommended.recommendedfilter"
             name="Recommended Parser"
             point="org.apache.nutch.parse.HtmlParseFilter">
    <implementation id="RecommendedParser"
                    class="org.apache.nutch.parse.recommended.RecommendedParser"/>
  </extension>

  <!-- The RecommendedIndexer extends the IndexingFilter in order to add the contents
       of the recommended meta tags (as found by the RecommendedParser) to the lucene
       index. -->
  <extension id="org.apache.nutch.parse.recommended.recommendedindexer"
             name="Recommended identifier filter"
             point="org.apache.nutch.indexer.IndexingFilter">
    <implementation id="RecommendedIndexer"
                    class="org.apache.nutch.parse.recommended.RecommendedIndexer"/>
  </extension>

  <!-- The RecommendedQueryFilter gets called when you perform a search. It runs a
       search for the user's query against the recommended fields. In order to
       add this to the list of filters that gets run by default, you have to use
       "fields=DEFAULT". -->
  <extension id="org.apache.nutch.parse.recommended.recommendedSearcher"
             name="Recommended Search Query Filter"
             point="org.apache.nutch.searcher.QueryFilter">
    <implementation id="RecommendedQueryFilter"
                    class="org.apache.nutch.parse.recommended.RecommendedQueryFilter"
                    fields="DEFAULT"/>
  </extension>

</plugin>

build.xml

<?xml version="1.0"?>
<project name="recommended" default="jar">
  <import file="../build-plugin.xml"/>
</project>

The HTML Parser Extension

This is the implementation of the HtmlParseFilter extension point. Its job is to pick up the contents of the recommended meta tags and add them to the document being parsed.

package org.apache.nutch.parse.recommended;

// JDK imports
import java.util.Enumeration;
import java.util.Properties;
import java.util.logging.Logger;

// DOM imports (needed for the DocumentFragment parameter of filter)
import org.w3c.dom.DocumentFragment;

// Nutch imports
import org.apache.nutch.parse.HTMLMetaTags;
import org.apache.nutch.parse.Parse;
import org.apache.nutch.parse.HtmlParseFilter;
import org.apache.nutch.protocol.Content;
import org.apache.nutch.util.LogFormatter;

public class RecommendedParser implements HtmlParseFilter {

  private static final Logger LOG = LogFormatter
      .getLogger(RecommendedParser.class.getName());

  /** The Recommended meta data attribute name */
  public static final String META_RECOMMENDED_NAME = "Recommended";

  /**
   * Looks for a "recommended" meta tag in the HTML document and, if one is
   * found, stores its content in the parse metadata.
   */
  public Parse filter(Content content, Parse parse, HTMLMetaTags metaTags, DocumentFragment doc) {
    // Trying to find the document's recommended term
    String recommendation = null;
    Properties generalMetaTags = metaTags.getGeneralTags();
    for (Enumeration tagNames = generalMetaTags.propertyNames(); tagNames.hasMoreElements(); ) {
      if (tagNames.nextElement().equals("recommended")) {
        recommendation = generalMetaTags.getProperty("recommended");
        LOG.info("Found a Recommendation for " + recommendation);
      }
    }
    if (recommendation == null) {
      LOG.info("No Recommendation");
    } else {
      // Store the term so that the indexing filter can pick it up later
      LOG.info("Adding Recommendation for " + recommendation);
      parse.getData().getMetadata().put(META_RECOMMENDED_NAME, recommendation);
    }
    return parse;
  }
}
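The lookup performed by the parse filter can be tried out without Nutch. The following is a minimal sketch using only `java.util.Properties`; the class `MetaTagLookupSketch` and the method `copyRecommendation` are hypothetical names, and the two `Properties` objects merely stand in for the general meta tags and the parse metadata. It also shows that the tag name ("recommended") and the metadata key ("Recommended") are different, case-sensitive keys.

```java
import java.util.Properties;

class MetaTagLookupSketch {

    /** Key under which the term is stored (mirrors META_RECOMMENDED_NAME in the listing). */
    static final String META_RECOMMENDED_NAME = "Recommended";

    /**
     * Copies the "recommended" meta tag, if present, into the parse metadata.
     * Returns true when a recommendation was found.
     */
    static boolean copyRecommendation(Properties generalMetaTags, Properties parseMetadata) {
        String recommendation = generalMetaTags.getProperty("recommended");
        if (recommendation == null) {
            return false; // the page carries no recommended meta tag
        }
        parseMetadata.setProperty(META_RECOMMENDED_NAME, recommendation);
        return true;
    }

    public static void main(String[] args) {
        Properties tags = new Properties();
        tags.setProperty("recommended", "plugins"); // as in <meta name="recommended" content="plugins" />
        Properties metadata = new Properties();

        System.out.println(copyRecommendation(tags, metadata));
        System.out.println(metadata.getProperty("Recommended"));
        System.out.println(metadata.getProperty("recommended")); // a different key, so nothing is found
    }
}
```

Whatever key the parse filter writes is the key the indexing filter has to read, so the two must be kept identical.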

The Indexer Extension

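This is the extension for the indexing extension point. If content from a recommended meta tag was found, it is put in a field named "recommended" and added to the document to be indexed.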

package org.apache.nutch.parse.recommended;

// JDK import
import java.util.logging.Logger;

// Nutch imports
import org.apache.nutch.util.LogFormatter;
import org.apache.nutch.fetcher.FetcherOutput;
import org.apache.nutch.indexer.IndexingFilter;
import org.apache.nutch.indexer.IndexingException;
import org.apache.nutch.parse.Parse;

// Lucene imports
import org.apache.lucene.document.Field;
import org.apache.lucene.document.Document;

public class RecommendedIndexer implements IndexingFilter {

  public static final Logger LOG
      = LogFormatter.getLogger(RecommendedIndexer.class.getName());

  public RecommendedIndexer() {
  }

  public Document filter(Document doc, Parse parse, FetcherOutput fo)
      throws IndexingException {
    // Must be the same key that RecommendedParser stored the term under
    String recommendation = parse.getData().get("Recommended");
    if (recommendation != null) {
      // Index the term in its own field and boost that field
      Field recommendedField = Field.Text("recommended", recommendation);
      recommendedField.setBoost(5.0f);
      doc.add(recommendedField);
      LOG.info("Added " + recommendation + " to the recommended Field");
    }
    return doc;
  }
}

The QueryFilter

When a user runs a search, the QueryFilter is invoked, and the boost value on the recommended field affects how the results are ranked.

package org.apache.nutch.parse.recommended;

import org.apache.nutch.searcher.FieldQueryFilter;
import java.util.logging.Logger;
import org.apache.nutch.util.LogFormatter;

public class RecommendedQueryFilter extends FieldQueryFilter {

  // Log under this class's own name (the original used RecommendedParser here)
  private static final Logger LOG = LogFormatter
      .getLogger(RecommendedQueryFilter.class.getName());

  public RecommendedQueryFilter() {
    // Query the "recommended" field with a boost of 5
    super("recommended", 5f);
    LOG.info("Added a recommended query");
  }
}
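To see why a boost moves recommended pages up the result list, here is a deliberately simplified toy model in plain Java. It is not Lucene's scoring formula and uses no Nutch or Lucene classes: `BoostRankingSketch`, `Doc`, `score` and `rank` are hypothetical names, and the score is assumed to be just the sum of the boosts of the fields that contain the search term.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class BoostRankingSketch {

    /** A toy document: an id plus a map of field name to field text. */
    static class Doc {
        final String id;
        final Map<String, String> fields = new HashMap<>();

        Doc(String id) {
            this.id = id;
        }

        Doc with(String field, String text) {
            fields.put(field, text);
            return this;
        }
    }

    /** Toy score: add a field's boost for every field whose text contains the term. */
    static double score(Doc doc, String term, Map<String, Double> fieldBoosts) {
        double total = 0.0;
        for (Map.Entry<String, Double> boost : fieldBoosts.entrySet()) {
            String text = doc.fields.get(boost.getKey());
            if (text != null && Arrays.asList(text.toLowerCase().split("\\s+")).contains(term)) {
                total += boost.getValue();
            }
        }
        return total;
    }

    /** Returns document ids ordered from highest to lowest toy score. */
    static List<String> rank(List<Doc> docs, String term, Map<String, Double> fieldBoosts) {
        List<Doc> sorted = new ArrayList<>(docs);
        sorted.sort(Comparator.comparingDouble((Doc d) -> score(d, term, fieldBoosts)).reversed());
        List<String> ids = new ArrayList<>();
        for (Doc d : sorted) {
            ids.add(d.id);
        }
        return ids;
    }

    public static void main(String[] args) {
        Map<String, Double> boosts = new HashMap<>();
        boosts.put("content", 1.0);      // ordinary field
        boosts.put("recommended", 5.0);  // boosted field, as in the query filter

        List<Doc> docs = new ArrayList<>();
        docs.add(new Doc("blog").with("content", "some notes about plugins"));
        docs.add(new Doc("pluginCentral").with("content", "welcome to the central page")
                                         .with("recommended", "plugins"));

        System.out.println(rank(docs, "plugins", boosts));
    }
}
```

In this toy model the pluginCentral page outranks the blog page for the term "plugins" even though its body text never mentions the word, because the match happens in the boosted field.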

Getting Nutch to use your plugin

To get Nutch to use your plugin, edit conf/nutch-site.xml and add the following block:

<property>
  <name>plugin.includes</name>
  <value>nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)|index-basic|query-(basic|site|url)|recommended</value>
  <description>Regular expression naming plugin directory names to
  include. Any plugin not matching this expression is excluded.
  In any case you need at least include the nutch-extensionpoints plugin. By
  default Nutch includes crawling just HTML and plain text via HTTP,
  and basic indexing and search plugins.
  </description>
</property>
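Because plugin.includes is a regular expression, a misspelled plugin name simply fails to match and the plugin is silently left out. The sketch below checks a few names against the expression shown above using `java.util.regex`. It assumes that a plugin is included only when its whole name matches the expression; the class `PluginIncludesSketch` and the method `isIncluded` are hypothetical and are not part of Nutch.

```java
import java.util.regex.Pattern;

class PluginIncludesSketch {

    /** The plugin.includes value from the configuration block. */
    static final String INCLUDES =
        "nutch-extensionpoints|protocol-http|urlfilter-regex|parse-(text|html)"
        + "|index-basic|query-(basic|site|url)|recommended";

    private static final Pattern INCLUDES_PATTERN = Pattern.compile(INCLUDES);

    /** Assumption of this sketch: the complete plugin name has to match the expression. */
    static boolean isIncluded(String pluginName) {
        return INCLUDES_PATTERN.matcher(pluginName).matches();
    }

    public static void main(String[] args) {
        String[] names = {"recommended", "parse-html", "parse-pdf", "reccomended"};
        for (String name : names) {
            System.out.println(name + " -> " + isIncluded(name));
        }
    }
}
```

Running a check like this before a crawl is a cheap way to confirm that the name in the configuration agrees with the plugin's directory name.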

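Compiling your plugin with Ant

Before compiling, edit src/plugin/build.xml, which holds the compile and deploy settings for the plugins.

You will see many lines of the following form: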

<ant dir="[plugin-name]" target="deploy" />

Add a new line before </target>; the dir value must match the name of your plugin directory:

<ant dir="recommended" target="deploy" />

Running ‘ant’ in the root of your checkout directory should get everything compiled and jarred up. The next time you run a crawl, your parser and index filter should get used.

You’ll need to run ‘ant war’ to compile a new ROOT.war file. Once you’ve deployed that, your query filter should get used when searches are performed.

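Plugins and class loading in Nutch

The best thing for a plugin developer is freedom: you do not have to care what other plugin developers are doing, and you are free to use third-party jar libraries.

How does Nutch solve the class loading problem?

Nutch uses a very simple approach: each plugin has its own class loader, which is initialized before the plugin starts.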
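The one-loader-per-plugin idea can be sketched with the JDK's own `URLClassLoader`. This is an illustration under assumptions, not Nutch's implementation: `PluginClassLoaderSketch`, `loaderFor` and `sameClass` are hypothetical names, and the empty jar lists stand in for the jars a real plugin would ship.

```java
import java.net.URL;
import java.net.URLClassLoader;
import java.util.HashMap;
import java.util.Map;

class PluginClassLoaderSketch {

    private final Map<String, URLClassLoader> loaders = new HashMap<>();

    /** Returns the class loader of a plugin, creating it the first time it is asked for. */
    synchronized ClassLoader loaderFor(String pluginId, URL[] pluginJars) {
        URLClassLoader loader = loaders.get(pluginId);
        if (loader == null) {
            // The parent is the application loader, so core classes stay shared,
            // while each plugin's own jars are visible only through its own loader.
            loader = new URLClassLoader(pluginJars, PluginClassLoaderSketch.class.getClassLoader());
            loaders.put(pluginId, loader);
        }
        return loader;
    }

    /** True if both loaders resolve the given class name to the same Class object. */
    static boolean sameClass(ClassLoader a, ClassLoader b, String className) {
        try {
            return a.loadClass(className) == b.loadClass(className);
        } catch (ClassNotFoundException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        PluginClassLoaderSketch registry = new PluginClassLoaderSketch();
        ClassLoader recommended = registry.loaderFor("recommended", new URL[0]);
        ClassLoader parseHtml = registry.loaderFor("parse-html", new URL[0]);

        System.out.println(recommended != parseHtml);                               // separate loaders
        System.out.println(sameClass(recommended, parseHtml, "java.lang.String"));  // shared core class
    }
}
```

With separate loaders, two plugins can bundle different versions of the same third-party library without clashing, which is what gives plugin authors the freedom described above.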

Writing plugins - by Stefan

Plugins in Nutch 0.7

To use these plugins in Nutch, just edit conf/nutch-site.xml and add the names of the plugins you want to the plugin.includes list.

clustering-carrot2 – Online Search Results Clustering using Carrot2’s Lingo component.
creativecommons – Support for crawling and searching Creative-Commons licensed content.
index-basic – Adds url, content and anchor fields to the index.
index-more – Adds date, content-length, contentType, primaryType and subtype fields to the index.
languageidentifier – Adds a lang field to the index and allows you to query against it.
ontology – Helps refine queries based on owl files.
parse-ext – A wrapper that invokes external command to do real parsing job.
parse-html – Parses HTML documents
parse-js – Parses JavaScript
parse-mp3 – Parses MP3s
parse-msword – Parses MS Word documents
parse-pdf – Parses PDFs
parse-rss – Parses RSS feeds
parse-rtf – Parses RTF files
parse-text – Parses text documents
protocol-file – Retrieves documents from the filesystem
protocol-ftp – Retrieves documents through ftp
protocol-http – Retrieves documents through http
protocol-httpclient – Retrieves documents through http and https
query-basic – Runs queries against content, url and anchor fields
query-more – Runs queries against date, content-length, contentType, primaryType and subType fields.
query-site – Runs queries against site field
query-url – Runs queries against url field.
urlfilter-prefix
urlfilter-regex

Additional Plugins in Dev Branch (0.8)

analysis-de
analysis-fr
lib-commons-httpclient
lib-http
lib-jakarta-poi
lib-log4j
lib-lucene-analyzers
lib-nekohtml
lib-parsems
parse-msexcel – Parses MS Excel documents
parse-mspowerpoint – Parses MS PowerPoint documents
parse-oo – Parses Open Office and Star Office documents (Extensions: ODT, OTT, ODH, ODM, ODS, OTS, ODP, OTP, SXW, STW, SXC, STC, SXI, STI)
parse-swf – Parses Flash SWF files
microformats-reltag – Adds rel-tag fields to the index and runs queries against them.
parse-zip