Nutch Source Code Reading Notes, Part 4

Date: 2019-12-07

Tags: nutch, source code reading


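The previous posts walked through Nutch's preparation steps, inject and generate, and then the fetch step. While the momentum is there, let's take a look at parse, the page-parsing step. Most of this code lives in the ParseSegment class. Let's go~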

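Recap: the last post covered how Nutch's fetch step is implemented. It takes the specified folder under the segments directory as input, reads the URLs to be crawled into a fetch queue, and starts as many consumer threads as the user-supplied thread count. Those threads take URLs off the queue in a thread-safe way, fetch the pages, and parse the page source to extract URLs and so on. The end result is the content and crawl_fetch folders under the segment. Now let's see what Nutch's parse is all about.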

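1. The parse step starts at the call parseSegment.parse(segs[0]);. Inside the parse method of the ParseSegment class, the current time is recorded first (so that the total time taken by the parse can be computed from the end time later). What follows is a MapReduce job. The job is initialized as follows: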

JobConf job = new NutchJob(getConf());
job.setJobName("parse " + segment);

FileInputFormat.addInputPath(job, new Path(segment, Content.DIR_NAME));
job.set(Nutch.SEGMENT_NAME_KEY, segment.getName());
job.setInputFormat(SequenceFileInputFormat.class);
job.setMapperClass(ParseSegment.class);
job.setReducerClass(ParseSegment.class);

FileOutputFormat.setOutputPath(job, segment);
job.setOutputFormat(ParseOutputFormat.class);
job.setOutputKeyClass(Text.class);
job.setOutputValueClass(ParseImpl.class);

JobClient.runJob(job);

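As the code shows, the input is the Content.DIR_NAME folder under the segment and the output is the segment folder itself; what changes is that new folders appear under the segment. Both the mapper and the reducer are the ParseSegment class.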
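The timing bookkeeping mentioned in step 1 could be sketched roughly as below. This is only an illustration in plain Java: the class name ParseTiming, the helper formatElapsed and the log wording are made up for this example, and Thread.sleep stands in for JobClient.runJob(job).

```java
import java.text.SimpleDateFormat;
import java.util.Date;

class ParseTiming {
    // Hypothetical helper: formats a millisecond interval as HH:mm:ss.
    static String formatElapsed(long startMillis, long endMillis) {
        long seconds = (endMillis - startMillis) / 1000;
        return String.format("%02d:%02d:%02d",
                seconds / 3600, (seconds % 3600) / 60, seconds % 60);
    }

    public static void main(String[] args) throws InterruptedException {
        SimpleDateFormat sdf = new SimpleDateFormat("yyyy-MM-dd HH:mm:ss");

        // Record the start time before the job is submitted.
        long start = System.currentTimeMillis();
        System.out.println("parse: starting at " + sdf.format(new Date(start)));

        Thread.sleep(50); // stand-in for JobClient.runJob(job)

        // Record the end time and report the difference.
        long end = System.currentTimeMillis();
        System.out.println("parse: finished at " + sdf.format(new Date(end))
                + ", elapsed: " + formatElapsed(start, end));
    }
}
```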

2. Now let's look at the map and reduce methods of ParseSegment. The map() method starts with some checks; its main work is concentrated in the following code:

ParseResult parseResult = null;
try {
  parseResult = new ParseUtil(getConf()).parse(content);
} catch (Exception e) {
  LOG.warn("Error parsing: " + key + ": " + StringUtils.stringifyException(e));
  return;
}

for (Entry<Text, Parse> entry : parseResult) {
  Text url = entry.getKey(); // http://www.ahu.edu.cn/
  Parse parse = entry.getValue();
  ParseStatus parseStatus = parse.getData().getStatus(); // success(1,0)
  long start = System.currentTimeMillis();

  reporter.incrCounter("ParserStatus", ParseStatus.majorCodes[parseStatus.getMajorCode()], 1);

  if (!parseStatus.isSuccess()) {
    LOG.warn("Error parsing: " + key + ": " + parseStatus);
    parse = parseStatus.getEmptyParse(getConf());
  }

  // pass segment name to parse data
  parse.getData().getContentMeta().set(Nutch.SEGMENT_NAME_KEY,
      getConf().get(Nutch.SEGMENT_NAME_KEY));

  // compute the new signature
  byte[] signature =
      SignatureFactory.getSignature(getConf()).calculate(content, parse);
  parse.getData().getContentMeta().set(Nutch.SIGNATURE_KEY,
      StringUtil.toHexString(signature));

  try {
    scfilters.passScoreAfterParsing(url, content, parse);
  } catch (ScoringFilterException e) {
    if (LOG.isWarnEnabled()) {
      LOG.warn("Error passing score: " + url + ": " + e.getMessage());
    }
  }
  long end = System.currentTimeMillis();
  LOG.info("Parsed (" + Long.toString(end - start) + "ms):" + url);

  output.collect(url, new ParseImpl(new ParseText(parse.getText()),
      parse.getData(), parse.isCanonical()));
}

Here parseResult is produced by new ParseUtil(getConf()).parse(content);. Stepping into ParseUtil, the constructor looks like this:

public ParseUtil(Configuration conf) {
  this.parserFactory = new ParserFactory(conf);
  MAX_PARSE_TIME = conf.getInt("parser.timeout", 30);
}
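The post only shows parser.timeout being read (defaulting to 30), not how it is used. A minimal sketch of one way such a limit can be enforced, assuming the value is a number of seconds and the parser runs on a separate thread, is shown below. The class TimedParseSketch and the method parseWithTimeout are hypothetical names for this illustration and are not part of Nutch.

```java
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

class TimedParseSketch {
    // Runs the task on a worker thread; returns null if it exceeds the limit or fails.
    static String parseWithTimeout(Callable<String> parseTask, long timeoutSeconds) {
        ExecutorService executor = Executors.newSingleThreadExecutor();
        Future<String> future = executor.submit(parseTask);
        try {
            return future.get(timeoutSeconds, TimeUnit.SECONDS);
        } catch (TimeoutException e) {
            future.cancel(true); // interrupt the slow parser
            return null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt flag
            return null;
        } catch (ExecutionException e) {
            return null; // the parser itself threw
        } finally {
            executor.shutdownNow();
        }
    }

    public static void main(String[] args) {
        // Finishes well inside the 30 second limit.
        System.out.println(parseWithTimeout(() -> "parsed text", 30));
        // Sleeps longer than the 1 second limit, so the call gives up.
        System.out.println(parseWithTimeout(() -> {
            Thread.sleep(5_000);
            return "too late";
        }, 1));
    }
}
```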

ParserFactory, in turn, relies on plugins to handle the actual page parsing. The ParserFactory constructor is as follows:

public ParserFactory(Configuration conf) {
  this.conf = conf;
  ObjectCache objectCache = ObjectCache.get(conf);
  this.extensionPoint = PluginRepository.get(conf).getExtensionPoint(
      Parser.X_POINT_ID);
  this.parsePluginList = (ParsePluginList) objectCache.getObject(ParsePluginList.class.getName());
  if (this.parsePluginList == null) {
    this.parsePluginList = new ParsePluginsReader().parse(conf);
    objectCache.setObject(ParsePluginList.class.getName(), this.parsePluginList);
  }

  if (this.extensionPoint == null) {
    throw new RuntimeException("x point " + Parser.X_POINT_ID + " not found.");
  }
  if (this.parsePluginList == null) {
    throw new RuntimeException(
        "Parse Plugins preferences could not be loaded.");
  }
}

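Admittedly, I am not yet clear on exactly how the plugins get invoked to do this, but terms such as PluginRepository (the plugin repository) and extensionPoint (the extension point) already show up in the code.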
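To make the idea of an extension point a bit more concrete, here is a toy sketch in plain Java. It is not Nutch's plugin API: PageParser, ExtensionPoint and choose are hypothetical names, and the plugin ids and the content-type preference list are only illustrative. The assumption is that parsers register under an extension point and that a preference list decides which registered parser handles a given content type.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class PluginLookupSketch {
    // Hypothetical stand-in for a parser extension.
    interface PageParser {
        String parse(String rawContent);
    }

    // Hypothetical stand-in for an extension point: a slot that plugins register into.
    static class ExtensionPoint {
        private final Map<String, PageParser> extensionsById = new LinkedHashMap<>();

        void register(String pluginId, PageParser parser) {
            extensionsById.put(pluginId, parser);
        }

        PageParser get(String pluginId) {
            return extensionsById.get(pluginId);
        }
    }

    // Walks the preference list for the content type and returns the first registered parser.
    static PageParser choose(ExtensionPoint point,
                             Map<String, List<String>> preferences,
                             String contentType) {
        List<String> preferred =
                preferences.getOrDefault(contentType, Collections.<String>emptyList());
        for (String pluginId : preferred) {
            PageParser parser = point.get(pluginId);
            if (parser != null) {
                return parser;
            }
        }
        throw new IllegalStateException("no parser plugin for " + contentType);
    }

    public static void main(String[] args) {
        ExtensionPoint parserPoint = new ExtensionPoint();
        // A crude tag stripper plays the role of an HTML parser plugin.
        parserPoint.register("parse-html", raw -> raw.replaceAll("<[^>]*>", "").trim());
        parserPoint.register("parse-text", raw -> raw);

        Map<String, List<String>> preferences = new LinkedHashMap<>();
        // "parse-tika" is listed first but never registered, so lookup falls through.
        preferences.put("text/html", Arrays.asList("parse-tika", "parse-html"));
        preferences.put("text/plain", Arrays.asList("parse-text"));

        PageParser parser = choose(parserPoint, preferences, "text/html");
        System.out.println(parser.parse("<p>hello</p>"));
    }
}
```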

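Back in the map method, stepping through with a debugger shows the following information (the fields appear to be the string form of the Content record being parsed rather than of the ParseResult itself):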

Version: -1
url: http://www.ahu.edu.cn/
base: http://www.ahu.edu.cn/
contentType: application/xhtml+xml
metadata: Date=Sat, 02 Aug 2014 13:46:36 GMT nutch.crawl.score=1.0 _fst_=33 nutch.segment.name=20140802214742 Content-Type=text/html Connection=close Accept-Ranges=bytes Server=Apache/2.2.8 (Unix) mod_ssl/2.2.8 OpenSSL/0.9.8e-fips-rhel5 DAV/2 Resin/3.0.25
Content:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />……

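A for loop then walks through the parse results. Text url = entry.getKey(); gets the URL currently being parsed, and Parse parse = entry.getValue(); gets its parse. The Text property of that parse is the main body of the page, i.e. what remains after the page markup has been filtered out. The rest of the code mainly collects the parsed content as output.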
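The shape of that loop can be reproduced in a small standalone sketch. Everything here is a simplification for illustration: SimpleParse is a hypothetical stand-in for the Parse object, plain maps replace the reporter and the output collector, and the signature helper assumes an MD5 digest of the raw content, which may differ from the signature class actually configured.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

class MapLoopSketch {
    // Hypothetical, simplified parse entry: extracted text plus a success flag.
    static class SimpleParse {
        final String text;
        final boolean success;

        SimpleParse(String text, boolean success) {
            this.text = text;
            this.success = success;
        }
    }

    // Converts a digest to a lowercase hex string.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b));
        }
        return sb.toString();
    }

    // Assumption: an MD5 digest of the raw content stands in for the page signature.
    static String signature(String content) {
        try {
            MessageDigest md5 = MessageDigest.getInstance("MD5");
            return toHex(md5.digest(content.getBytes(StandardCharsets.UTF_8)));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    // Counts each status, swaps failed parses for an empty one, and emits (url, text).
    static Map<String, String> collect(Map<String, SimpleParse> parseResult,
                                       Map<String, Integer> counters) {
        Map<String, String> output = new LinkedHashMap<>();
        for (Map.Entry<String, SimpleParse> entry : parseResult.entrySet()) {
            SimpleParse parse = entry.getValue();
            counters.merge(parse.success ? "success" : "failed", 1, Integer::sum);
            if (!parse.success) {
                parse = new SimpleParse("", false); // empty parse replaces the failed one
            }
            output.put(entry.getKey(), parse.text);
        }
        return output;
    }

    public static void main(String[] args) {
        Map<String, SimpleParse> parseResult = new LinkedHashMap<>();
        parseResult.put("http://example.com/ok", new SimpleParse("page body text", true));
        parseResult.put("http://example.com/bad", new SimpleParse("garbage", false));

        Map<String, Integer> counters = new TreeMap<>();
        System.out.println(collect(parseResult, counters));
        System.out.println(counters);
        System.out.println(signature("<html>raw page</html>"));
    }
}
```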

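3. After map comes reduce. The reducer is a single line: output.collect(key, (Writable)values.next()); // collect first value. My reading of the built-in comment "collect first value" is that map handles one URL at a time, so there should be only one <Text, Parse> pair collected per key. Just my humble take~ At this point the whole parse process is done.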
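The effect of "collect first value" can be tried out with a tiny in-memory stand-in for the shuffle and reduce phases. This is a sketch only: groupByKey and reduceFirst are hypothetical helpers, not Hadoop calls. It shows that if the same key did arrive more than once, only the first value would survive.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class FirstValueReduceSketch {
    // Groups (key, value) pairs by key, roughly as the shuffle phase would.
    static Map<String, List<String>> groupByKey(List<String[]> pairs) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (String[] pair : pairs) {
            grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
        }
        return grouped;
    }

    // Keeps only the first value per key, like output.collect(key, values.next()).
    static Map<String, String> reduceFirst(Map<String, List<String>> grouped) {
        Map<String, String> out = new LinkedHashMap<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            Iterator<String> values = e.getValue().iterator();
            out.put(e.getKey(), values.next());
        }
        return out;
    }

    public static void main(String[] args) {
        List<String[]> mapOutput = Arrays.asList(
                new String[] {"http://a.example/", "parse-1"},
                new String[] {"http://b.example/", "parse-2"},
                new String[] {"http://a.example/", "parse-3"}); // duplicate key
        System.out.println(reduceFirst(groupByKey(mapOutput)));
    }
}
```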

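4. As for how the crawl_parse, parse_data and parse_text folders under the segment get created, look at the job's output format class above, ParseOutputFormat. Its main method, getRecordWriter(), begins with initialization and variable assignments, such as creating the URL filters and the URL normalizers and setting the time interval, the parse limit and similar variables. The output directories are then defined by these three lines: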

Path text = new Path(new Path(out, ParseText.DIR_NAME), name); // parse_text
Path data = new Path(new Path(out, ParseData.DIR_NAME), name); // parse_data
Path crawl = new Path(new Path(out, CrawlDatum.PARSE_DIR_NAME), name); // crawl_parse
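To see what these nested paths resolve to, here is a small sketch that uses java.nio.file.Path in place of Hadoop's Path. The helper outputFile is hypothetical, the segment name is borrowed from the debug dump earlier in this post, and part-00000 is an assumed value for name.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

class OutputPathSketch {
    // Mirrors new Path(new Path(out, dirName), name) with java.nio instead of Hadoop's Path.
    static Path outputFile(Path out, String dirName, String name) {
        return out.resolve(dirName).resolve(name);
    }

    public static void main(String[] args) {
        Path segment = Paths.get("segments", "20140802214742");
        String name = "part-00000"; // assumed per-task output name
        // One line per output directory: parse_text, parse_data, crawl_parse.
        for (String dir : new String[] {"parse_text", "parse_data", "crawl_parse"}) {
            System.out.println(outputFile(segment, dir, name));
        }
    }
}
```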

These three directories are then created through the following three writers:

final MapFile.Writer textOut =
    new MapFile.Writer(job, fs, text.toString(), Text.class, ParseText.class,
        CompressionType.RECORD, progress);
final MapFile.Writer dataOut =
    new MapFile.Writer(job, fs, data.toString(), Text.class, ParseData.class,
        compType, progress);
final SequenceFile.Writer crawlOut =
    SequenceFile.createWriter(fs, job, crawl, Text.class, CrawlDatum.class,
        compType, progress);

That is a simple walkthrough of the parse process. Compared with the previous three stages, the logic of the parse module is relatively simple.

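(Note: there are still parts of ParseOutputFormat that I have not figured out. The reference post below explains them in detail; it is worth a read if you are interested.)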

Reference post: http://blog.csdn.net/amuseme_lu/article/details/6727516
