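When Heritrix is used through its Web UI, clicking a job's start button begins the crawl. What does the internal execution flow look like? The sections below walk through it step by step.
(I) CrawlJobHandler
Clicking the start button causes the handler's startCrawler() method to be called: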
if (sAction.equalsIgnoreCase("start"))
{
    // Tell handler to start crawl job
    handler.startCrawler();
}
Now look at how startCrawler() executes:
public class CrawlJobHandler implements CrawlStatusListener {
    public void startCrawler() {
        running = true;
        if (pendingCrawlJobs.size() > 0 && isCrawling() == false) {
            // Ok, can just start the next job
            startNextJob();
        }
    }

    protected final void startNextJob() {
        synchronized (this) {
            if (startingNextJob != null) {
                try {
                    startingNextJob.join();
                } catch (InterruptedException e) {
                    e.printStackTrace();
                    return;
                }
            }
            startingNextJob = new Thread(new Runnable() {
                public void run() {
                    startNextJobInternal();
                }
            }, "StartNextJob");
            // Start the thread that launches the next job
            startingNextJob.start();
        }
    }

    protected void startNextJobInternal() {
        if (pendingCrawlJobs.size() == 0 || isCrawling()) {
            // No job ready or already crawling.
            return;
        }
        // Take the first job from the pending list
        this.currentJob = (CrawlJob)pendingCrawlJobs.first();
        assert pendingCrawlJobs.contains(currentJob) :
            "pendingCrawlJobs is in an illegal state";
        // Remove it from the pending list
        pendingCrawlJobs.remove(currentJob);
        try {
            this.currentJob.setupForCrawlStart();
            // This is ugly but needed so I can clear the currentJob
            // reference in the crawlEnding and update the list of completed
            // jobs. Also, crawlEnded can startup next job.
            this.currentJob.getController().addCrawlStatusListener(this);
            // now, actually start
            // This is where the controller really begins executing
            this.currentJob.getController().requestCrawlStart();
        } catch (InitializationException e) {
            loadJob(getStateJobFile(this.currentJob.getDirectory()));
            this.currentJob = null;
            startNextJobInternal(); // Load the next job if there is one.
        }
    }
}
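From the code above, the overall flow is easy to trace: startCrawler() calls startNextJob(), which spawns the "StartNextJob" thread to run startNextJobInternal(). That method takes the first pending job, calls setupForCrawlStart() on it, and registers the handler as a status listener.
As can be seen, the flow ends by invoking the requestCrawlStart() method of the job's CrawlController.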
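The hand-off pattern in this flow (a pending queue, a short-lived launcher thread, and a join on the previous launcher) can be tried in isolation. The following is a minimal sketch, not Heritrix code: JobHandlerSketch, addJob(), awaitStarter(), and the no-argument crawlEnded() are hypothetical names, jobs are plain strings, and recording a job name stands in for the call to requestCrawlStart().

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;
import java.util.concurrent.ConcurrentLinkedQueue;
import java.util.concurrent.CopyOnWriteArrayList;

// Hypothetical stand-in for CrawlJobHandler; all names here are illustrative.
final class JobHandlerSketch {
    private final Queue<String> pendingJobs = new ConcurrentLinkedQueue<>();
    private final List<String> startedJobs = new CopyOnWriteArrayList<>();
    private volatile String currentJob;   // null while nothing is crawling
    private Thread startingNextJob;       // guarded by "this"

    void addJob(String name) { pendingJobs.add(name); }

    boolean isCrawling() { return currentJob != null; }

    void startCrawler() {
        if (!pendingJobs.isEmpty() && !isCrawling()) {
            startNextJob();
        }
    }

    void startNextJob() {
        synchronized (this) {
            if (startingNextJob != null) {
                try {
                    // Wait for the previous launcher before creating a new one.
                    startingNextJob.join();
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                    return;
                }
            }
            startingNextJob = new Thread(this::startNextJobInternal, "StartNextJob");
            startingNextJob.start();
        }
    }

    // Deliberately not synchronized on "this": startNextJob() may hold that
    // lock while joining this very thread.
    private void startNextJobInternal() {
        if (isCrawling()) {
            return;
        }
        String job = pendingJobs.poll();
        if (job == null) {
            return;
        }
        currentJob = job;
        startedJobs.add(job);   // stands in for controller.requestCrawlStart()
    }

    // Simplified end-of-crawl callback: clear the current job, start the next.
    void crawlEnded() {
        currentJob = null;
        startCrawler();
    }

    // Helper for the demo only: block until the latest launcher has finished.
    void awaitStarter() {
        Thread t;
        synchronized (this) { t = startingNextJob; }
        if (t == null) {
            return;
        }
        try {
            t.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }

    static List<String> demo() {
        JobHandlerSketch handler = new JobHandlerSketch();
        handler.addJob("job-1");
        handler.addJob("job-2");
        handler.startCrawler();
        handler.awaitStarter();
        handler.crawlEnded();     // pretend job-1 finished
        handler.awaitStarter();
        return new ArrayList<>(handler.startedJobs);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

In this sketch the jobs start strictly one after another, in queue order, because a new launcher thread is only created after the previous one has been joined.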
(II) CrawlController
This class is the core component of a crawl job: it controls when the whole crawl starts and ends.
First, a look at its source code:
package org.archive.crawler.framework;

public class CrawlController implements Serializable, Reporter {
    // ...
    // key subcomponents which define and implement a crawl in progress
    private transient CrawlOrder order;
    private transient CrawlScope scope;
    private transient ProcessorChainList processorChains;
    private transient Frontier frontier;
    private transient ToePool toePool;
    private transient ServerCache serverCache;
    // This gets passed into the initialize method.
    private transient SettingsHandler settingsHandler;
    // ...
}
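CrawlOrder: holds the configuration attributes from order.xml for this crawl job.
CrawlScope: the component that decides the scope of the current crawl.
ProcessorChainList: as the name suggests, the list of processor chains.
Frontier: the URL scheduler; it decides which URL is processed next.
ToePool: a thread pool that manages all the worker threads created for this crawl job.
ServerCache: a cache that holds the host names and server names encountered during the current job.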
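Before a CrawlController instance can run a crawl, the following steps are required:
(1) Construct an XMLSettingsHandler object, load the attributes from order.xml into it, and call its initialize() method.
(2) Call the CrawlController constructor to create a CrawlController instance.
(3) Call CrawlController's initialize(SettingsHandler) method to initialize the instance, passing in the XMLSettingsHandler instance constructed in step (1).
(4) Once these three steps are complete, the CrawlController is ready to run. Calling its requestCrawlStart() method then starts the thread pool and the Frontier, and pages are fetched continuously from that point on.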
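The ordering of these four steps can be sketched without the Heritrix jar. The classes below are hypothetical stand-ins, not the real XMLSettingsHandler or CrawlController: Settings reads from an in-memory map instead of order.xml, and the "user-agent" key and its value are assumptions made purely for illustration. The sketch only shows why each step depends on the one before it.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-ins that mirror the four-step start-up order.
final class ControllerLifecycleSketch {

    /** Stand-in for XMLSettingsHandler; a map replaces order.xml. */
    static final class Settings {
        private final Map<String, String> source;
        private Map<String, String> loaded;   // null until initialize()

        Settings(Map<String, String> source) { this.source = source; }

        void initialize() { loaded = new HashMap<>(source); }

        String get(String key) {
            if (loaded == null) {
                throw new IllegalStateException("settings not initialized");
            }
            return loaded.get(key);
        }
    }

    /** Stand-in for CrawlController. */
    static final class Controller {
        private Settings settings;
        private boolean running;

        void initialize(Settings s) {
            // Reading a setting fails if step (1) was skipped.
            if (s.get("user-agent") == null) {
                throw new IllegalArgumentException("user-agent is required");
            }
            this.settings = s;
        }

        void requestCrawlStart() {
            if (settings == null) {
                throw new IllegalStateException("initialize(...) must be called first");
            }
            running = true;   // the real controller starts the Frontier here
        }

        boolean isRunning() { return running; }
    }

    static boolean startInOrder() {
        Map<String, String> order = new HashMap<>();
        order.put("user-agent", "demo-crawler (+http://example.org/)");

        Settings settings = new Settings(order);   // step (1): build settings
        settings.initialize();                     //           and initialize them
        Controller controller = new Controller();  // step (2): construct
        controller.initialize(settings);           // step (3): initialize
        controller.requestCrawlStart();            // step (4): start
        return controller.isRunning();
    }

    static boolean startWithoutInitializeFails() {
        try {
            new Controller().requestCrawlStart();
            return false;
        } catch (IllegalStateException expected) {
            return true;
        }
    }

    public static void main(String[] args) {
        System.out.println("in order: " + startInOrder());
        System.out.println("skipping initialize fails: " + startWithoutInitializeFails());
    }
}
```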
First, look at the initialize(SettingsHandler) method:
public void initialize(SettingsHandler sH)
throws InitializationException {
    sendCrawlStateChangeEvent(PREPARING, CrawlJob.STATUS_PREPARING);
    this.singleThreadLock = new ReentrantLock();
    this.settingsHandler = sH;
    // Fetch the order from the XMLSettingsHandler
    this.order = settingsHandler.getOrder();
    this.order.setController(this);
    this.bigmaps = new Hashtable<String,CachedBdbMap<?,?>>();
    sExit = "";
    this.manifest = new StringBuffer();
    String onFailMessage = "";
    try {
        onFailMessage = "You must set the User-Agent and From HTTP" +
            " header values to acceptable strings. \n" +
            " User-Agent: [software-name](+[info-url])[misc]\n" +
            " From: [email-address]\n";
        // Check that the user-supplied User-Agent and From values are well formed
        order.checkUserAgentAndFrom();
        onFailMessage = "Unable to setup disk";
        if (disk == null) {
            // Set up the directory structure where crawl output is stored
            setupDisk();
        }
        onFailMessage = "Unable to create log file(s)";
        // Initialize the loggers
        setupLogs();
        onFailMessage = "Unable to test/run checkpoint recover";
        this.checkpointRecover = getCheckpointRecover();
        if (this.checkpointRecover == null) {
            this.checkpointer =
                new Checkpointer(this, this.checkpointsDisk);
        } else {
            setupCheckpointRecover();
        }
        onFailMessage = "Unable to setup bdb environment.";
        // Initialize the Berkeley DB environment
        setupBdb();
        onFailMessage = "Unable to setup statistics";
        setupStatTracking();
        onFailMessage = "Unable to setup crawl modules";
        // Initialize the Scope, the Frontier, and the processor chains
        setupCrawlModules();
    } catch (Exception e) {
        String tmp = "On crawl: "
            + settingsHandler.getSettingsObject(null).getName() + " " +
            onFailMessage;
        LOGGER.log(Level.SEVERE, tmp, e);
        throw new InitializationException(tmp, e);
    }
    Lookup.getDefaultCache(DClass.IN).setMaxEntries(1);
    //dns.getRecords("localhost", Type.A, DClass.IN);
    // Instantiate the thread pool
    setupToePool();
    setThresholds();
    reserveMemory = new LinkedList<char[]>();
    for(int i = 1; i < RESERVE_BLOCKS; i++) {
        reserveMemory.add(new char[RESERVE_BLOCK_SIZE]);
    }
}
As shown, initialize() mainly performs setup work, but all of it is required for Heritrix to run.
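One idiom in initialize() is worth isolating: onFailMessage is overwritten just before each setup step, so a single catch block can report which step failed. The sketch below reproduces only that idiom. InitStepsSketch, InitException (a stand-in for InitializationException), runStep(), and the step names are hypothetical, and the job name "demo" is made up.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the "update the failure message before each step" idiom.
final class InitStepsSketch {

    /** Stand-in for InitializationException. */
    static final class InitException extends Exception {
        InitException(String message, Throwable cause) { super(message, cause); }
    }

    // Runs three fake setup steps; failingStep names the one that should throw
    // (pass null for a clean run). Returns the steps that completed.
    static List<String> initialize(String failingStep) throws InitException {
        List<String> done = new ArrayList<>();
        String onFailMessage = "";
        try {
            onFailMessage = "Unable to setup disk";
            runStep("disk", failingStep, done);
            onFailMessage = "Unable to create log file(s)";
            runStep("logs", failingStep, done);
            onFailMessage = "Unable to setup crawl modules";
            runStep("modules", failingStep, done);
        } catch (Exception e) {
            // One handler for every step: the message identifies the culprit.
            throw new InitException("On crawl: demo " + onFailMessage, e);
        }
        return done;
    }

    private static void runStep(String name, String failingStep, List<String> done) {
        if (name.equals(failingStep)) {
            throw new IllegalStateException(name + " failed");
        }
        done.add(name);
    }

    // Convenience wrapper: the failure message, or null if nothing failed.
    static String failureMessageFor(String failingStep) {
        try {
            initialize(failingStep);
            return null;
        } catch (InitException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(failureMessageFor("logs"));
        System.out.println(failureMessageFor(null));
    }
}
```

The design trades one try block per step for a single shared handler; the cost is that the message variable must be kept in sync with the step that follows it.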
Next comes the core of CrawlController, the requestCrawlStart() method:
public void requestCrawlStart() {
    // Initialize the processor chains
    runProcessorInitialTasks();
    sendCrawlStateChangeEvent(STARTED, CrawlJob.STATUS_PENDING);
    String jobState;
    state = RUNNING;
    jobState = CrawlJob.STATUS_RUNNING;
    sendCrawlStateChangeEvent(this.state, jobState);
    // A proper exit will change this value.
    this.sExit = CrawlJob.STATUS_FINISHED_ABNORMAL;
    Thread statLogger = new Thread(statistics);
    statLogger.setName("StatLogger");
    // Start the statistics logging thread
    statLogger.start();
    // Start the Frontier; crawling begins
    frontier.start();
}
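After all that preparation, the method ultimately calls the Frontier's start() method. The Frontier then supplies URIs to the threads in the thread pool, and the crawl truly begins. At this point the crawl job is under way.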
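The relationship described here, a Frontier handing URIs to pool threads, is a producer/consumer arrangement. The following is a minimal sketch under simplifying assumptions, not the Heritrix implementation: a BlockingQueue plays the Frontier, plain threads play the pool's worker threads, "fetching" just records the URI, and an end marker per worker is used to stop the demo. FrontierSketch, crawl(), and END are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.LinkedBlockingQueue;

// Hypothetical sketch: a queue acting as the Frontier feeds worker threads.
final class FrontierSketch {
    // End marker; an assumption of this demo so the workers can terminate.
    private static final String END = "<end>";

    static Set<String> crawl(List<String> seeds, int threadCount) {
        BlockingQueue<String> frontier = new LinkedBlockingQueue<>(seeds);
        // FIFO order guarantees the markers are taken only after every seed.
        for (int i = 0; i < threadCount; i++) {
            frontier.add(END);
        }
        Set<String> fetched = ConcurrentHashMap.newKeySet();

        Runnable toe = () -> {
            try {
                while (true) {
                    String uri = frontier.take();   // ask the frontier for work
                    if (END.equals(uri)) {
                        return;
                    }
                    fetched.add(uri);               // stands in for the processor chain
                }
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        };

        List<Thread> pool = new ArrayList<>();
        for (int i = 0; i < threadCount; i++) {
            Thread t = new Thread(toe, "ToeThread #" + i);
            pool.add(t);
            t.start();
        }
        for (Thread t : pool) {
            try {
                t.join();
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
        return fetched;
    }

    public static void main(String[] args) {
        Set<String> result = crawl(
            Arrays.asList("http://a.example/", "http://b.example/", "http://c.example/"), 2);
        System.out.println(result.size() + " URIs fetched");
    }
}
```

A real crawl never sees a fixed seed list drained this way, since processing a page typically adds newly discovered URIs back to the Frontier; the sketch leaves that feedback loop out.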
Main reference: 《開發自己的搜索引擎: Lucene 2.0+Heritrix》 (Developing Your Own Search Engine: Lucene 2.0 + Heritrix)