I'm still a rookie who has only recently started working with web crawlers, and I want to learn about crawling by reading through the source code of someone else's crawler framework. If there are any mistakes, please bear with me and point them out.
Following the earlier walkthrough of crawler4j's robotstxt package, today let's take a look at the crawler package and the exception package.
The crawler package mainly contains the following classes:
1. Configurable: an abstract configuration class. It holds nothing but a reference to a CrawlConfig.
2. CrawlConfig: the concrete configuration class for a crawler. It exposes many parameters; here I'll only cover the main configurable ones.
resumableCrawling: controls whether crawls that have been stopped can be resumed. (Enabling it reduces crawling efficiency.)
maxDepthOfCrawling: the maximum crawl depth. If the first page has depth 0, then pages discovered from it have depth 1, and so on; URLs found on pages beyond the maximum depth are not added to the URL queue.
maxPagesToFetch: the maximum number of pages to fetch.
politenessDelay: the delay between two consecutive requests.
includeBinaryContentInCrawling and processBinaryContentInCrawling: whether binary content, such as images, is crawled and processed.
userAgentString: the crawler's user-agent string (its name).
proxyHost and proxyPort: the proxy server address and port. (You can read up on proxies yourself; in short, your crawler sends its HTTP request to the proxy server first. If the proxy already has a fresh result it returns it directly; otherwise the proxy forwards the request to the web server, gets the result and returns it.)
There are a few more parameters I won't go through one by one (some HTTP connection and timeout settings, plus a few I haven't figured out yet, such as onlineTldListUpdate). A minimal configuration sketch follows below.
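To make the parameters above more concrete, here is a minimal configuration sketch. It assumes the standard setter methods of crawler4j's CrawlConfig (exact names can vary slightly between versions), and the storage folder, limits and proxy values are placeholders I made up, not recommendations.

import edu.uci.ics.crawler4j.crawler.CrawlConfig;

public class ConfigSketch {
    public static CrawlConfig buildConfig() {
        CrawlConfig config = new CrawlConfig();
        // where intermediate crawl data is stored (placeholder path)
        config.setCrawlStorageFolder("/tmp/crawl-data");
        // allow a stopped crawl to be resumed (slows crawling down)
        config.setResumableCrawling(false);
        // URLs found on pages deeper than this are not queued
        config.setMaxDepthOfCrawling(3);
        // stop after this many pages have been fetched
        config.setMaxPagesToFetch(1000);
        // delay between two consecutive requests, in milliseconds
        config.setPolitenessDelay(200);
        // whether binary content such as images is crawled
        config.setIncludeBinaryContentInCrawling(false);
        // the crawler's user-agent string
        config.setUserAgentString("my-crawler4j-demo");
        // optional proxy settings (placeholder host and port)
        // config.setProxyHost("127.0.0.1");
        // config.setProxyPort(8080);
        return config;
    }
}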
3. WebCrawler: the crawler class, which implements Runnable. Since it implements Runnable, let's look at the run method first:

public void run() {
    onStart();
    while (true) {
        List<WebURL> assignedURLs = new ArrayList<>(50);
        isWaitingForNewURLs = true;
        frontier.getNextURLs(50, assignedURLs);    // ask the frontier for up to 50 URLs
        isWaitingForNewURLs = false;
        if (assignedURLs.isEmpty()) {
            if (frontier.isFinished()) {
                return;                            // nothing left anywhere: stop this crawler
            }
            try {
                Thread.sleep(3000);                // wait a bit for new URLs to show up
            } catch (InterruptedException e) {
                logger.error("Error occurred", e);
            }
        } else {
            for (WebURL curURL : assignedURLs) {
                if (myController.isShuttingDown()) {
                    logger.info("Exiting because of controller shutdown.");
                    return;
                }
                if (curURL != null) {
                    curURL = handleUrlBeforeProcess(curURL);
                    processPage(curURL);
                    frontier.setProcessed(curURL); // tell the frontier this URL is done
                }
            }
        }
    }
}
onStart() is an empty method by default, but we can override it to do our own setup before the crawl begins. After that, whenever the crawler's own batch of URLs runs out it pulls more from the global URL queue; if that queue is also finished, the crawler exits. If there are URLs and the controller is not shutting down, each one is processed, and finally the global URL manager, Frontier, is told that the URL has been processed.
Next let's look at the page-processing method; here is the main logic:
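Since onStart() (and onBeforeExit(), which appears later in the controller code) exists precisely to be overridden, here is a minimal sketch of such a subclass. MyCrawler is a hypothetical class I'm using for illustration, not something shipped with crawler4j; I'll keep extending it in the sketches further down.

import edu.uci.ics.crawler4j.crawler.WebCrawler;

public class MyCrawler extends WebCrawler {

    @Override
    public void onStart() {
        // runs in each crawler thread before it starts pulling URLs,
        // e.g. open a per-thread output file or record the start time
        logger.info("Crawler {} is starting", getMyId());
    }

    @Override
    public void onBeforeExit() {
        // runs once before the crawler thread exits; release per-thread resources here
        logger.info("Crawler {} is exiting", getMyId());
    }
}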
fetchResult = pageFetcher.fetchPage(curURL);           // fetch the page
Page page = new Page(curURL);                          // create a new Page
page.setFetchResponseHeaders(fetchResult.getResponseHeaders());
page.setStatusCode(statusCode);                        // status code is 200

if (!curURL.getURL().equals(fetchResult.getFetchedUrl())) {
    if (docIdServer.isSeenBefore(fetchResult.getFetchedUrl())) {
        throw new RedirectException(Level.DEBUG, "Redirect page: " + curURL + " has already been seen");
    }
    curURL.setURL(fetchResult.getFetchedUrl());
    curURL.setDocid(docIdServer.getNewDocID(fetchResult.getFetchedUrl()));
}

parser.parse(page, curURL.getURL());                   // parse the content into the page
ParseData parseData = page.getParseData();
for (WebURL webURL : parseData.getOutgoingUrls()) {
    int newdocid = docIdServer.getDocId(webURL.getURL());
    if (newdocid > 0) {
        // This is not the first time this URL is visited, so set the depth to a negative number.
        webURL.setDepth((short) -1);
        webURL.setDocid(newdocid);
    } else {                                           // a new URL: prepare it for the queue
        webURL.setDocid(-1);
        webURL.setDepth((short) (curURL.getDepth() + 1));
        if (shouldVisit(page, webURL)) {               // override this method to choose which pages to visit
            webURL.setDocid(docIdServer.getNewDocID(webURL.getURL()));
            toSchedule.add(webURL);
        }
    }
}
// add the collected URLs to the global URL queue
frontier.scheduleAll(toSchedule);
// override this method to process the fetched HTML
visit(page);

Many details are skipped here, but this is roughly how an HTTP request that returns status code 200 is handled.
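The shouldVisit(page, webURL) call above is the main filtering hook. As a rough sketch, the hypothetical MyCrawler subclass from earlier might restrict the crawl to a single made-up domain and skip obvious binary resources; it would additionally need imports for java.util.regex.Pattern, Page and WebURL.

// these members go into the MyCrawler subclass sketched earlier
private static final Pattern BINARY_EXTENSIONS =
        Pattern.compile(".*\\.(css|js|gif|jpe?g|png|ico|zip|gz|pdf)$");

@Override
public boolean shouldVisit(Page referringPage, WebURL url) {
    String href = url.getURL().toLowerCase();
    // stay on one domain (placeholder) and skip binary file extensions
    return href.startsWith("https://www.example.com/")
            && !BINARY_EXTENSIONS.matcher(href).matches();
}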
4. Page: represents a page and stores the page's related information.
5. CrawlController: the crawler controller. This is the master controller that starts the crawlers and monitors their state. Its constructor requires a CrawlConfig, a PageFetcher and a RobotstxtServer. Seeds (the very first pages the crawler fetches) are added with addSeed(String); multiple seeds can be added. Crawling is then started with the start method, which takes the Class object of a class extending WebCrawler and the number of crawlers to launch. Let's look at this start method:
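Inside visit(page), the Page object above is where everything about a fetched page ends up. As another sketch for the hypothetical MyCrawler (it also needs imports for Page and HtmlParseData), we can check the parse data type and pull out the URL, text, HTML and outgoing links:

@Override
public void visit(Page page) {
    String url = page.getWebURL().getURL();
    logger.info("Visited: {} (status {})", url, page.getStatusCode());

    // binary pages carry a different ParseData type, so check before casting
    if (page.getParseData() instanceof HtmlParseData) {
        HtmlParseData htmlParseData = (HtmlParseData) page.getParseData();
        String text = htmlParseData.getText();   // plain text of the page
        String html = htmlParseData.getHtml();   // raw HTML
        int outgoingLinks = htmlParseData.getOutgoingUrls().size();
        logger.info("Text length: {}, HTML length: {}, outgoing links: {}",
                text.length(), html.length(), outgoingLinks);
    }
}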
for (int i = 1; i <= numberOfCrawlers; i++) {          // create the crawlers
    T crawler = crawlerFactory.newInstance();
    Thread thread = new Thread(crawler, "Crawler " + i);
    crawler.setThread(thread);
    crawler.init(i, this);
    thread.start();
    crawlers.add(crawler);
    threads.add(thread);
    logger.info("Crawler {} started", i);
}

// next, start a monitor thread
Thread monitorThread = new Thread(new Runnable() {
    @Override
    public void run() {
        try {
            synchronized (waitingLock) {
                while (true) {
                    sleep(10);
                    boolean someoneIsWorking = false;
                    for (int i = 0; i < threads.size(); i++) {   // check each crawler
                        Thread thread = threads.get(i);
                        if (!thread.isAlive()) {
                            if (!shuttingDown) {                 // recreate the dead crawler
                                logger.info("Thread {} was dead, I'll recreate it", i);
                                T crawler = crawlerFactory.newInstance();
                                thread = new Thread(crawler, "Crawler " + (i + 1));
                                threads.remove(i);
                                threads.add(i, thread);
                                crawler.setThread(thread);
                                crawler.init(i + 1, controller);
                                thread.start();
                                crawlers.remove(i);
                                crawlers.add(i, crawler);
                            }
                        } else if (crawlers.get(i).isNotWaitingForNewURLs()) {
                            someoneIsWorking = true;
                        }
                    }
                    boolean shut_on_empty = config.isShutdownOnEmptyQueue();
                    // shut down when no crawler is working and the queue is empty
                    if (!someoneIsWorking && shut_on_empty) {
                        // Make sure again that none of the threads are alive.
                        logger.info("It looks like no thread is working, waiting for 10 seconds to make sure...");
                        sleep(10);
                        someoneIsWorking = false;
                        // check every crawler thread again
                        for (int i = 0; i < threads.size(); i++) {
                            Thread thread = threads.get(i);
                            if (thread.isAlive() && crawlers.get(i).isNotWaitingForNewURLs()) {
                                someoneIsWorking = true;
                            }
                        }
                        if (!someoneIsWorking) {
                            if (!shuttingDown) {
                                // there are still pages in the queue waiting to be crawled
                                long queueLength = frontier.getQueueLength();
                                if (queueLength > 0) {
                                    continue;
                                }
                                logger.info(
                                    "No thread is working and no more URLs are in queue waiting for another 10 seconds to make " +
                                    "sure...");
                                sleep(10);
                                // checked once more: the double check prevents a false "finished" state
                                queueLength = frontier.getQueueLength();
                                if (queueLength > 0) {
                                    continue;
                                }
                            }
                            // all crawlers have finished; close the services
                            logger.info("All of the crawlers are stopped. Finishing the process...");
                            frontier.finish();
                            for (T crawler : crawlers) {
                                crawler.onBeforeExit();
                                crawlersLocalData.add(crawler.getMyLocalData());
                            }
                            logger.info("Waiting for 10 seconds before final clean up...");
                            sleep(10);
                            frontier.close();
                            docIdServer.close();
                            pageFetcher.shutDown();
                            finished = true;
                            waitingLock.notifyAll();
                            env.close();
                            return;
                        }
                    }
                }
            }
        } catch (Exception e) {
            logger.error("Unexpected Error", e);
        }
    }
});
monitorThread.start();
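Putting the pieces together, a typical way to wire up and start the controller looks roughly like this. It follows the usage shown in crawler4j's documentation; the seed URL, storage folder and crawler count are placeholders, and MyCrawler is the hypothetical subclass from the earlier sketches.

import edu.uci.ics.crawler4j.crawler.CrawlConfig;
import edu.uci.ics.crawler4j.crawler.CrawlController;
import edu.uci.ics.crawler4j.fetcher.PageFetcher;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtConfig;
import edu.uci.ics.crawler4j.robotstxt.RobotstxtServer;

public class CrawlerLauncher {
    public static void main(String[] args) throws Exception {
        CrawlConfig config = new CrawlConfig();
        config.setCrawlStorageFolder("/tmp/crawl-data");   // placeholder path

        // the controller needs a CrawlConfig, a PageFetcher and a RobotstxtServer
        PageFetcher pageFetcher = new PageFetcher(config);
        RobotstxtConfig robotstxtConfig = new RobotstxtConfig();
        RobotstxtServer robotstxtServer = new RobotstxtServer(robotstxtConfig, pageFetcher);
        CrawlController controller = new CrawlController(config, pageFetcher, robotstxtServer);

        // seed pages: the very first URLs the crawlers will fetch
        controller.addSeed("https://www.example.com/");

        // blocks until the monitor thread decides all crawlers are done
        controller.start(MyCrawler.class, 4);
    }
}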
There are still many details I haven't analyzed. I can't help admiring the author; just reading the code is impressive. Still, I hope to learn something from it.