This crawler is a small example from when I was learning crawling techniques about half a month ago. It is fairly simple, and since I am afraid I will forget it over time, here is a brief summary. The main external jar packages used are HttpClient 4.3.4 and HtmlParser 2.1, the IDE is IntelliJ IDEA 13.1, and the jar packages are managed with Maven. Readers who are not used to IntelliJ can also create a new project in Eclipse instead.
Taking the term "web crawler" apart: the "web" refers to the Internet, which is like a spider web, and the "crawler" is like a spider that can crawl all over it; the data it crawls back is then processed further.
The encyclopedia definition: a web crawler (also known as a web spider or web robot, and in the FOAF community more often called a web chaser) is a program or script that automatically fetches information from the World Wide Web according to certain rules. Other, less commonly used names include ant, automatic indexer, emulator, and worm.
Basic principle: a traditional crawler starts from the URLs of one or several initial (seed) pages, obtains the URLs on those pages, and while fetching pages keeps extracting new URLs from the current page and putting them into a queue, until a certain stop condition of the system is met, as shown in the flow chart. The workflow of a focused crawler is more complex: it uses a page analysis algorithm to filter out links unrelated to the topic, keeps the useful links, and puts them into the queue of URLs waiting to be fetched. It then selects the next page URL to fetch from the queue according to a certain search strategy, and repeats the above process until some system condition is reached.
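Before looking at the concrete classes, that loop can be summarized in a short sketch. The class and helper names here (CrawlerLoopSketch, fetch, extractUrls, MAX_PAGES) are placeholders invented for illustration, not the classes introduced later in this post:

import java.util.*;

// Minimal outline of the traditional crawler loop described above.
// fetch() and extractUrls() are stand-ins for downloading a page and
// parsing URLs out of it; the real implementations come later in the post.
public class CrawlerLoopSketch {
    static final int MAX_PAGES = 100;

    public static void crawl(List<String> seedUrls) {
        Queue<String> frontier = new LinkedList<String>(seedUrls); // URLs waiting to be fetched
        Set<String> visited = new HashSet<String>();               // URLs already fetched
        while (!frontier.isEmpty() && visited.size() < MAX_PAGES) { // stop condition
            String url = frontier.poll();
            String page = fetch(url);                              // download the page
            visited.add(url);
            for (String link : extractUrls(page)) {                // extract new URLs from the page
                if (!visited.contains(link) && !frontier.contains(link)) {
                    frontier.add(link);                            // enqueue only unseen URLs
                }
            }
        }
    }

    // Placeholder: a real crawler would issue an HTTP request here.
    static String fetch(String url) { return ""; }

    // Placeholder: a real crawler would parse the HTML here.
    static List<String> extractUrls(String page) { return Collections.<String>emptyList(); }
}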
Page crawling strategies can be divided into three kinds: depth-first, breadth-first, and best-first. Depth-first often causes the crawler to become trapped, so the commonly used methods today are breadth-first and best-first.
2.1 Breadth-First
Breadth-first traversal is a traversal strategy for connected graphs. Its idea is to start from a vertex V0 and first traverse the wider area around it, radiating outward, hence the name.
Its basic idea is shown in the figure below:
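In addition to the figure, the breadth-first idea can be sketched in code. The adjacency-list graph and vertex names below are made up for illustration and are not taken from the figure:

import java.util.*;

// Breadth-first traversal of a graph stored as an adjacency list.
public class BfsSketch {

    public static List<String> bfs(Map<String, List<String>> graph, String start) {
        List<String> order = new ArrayList<String>();
        Set<String> seen = new HashSet<String>();
        Queue<String> queue = new LinkedList<String>();
        queue.add(start);
        seen.add(start);
        while (!queue.isEmpty()) {
            String v = queue.poll();          // visit the oldest discovered vertex first
            order.add(v);
            List<String> neighbours = graph.get(v);
            if (neighbours == null) continue;
            for (String w : neighbours) {
                if (seen.add(w)) {            // each vertex is enqueued at most once
                    queue.add(w);
                }
            }
        }
        return order;
    }

    public static void main(String[] args) {
        Map<String, List<String>> g = new HashMap<String, List<String>>();
        g.put("V0", Arrays.asList("V1", "V2"));
        g.put("V1", Arrays.asList("V3"));
        g.put("V2", Arrays.asList("V3"));
        System.out.println(bfs(g, "V0"));     // prints [V0, V1, V2, V3]
    }
}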
2.2 Depth-First
The following uses a directed graph and an undirected graph as examples:
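Complementing those figures, here is a small recursive depth-first sketch; the example graph is arbitrary and only for illustration:

import java.util.*;

// Depth-first traversal: follow one branch as far as possible before backtracking.
public class DfsSketch {

    public static void dfs(Map<String, List<String>> graph, String v, Set<String> seen, List<String> order) {
        seen.add(v);
        order.add(v);
        List<String> neighbours = graph.get(v);
        if (neighbours == null) return;
        for (String w : neighbours) {
            if (!seen.contains(w)) {
                dfs(graph, w, seen, order);   // dive into the branch before trying siblings
            }
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> g = new HashMap<String, List<String>>();
        g.put("V0", Arrays.asList("V1", "V2"));
        g.put("V1", Arrays.asList("V3"));
        List<String> order = new ArrayList<String>();
        dfs(g, "V0", new HashSet<String>(), order);
        System.out.println(order);            // prints [V0, V1, V3, V2]
    }
}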
The difference between breadth-first and depth-first:
Breadth-first traversal proceeds level by level: only after all nodes on one level have been searched does it move down to the next level. Depth-first traversal searches all nodes along one branch before turning to search the nodes of another branch.
2.3 Best-First Search
The best-first search strategy uses a page analysis algorithm to predict the similarity of candidate URLs to the target page, or their relevance to the topic, and selects the one or several best-scoring URLs to fetch. It only visits pages that the page analysis algorithm predicts to be "useful". This kind of search is well suited to crawling deep-web data, fetching only content that matches the requirements.
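A minimal sketch of the idea: keep the frontier in a priority queue ordered by a predicted relevance score. The score() function below is a made-up stand-in for a real page analysis or similarity model, and the URLs are illustrative:

import java.util.*;

// Best-first frontier: URLs are dequeued in order of predicted relevance,
// not in discovery order.
public class BestFirstSketch {

    static class Candidate {
        final String url;
        final double score;
        Candidate(String url, double score) { this.url = url; this.score = score; }
    }

    // Toy scoring rule: prefer URLs that mention the topic keyword.
    static double score(String url, String topic) {
        return url.contains(topic) ? 1.0 : 0.1;
    }

    public static void main(String[] args) {
        PriorityQueue<Candidate> frontier = new PriorityQueue<Candidate>(16, new Comparator<Candidate>() {
            public int compare(Candidate a, Candidate b) {
                return Double.compare(b.score, a.score);  // highest score first
            }
        });
        String topic = "news";
        for (String url : Arrays.asList("http://example.com/about",
                                        "http://example.com/news/1",
                                        "http://example.com/contact")) {
            frontier.add(new Candidate(url, score(url, topic)));
        }
        while (!frontier.isEmpty()) {
            System.out.println(frontier.poll().url);      // the "best" URL would be fetched first
        }
    }
}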
The example in this article crawls news-type information. For news sites, the important and most recent items are usually placed on the front page, and the importance of pages buried deeper in the link hierarchy generally decreases level by level, so the breadth-first algorithm is the better fit. The figure below shows the structure of the pages this article will crawl:
Here only 100 items are fetched, and every URL must start with news.fudan.edu.cn.
Import the external jar packages with Maven:
<dependency>
    <groupId>org.apache.httpcomponents</groupId>
    <artifactId>httpclient</artifactId>
    <version>4.3.4</version>
</dependency>
<dependency>
    <groupId>org.htmlparser</groupId>
    <artifactId>htmlparser</artifactId>
    <version>2.1</version>
</dependency>
Main entry point of the program:
package com.amos.crawl;

import java.util.Set;

/**
 * Created by amosli on 14-7-10.
 */
public class MyCrawler {

    /**
     * Initialize the URL queue with the seed URLs
     *
     * @param seeds
     */
    private void initCrawlerWithSeeds(String[] seeds) {
        for (int i = 0; i < seeds.length; i++) {
            LinkQueue.addUnvisitedUrl(seeds[i]);
        }
    }

    public void crawling(String[] seeds) {
        // Define the filter: only accept links starting with http://news.fudan.edu.cn
        LinkFilter filter = new LinkFilter() {
            @Override
            public boolean accept(String url) {
                if (url.startsWith("http://news.fudan.edu.cn")) {
                    return true;
                }
                return false;
            }
        };

        // Initialize the URL queue
        initCrawlerWithSeeds(seeds);

        int count = 0;
        // Loop condition: the queue of unvisited links is not empty and at most 100 pages have been fetched
        while (!LinkQueue.isUnvisitedUrlsEmpty() && LinkQueue.getVisitedUrlNum() <= 100) {
            System.out.println("count:" + (++count));
            // Dequeue the URL at the head of the queue
            String visitURL = (String) LinkQueue.unVisitedUrlDeQueue();
            DownLoadFile downloader = new DownLoadFile();
            // Download the page
            downloader.downloadFile(visitURL);
            // Add this URL to the visited set
            LinkQueue.addVisitedUrl(visitURL);
            // Extract the URLs from the downloaded page
            Set<String> links = HtmlParserTool.extractLinks(visitURL, filter);
            // Enqueue the new, unvisited URLs
            for (String link : links) {
                System.out.println("link:" + link);
                LinkQueue.addUnvisitedUrl(link);
            }
        }
    }

    public static void main(String args[]) {
        // Program entry
        MyCrawler myCrawler = new MyCrawler();
        myCrawler.crawling(new String[]{"http://news.fudan.edu.cn/news/"});
    }
}
Utility class: Tools.java
package com.amos.tool;

import java.io.*;
import java.net.URI;
import java.net.URISyntaxException;
import java.net.UnknownHostException;
import java.security.KeyManagementException;
import java.security.KeyStoreException;
import java.security.NoSuchAlgorithmException;
import java.security.cert.CertificateException;
import java.security.cert.X509Certificate;
import java.util.Locale;

import javax.net.ssl.SSLContext;
import javax.net.ssl.SSLException;

import org.apache.http.*;
import org.apache.http.client.CircularRedirectException;
import org.apache.http.client.CookieStore;
import org.apache.http.client.HttpRequestRetryHandler;
import org.apache.http.client.RedirectStrategy;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.methods.HttpHead;
import org.apache.http.client.methods.HttpUriRequest;
import org.apache.http.client.methods.RequestBuilder;
import org.apache.http.client.protocol.HttpClientContext;
import org.apache.http.client.utils.URIBuilder;
import org.apache.http.client.utils.URIUtils;
import org.apache.http.conn.ConnectTimeoutException;
import org.apache.http.conn.HttpClientConnectionManager;
import org.apache.http.conn.ssl.SSLConnectionSocketFactory;
import org.apache.http.conn.ssl.SSLContextBuilder;
import org.apache.http.conn.ssl.TrustStrategy;
import org.apache.http.cookie.Cookie;
import org.apache.http.impl.client.*;
import org.apache.http.impl.conn.BasicHttpClientConnectionManager;
import org.apache.http.impl.cookie.BasicClientCookie;
import org.apache.http.protocol.HttpContext;
import org.apache.http.util.Args;
import org.apache.http.util.Asserts;
import org.apache.http.util.TextUtils;

/**
 * Created by amosli on 14-6-25.
 */
public class Tools {

    /**
     * Write the response entity to a local file
     *
     * @param httpEntity
     * @param filename
     */
    public static void saveToLocal(HttpEntity httpEntity, String filename) {
        try {
            File dir = new File(Configuration.FILEDIR);
            if (!dir.isDirectory()) {
                dir.mkdir();
            }
            File file = new File(dir.getAbsolutePath() + "/" + filename);
            FileOutputStream fileOutputStream = new FileOutputStream(file);
            InputStream inputStream = httpEntity.getContent();
            byte[] bytes = new byte[1024];
            int length = 0;
            while ((length = inputStream.read(bytes)) > 0) {
                fileOutputStream.write(bytes, 0, length);
            }
            inputStream.close();
            fileOutputStream.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    /**
     * Write a byte array to a local file
     *
     * @param bytes
     * @param filename
     */
    public static void saveToLocalByBytes(byte[] bytes, String filename) {
        try {
            File dir = new File(Configuration.FILEDIR);
            if (!dir.isDirectory()) {
                dir.mkdir();
            }
            File file = new File(dir.getAbsolutePath() + "/" + filename);
            FileOutputStream fileOutputStream = new FileOutputStream(file);
            fileOutputStream.write(bytes);
            //fileOutputStream.write(bytes, 0, bytes.length);
            fileOutputStream.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    /**
     * Print to standard output
     * @param string
     */
    public static void println(String string) {
        System.out.println("string:" + string);
    }

    /**
     * Print to standard error
     * @param string
     */
    public static void printlnerr(String string) {
        System.err.println("string:" + string);
    }

    /**
     * Create an HttpClient that uses an SSL channel and a request retry handler
     * @return
     */
    public static CloseableHttpClient createSSLClientDefault() {
        try {
            SSLContext sslContext = new SSLContextBuilder().loadTrustMaterial(null, new TrustStrategy() {
                // Trust all certificates
                public boolean isTrusted(X509Certificate[] chain, String authType) throws CertificateException {
                    return true;
                }
            }).build();
            SSLConnectionSocketFactory sslsf = new SSLConnectionSocketFactory(sslContext);

            // Retry handler: a failed request is retried at most 5 times
            HttpRequestRetryHandler retryHandler = new HttpRequestRetryHandler() {
                @Override
                public boolean retryRequest(IOException exception, int executionCount, HttpContext context) {
                    if (executionCount >= 5) {
                        // Do not retry if over max retry count
                        return false;
                    }
                    if (exception instanceof InterruptedIOException) {
                        // Timeout
                        return false;
                    }
                    if (exception instanceof UnknownHostException) {
                        // Unknown host
                        return false;
                    }
                    if (exception instanceof ConnectTimeoutException) {
                        // Connection refused
                        return false;
                    }
                    if (exception instanceof SSLException) {
                        // SSL handshake exception
                        return false;
                    }
                    HttpClientContext clientContext = HttpClientContext.adapt(context);
                    HttpRequest request = clientContext.getRequest();
                    boolean idempotent = !(request instanceof HttpEntityEnclosingRequest);
                    if (idempotent) {
                        // Retry if the request is considered idempotent
                        return true;
                    }
                    return false;
                }
            };

            // Request configuration: connection-request timeout and connect timeout of 20 seconds each,
            // circular redirects not allowed
            RequestConfig requestConfig = RequestConfig.custom()
                    .setConnectionRequestTimeout(20000).setConnectTimeout(20000)
                    .setCircularRedirectsAllowed(false)
                    .build();

            return HttpClients.custom().setSSLSocketFactory(sslsf)
                    .setUserAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36")
                    // 25 connections per route, 256 connections in total
                    .setMaxConnPerRoute(25).setMaxConnTotal(256)
                    .setRetryHandler(retryHandler)
                    .setRedirectStrategy(new SelfRedirectStrategy())
                    .setDefaultRequestConfig(requestConfig)
                    .build();
        } catch (KeyManagementException e) {
            e.printStackTrace();
        } catch (NoSuchAlgorithmException e) {
            e.printStackTrace();
        } catch (KeyStoreException e) {
            e.printStackTrace();
        }
        return HttpClients.createDefault();
    }

    /**
     * Same as above, but with a CookieStore
     * @param cookieStore
     * @return
     */
    public static CloseableHttpClient createSSLClientDefaultWithCookie(CookieStore cookieStore) {
        try {
            SSLContext sslContext = new SSLContextBuilder().loadTrustMaterial(null, new TrustStrategy() {
                // Trust all certificates
                public boolean isTrusted(X509Certificate[] chain, String authType) throws CertificateException {
                    return true;
                }
            }).build();
            SSLConnectionSocketFactory sslsf = new SSLConnectionSocketFactory(sslContext);

            // Retry handler: a failed request is retried at most 5 times
            HttpRequestRetryHandler retryHandler = new HttpRequestRetryHandler() {
                @Override
                public boolean retryRequest(IOException exception, int executionCount, HttpContext context) {
                    if (executionCount >= 5) {
                        // Do not retry if over max retry count
                        return false;
                    }
                    if (exception instanceof InterruptedIOException) {
                        // Timeout
                        return false;
                    }
                    if (exception instanceof UnknownHostException) {
                        // Unknown host
                        return false;
                    }
                    if (exception instanceof ConnectTimeoutException) {
                        // Connection refused
                        return false;
                    }
                    if (exception instanceof SSLException) {
                        // SSL handshake exception
                        return false;
                    }
                    HttpClientContext clientContext = HttpClientContext.adapt(context);
                    HttpRequest request = clientContext.getRequest();
                    boolean idempotent = !(request instanceof HttpEntityEnclosingRequest);
                    if (idempotent) {
                        // Retry if the request is considered idempotent
                        return true;
                    }
                    return false;
                }
            };

            // Request configuration: connection-request timeout and connect timeout of 20 seconds each,
            // circular redirects not allowed
            RequestConfig requestConfig = RequestConfig.custom()
                    .setConnectionRequestTimeout(20000).setConnectTimeout(20000)
                    .setCircularRedirectsAllowed(false)
                    .build();

            return HttpClients.custom().setSSLSocketFactory(sslsf)
                    .setUserAgent("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36")
                    // 25 connections per route, 256 connections in total
                    .setMaxConnPerRoute(25).setMaxConnTotal(256)
                    .setRetryHandler(retryHandler)
                    .setRedirectStrategy(new SelfRedirectStrategy())
                    .setDefaultRequestConfig(requestConfig)
                    .setDefaultCookieStore(cookieStore)
                    .build();
        } catch (KeyManagementException e) {
            e.printStackTrace();
        } catch (NoSuchAlgorithmException e) {
            e.printStackTrace();
        } catch (KeyStoreException e) {
            e.printStackTrace();
        }
        return HttpClients.createDefault();
    }
}
The download class that writes pages to local disk: DownLoadFile.java
package com.amos.crawl;

import com.amos.tool.Configuration;
import com.amos.tool.Tools;
import org.apache.http.*;
import org.apache.http.client.HttpClient;
import org.apache.http.client.HttpRequestRetryHandler;
import org.apache.http.client.config.RequestConfig;
import org.apache.http.client.methods.HttpGet;
import org.apache.http.client.protocol.HttpClientContext;
import org.apache.http.conn.ClientConnectionManager;
import org.apache.http.conn.ConnectTimeoutException;
import org.apache.http.impl.client.AutoRetryHttpClient;
import org.apache.http.impl.client.DefaultHttpClient;
import org.apache.http.protocol.HttpContext;

import javax.net.ssl.SSLException;
import java.io.*;
import java.net.UnknownHostException;

/**
 * Created by amosli on 14-7-9.
 */
public class DownLoadFile {

    public String getFileNameByUrl(String url, String contentType) {
        // Strip the http:// (7 chars) or https:// (8 chars) prefix
        url = url.contains("http://") ? url.substring(7) : url.substring(8);
        // text/html type
        if (url.contains(".html")) {
            url = url.replaceAll("[\\?/:*|<>\"]", "_");
        } else if (contentType.indexOf("html") != -1) {
            url = url.replaceAll("[\\?/:*|<>\"]", "_") + ".html";
        } else {
            url = url.replaceAll("[\\?/:*|<>\"]", "_") + "." + contentType.substring(contentType.lastIndexOf("/") + 1);
        }
        return url;
    }

    /**
     * Write the page data to local disk
     * @param data
     * @param filePath
     */
    private void saveToLocal(byte[] data, String filePath) {
        try {
            DataOutputStream out = new DataOutputStream(new FileOutputStream(new File(filePath)));
            for (int i = 0; i < data.length; i++) {
                out.write(data[i]);
            }
            out.flush();
            out.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    /**
     * Write the response entity to a local file
     *
     * @param httpEntity
     * @param filename
     */
    public static void saveToLocal(HttpEntity httpEntity, String filename) {
        try {
            File dir = new File(Configuration.FILEDIR);
            if (!dir.isDirectory()) {
                dir.mkdir();
            }
            File file = new File(dir.getAbsolutePath() + "/" + filename);
            FileOutputStream fileOutputStream = new FileOutputStream(file);
            InputStream inputStream = httpEntity.getContent();
            if (!file.exists()) {
                file.createNewFile();
            }
            byte[] bytes = new byte[1024];
            int length = 0;
            while ((length = inputStream.read(bytes)) > 0) {
                fileOutputStream.write(bytes, 0, length);
            }
            inputStream.close();
            fileOutputStream.close();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    public String downloadFile(String url) {
        // File path
        String filePath = null;
        // 1. Create an HttpClient object and set its parameters
        HttpClient httpClient = Tools.createSSLClientDefault();
        // 2. Create an HttpGet object and set its parameters
        HttpGet httpGet = new HttpGet(url);
        // Set the connect timeout of the GET request to 5s
        // Option 1
        //httpGet.getParams().setParameter("connectTimeout", 5000);
        // Option 2
        RequestConfig requestConfig = RequestConfig.custom().setConnectTimeout(5000).build();
        httpGet.setConfig(requestConfig);
        try {
            HttpResponse httpResponse = httpClient.execute(httpGet);
            int statusCode = httpResponse.getStatusLine().getStatusCode();
            if (statusCode != HttpStatus.SC_OK) {
                System.err.println("Method failed:" + httpResponse.getStatusLine());
                // Do not save pages that did not return 200 OK
                return null;
            }
            filePath = getFileNameByUrl(url, httpResponse.getEntity().getContentType().getValue());
            saveToLocal(httpResponse.getEntity(), filePath);
        } catch (Exception e) {
            e.printStackTrace();
        }
        return filePath;
    }

    public static void main(String args[]) throws IOException {
        String url = "http://websearch.fudan.edu.cn/search_dep.html";
        HttpClient httpClient = new DefaultHttpClient();
        HttpGet httpGet = new HttpGet(url);
        HttpResponse httpResponse = httpClient.execute(httpGet);
        Header contentType = httpResponse.getEntity().getContentType();
        System.out.println("name:" + contentType.getName() + "value:" + contentType.getValue());
        System.out.println(new DownLoadFile().getFileNameByUrl(url, contentType.getValue()));
    }
}
Create a filter interface: LinkFilter.java
package com.amos.crawl;

/**
 * Created by amosli on 14-7-10.
 */
public interface LinkFilter {
    public boolean accept(String url);
}
Filtering URLs with HtmlParser: HtmlParserTool.java
package com.amos.crawl;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.filters.NodeClassFilter;
import org.htmlparser.filters.OrFilter;
import org.htmlparser.tags.LinkTag;
import org.htmlparser.util.NodeList;

import java.util.HashSet;
import java.util.Set;

/**
 * Created by amosli on 14-7-10.
 */
public class HtmlParserTool {

    public static Set<String> extractLinks(String url, LinkFilter filter) {
        Set<String> links = new HashSet<String>();
        try {
            Parser parser = new Parser(url);
            parser.setEncoding("GBK");
            // Filter for <frame> tags, used to extract the src attribute of the frame tag
            NodeFilter framFilter = new NodeFilter() {
                @Override
                public boolean accept(Node node) {
                    if (node.getText().contains("frame src=")) {
                        return true;
                    } else {
                        return false;
                    }
                }
            };
            // OrFilter that matches both <a> tags and <frame> tags
            OrFilter linkFilter = new OrFilter(new NodeClassFilter(LinkTag.class), framFilter);
            // Get all tags that pass the filter
            NodeList list = parser.extractAllNodesThatMatch(linkFilter);
            for (int i = 0; i < list.size(); i++) {
                Node tag = list.elementAt(i);
                if (tag instanceof LinkTag) {
                    tag = (LinkTag) tag;
                    String linkURL = ((LinkTag) tag).getLink();
                    // Add the URL if it satisfies the filter condition
                    if (filter.accept(linkURL)) {
                        links.add(linkURL);
                    }
                } else {
                    // <frame> tag: extract the link in the src attribute, e.g. <frame src="test.html" />
                    String frame = tag.getText();
                    int start = frame.indexOf("src=");
                    frame = frame.substring(start);
                    int end = frame.indexOf(" ");
                    if (end == -1) {
                        end = frame.indexOf(">");
                    }
                    String frameUrl = frame.substring(5, end - 1);
                    if (filter.accept(frameUrl)) {
                        links.add(frameUrl);
                    }
                }
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
        return links;
    }
}
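For reference, a small hypothetical way to try this class on its own, reusing the LinkFilter from this post with the same seed URL as MyCrawler (the printed links depend on the live page):

package com.amos.crawl;

import java.util.Set;

// Quick standalone test of HtmlParserTool.extractLinks(): print every link on
// the seed page whose URL starts with http://news.fudan.edu.cn.
public class HtmlParserToolTest {
    public static void main(String[] args) {
        LinkFilter filter = new LinkFilter() {
            @Override
            public boolean accept(String url) {
                return url.startsWith("http://news.fudan.edu.cn");
            }
        };
        Set<String> links = HtmlParserTool.extractLinks("http://news.fudan.edu.cn/news/", filter);
        for (String link : links) {
            System.out.println(link);
        }
    }
}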
Queue implementation for managing page URLs: Queue.java
package com.amos.crawl;

import java.util.LinkedList;

/**
 * Created by amosli on 14-7-9.
 */
public class Queue {

    // The queue is implemented with a linked list
    private LinkedList queueList = new LinkedList();

    // Enqueue
    public void enQueue(Object object) {
        queueList.addLast(object);
    }

    // Dequeue
    public Object deQueue() {
        return queueList.removeFirst();
    }

    // Whether the queue is empty
    public boolean isQueueEmpty() {
        return queueList.isEmpty();
    }

    // Whether the queue contains the given element
    public boolean contains(Object object) {
        return queueList.contains(object);
    }

    // Whether the queue is empty
    public boolean empty() {
        return queueList.isEmpty();
    }
}
Managing the enqueuing and dequeuing of page links: LinkQueue.java
package com.amos.crawl;

import java.util.HashSet;
import java.util.Set;

/**
 * Created by amosli on 14-7-9.
 */
public class LinkQueue {

    // Set of URLs that have already been visited
    private static Set visitedUrl = new HashSet();

    // Queue of URLs that have not been visited yet
    private static Queue unVisitedUrl = new Queue();

    // Get the unvisited URL queue
    public static Queue getUnVisitedUrl() {
        return unVisitedUrl;
    }

    public static Set getVisitedUrl() {
        return visitedUrl;
    }

    // Add a URL to the visited set
    public static void addVisitedUrl(String url) {
        visitedUrl.add(url);
    }

    // Remove a URL from the visited set
    public static void removeVisitedUrl(String url) {
        visitedUrl.remove(url);
    }

    // Dequeue an unvisited URL
    public static Object unVisitedUrlDeQueue() {
        return unVisitedUrl.deQueue();
    }

    // Ensure each URL is visited only once: the url must not be empty, must not be in the visited set,
    // and must not already be waiting in the unvisited queue
    public static void addUnvisitedUrl(String url) {
        if (url != null && !url.trim().equals("") && !visitedUrl.contains(url) && !unVisitedUrl.contains(url))
            unVisitedUrl.enQueue(url);
    }

    // Get the number of URLs that have already been visited
    public static int getVisitedUrlNum() {
        return visitedUrl.size();
    }

    // Whether the unvisited URL queue is empty
    public static boolean isUnvisitedUrlsEmpty() {
        return unVisitedUrl.empty();
    }
}
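A small sketch, only for illustration and not part of the original code, of how the dedup rule in addUnvisitedUrl() behaves (the sample URL is made up):

package com.amos.crawl;

// Illustration of LinkQueue's dedup rule: a URL is enqueued only if it has
// never been visited and is not already waiting in the unvisited queue.
public class LinkQueueDemo {
    public static void main(String[] args) {
        LinkQueue.addUnvisitedUrl("http://news.fudan.edu.cn/a.html");
        LinkQueue.addUnvisitedUrl("http://news.fudan.edu.cn/a.html"); // ignored: already queued
        String url = (String) LinkQueue.unVisitedUrlDeQueue();
        LinkQueue.addVisitedUrl(url);
        LinkQueue.addUnvisitedUrl(url);                               // ignored: already visited
        System.out.println(LinkQueue.isUnvisitedUrlsEmpty());         // true
        System.out.println(LinkQueue.getVisitedUrlNum());             // 1
    }
}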
The crawling approach is: first give the seed URLs to crawl ==> find the URLs that meet the condition and add them to the queue ==> take URLs out of the queue in order, visit them, and extract further qualifying URLs at the same time ==> download the pages for the URLs in the queue. In other words, explore level by level, limited to at most 100 items.