Web Crawler (Java), Backed by a Balanced Binary Tree

     This project has been in the works on and off for about a month, paused several times by various technical problems. The most serious one: the storage container was a plain binary sort tree, and when many nodes carried equal weights the tree height kept growing, traversal slowed down badly, and recursive traversal even overflowed the stack. Dropping the recursive traversal and upgrading the plain binary sort tree to a balanced binary tree (AVL tree) finally solved these problems. Along the way I re-studied stacks, queues, linked lists, HashMap, and Hashtable; I have not yet read TreeMap, which is built on a red-black tree, and plan to study its source later.
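A quick sketch of that failure mode (hypothetical demo code, not the project's old container): in a plain binary sort tree that sends non-smaller keys to the right, a stream of equal-weight entries builds a chain, so the height grows linearly and a recursive walk of it can overflow the stack.

class DegenerateBstDemo {
    static final class Node {
        int key;
        Node left, right;
        Node(int key) { this.key = key; }
    }

    // Iterative insert; ties go to the right, as in many naive BSTs.
    static Node insert(Node root, int key) {
        Node fresh = new Node(key);
        if (root == null)
            return fresh;
        Node cur = root;
        while (true) {
            if (key < cur.key) {
                if (cur.left == null) { cur.left = fresh; return root; }
                cur = cur.left;
            } else {
                if (cur.right == null) { cur.right = fresh; return root; }
                cur = cur.right;
            }
        }
    }

    public static void main(String[] args) {
        Node root = null;
        for (int i = 0; i < 100000; i++)
            root = insert(root, 42); // every "weight" identical
        // The tree is now a right-leaning chain of height 100000; a recursive
        // descent would throw StackOverflowError long before reaching the end.
        int height = 0;
        for (Node n = root; n != null; n = n.right)
            height++;
        System.out.println("height = " + height); // prints 100000
    }
}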

      To keep the project extensible and easy to upgrade later, I also revisited the design patterns I had read before, and got a lot out of it; my grasp of programming ideas and principles moved a step further. The simple factory, factory method, and abstract factory patterns, which several earlier readings had never let me tell apart, finally clicked this time. Thinking about a design from the level of overall system structure turns out to be quite interesting.

Below is a brief walkthrough of the crawler implementation.

        Viewed as a complete system, a crawler touches a great deal of technology: link extraction, link deduplication, page-content deduplication (different URLs serving identical content), page downloading, fast domain-name resolution, and more. Each of these can grow into a large topic of its own, or even be split out as a small standalone project and deployed in a distributed fashion. The initial plan is to build the crawler's main modules with extension points left open, then flesh out each module step by step toward a reasonably complete large system. The project is on GitHub, and anyone interested in crawlers is welcome to extend it: https://github.com/PerkinsZhu/WebSprider

  The system structure is fairly simple; here is the class diagram:

 

 Here is what each class does:

     Page entity: PageEntity, which represents a URL. It has three main fields, title, url, and weight; users can subclass it to add more. The entity must implement the Comparable interface and its public int compareTo(PageEntity node) method, because the storage container calls this method to order two nodes when inserting.

package com.zpj.entity;

/**
 * @author PerKins Zhu
 * @date 2016-09-02 20:43:26
 * @version 1.1
 * 
 */
public class PageEntity implements Comparable<PageEntity> {

    private String title;
    private String url;
    private float weight;
    // Modulus used by hashCode(); caps the number of distinct hash values at 1,000,000,000.
    private int PRIME = 1000000000;

    public PageEntity(String title, String url, float weight) {
        super();
        this.title = title;
        this.url = url;
        this.weight = weight;
    }

    public String getTitle() {
        return title;
    }

    public void setTitle(String title) {
        this.title = title;
    }

    public String getUrl() {
        return url;
    }

    public void setUrl(String url) {
        this.url = url;
    }

    public float getWeight() {
        return weight;
    }

    public void setWeight(float weight) {
        this.weight = weight;
    }

    /**
     * Comparison priority: weight > title > url
     */
    @Override
    public int compareTo(PageEntity node) {
        if (this.weight == node.weight) {
            if (this.title.equals(node.title)) {
                if (this.url.equals(node.url)) {
                    return 0;
                } else {
                    return this.url.compareTo(node.url);
                }
            } else {
                return this.title.compareTo(node.title);
            }
        } else {
            return this.weight > node.weight ? 1 : -1;
        }
    }

    // Override hashCode.
    @Override
    public int hashCode() {
        // Object.hashCode() would give every instance a distinct code; deriving the
        // code from the url makes identical urls hash to the same value.
        int hash, i;
        for (hash = this.url.length(), i = 0; i < this.url.length(); ++i)
            hash = (hash << 4) ^ (hash >> 28) ^ this.url.charAt(i);
        if (hash < 0) {
            hash = -hash;
        }
        return (hash % PRIME);
    }

}
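One caveat: the class overrides hashCode() but not equals(). If PageEntity instances ever go into a HashSet or HashMap, an equals() keyed on url, mirroring the identity rule hashCode() already uses, would keep the two contracts consistent; a possible addition:

    @Override
    public boolean equals(Object obj) {
        if (this == obj)
            return true;
        if (!(obj instanceof PageEntity))
            return false;
        // Same identity rule as hashCode(): entities are equal iff their urls match.
        return this.url.equals(((PageEntity) obj).url);
    }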

 

 Node storage container: PageCollection. It stores entity objects; in this program it holds the PageEntity nodes waiting to be visited, plus the hashCodes of nodes already visited.

   The container is a balanced binary tree (AVL tree); if that is unfamiliar, see the companion article 數據結構—平衡二叉樹 (Data Structures: Balanced Binary Trees), which walks through the implementation in detail.

   Five methods are public:

              public Node<T> insert(T data): inserts a node
              public T getMaxNode(): returns the highest-weight node and removes it from the tree
              public T get(T data): looks up the given node
              public void inorderTraverse(): traverses all nodes in order
              public Node<T> remove(T data): removes the given node
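Before the implementation, a minimal usage sketch (hypothetical URLs and weights) showing that the container behaves like a max-priority queue ordered by compareTo, i.e. primarily by weight:

        PageCollection<PageEntity> queue = new PageCollection<PageEntity>();
        queue.insert(new PageEntity("a", "http://a.example/", 1f));
        queue.insert(new PageEntity("b", "http://b.example/", 3f));
        queue.insert(new PageEntity("c", "http://c.example/", 2f));
        while (!queue.isEmpty()) {
            PageEntity next = queue.getMaxNode(); // b (3.0), then c (2.0), then a (1.0)
            System.out.println(next.getWeight() + " " + next.getUrl());
        }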

package com.zpj.collection;

/**
 * @author PerKins Zhu
 * @date 2016-09-02 21:03:35
 * @version 1.1
 * 
 */

public class PageCollection<T extends Comparable<T>> {

    private Node<T> root;

    private static class Node<T> {
        Node<T> left;
        Node<T> right;
        T data;
        int height;

        public Node(Node<T> left, Node<T> right, T data) {
            this.left = left;
            this.right = right;
            this.data = data;
            this.height = 0;
        }
    }

    /**
     * Inserts the node only if the tree does not already contain it; existence
     * is decided by compareTo(T node) == 0.
     * 
     * @param data
     * @return
     */
    public Node<T> insert(T data) {
        return root = insert(data, root);
    }

    private Node<T> insert(T data, Node<T> node) {
        if (node == null)
            return new Node<T>(null, null, data);
        int compareResult = data.compareTo(node.data);
        if (compareResult > 0) {
            node.right = insert(data, node.right);
            if (getHeight(node.right) - getHeight(node.left) == 2) {
                int compareResult02 = data.compareTo(node.right.data);
                if (compareResult02 > 0)
                    node = rotateSingleLeft(node);
                else
                    node = rotateDoubleLeft(node);
            }
        } else if (compareResult < 0) {
            node.left = insert(data, node.left);
            if (getHeight(node.left) - getHeight(node.right) == 2) {
                int intcompareResult02 = data.compareTo(node.left.data);
                if (intcompareResult02 < 0)
                    node = rotateSingleRight(node);
                else
                    node = rotateDoubleRight(node);
            }
        }
        node.height = Math.max(getHeight(node.left), getHeight(node.right)) + 1;
        return node;
    }

    private Node<T> rotateSingleLeft(Node<T> node) {
        Node<T> rightNode = node.right;
        node.right = rightNode.left;
        rightNode.left = node;
        node.height = Math.max(getHeight(node.left), getHeight(node.right)) + 1;
        rightNode.height = Math.max(node.height, getHeight(rightNode.right)) + 1;
        return rightNode;
    }

    private Node<T> rotateSingleRight(Node<T> node) {
        Node<T> leftNode = node.left;
        node.left = leftNode.right;
        leftNode.right = node;
        node.height = Math.max(getHeight(node.left), getHeight(node.right)) + 1;
        leftNode.height = Math.max(getHeight(leftNode.left), node.height) + 1;
        return leftNode;
    }

    private Node<T> rotateDoubleLeft(Node<T> node) {
        node.right = rotateSingleRight(node.right);
        node = rotateSingleLeft(node);
        return node;
    }

    private Node<T> rotateDoubleRight(Node<T> node) {
        node.left = rotateSingleLeft(node.left);
        node = rotateSingleRight(node);
        return node;
    }

    private int getHeight(Node<T> node) {
        return node == null ? -1 : node.height;
    }

    public Node<T> remove(T data) {
        return root = remove(data, root);
    }

    private Node<T> remove(T data, Node<T> node) {
        if (node == null) {
            return null;
        }
        int compareResult = data.compareTo(node.data);
        if (compareResult == 0) {
            if (node.left != null && node.right != null) {
                int balance = getHeight(node.left) - getHeight(node.right);
                Node<T> temp = node;
                if (balance == -1) {
                    // Right subtree is taller: pull the successor's data up and
                    // reattach the adjusted right subtree (the return value must
                    // not be discarded, or the removed leaf stays linked).
                    temp.right = exChangeRightData(node, node.right);
                } else {
                    // Otherwise pull the predecessor's data up from the left.
                    temp.left = exChangeLeftData(node, node.left);
                }
                temp.height = Math.max(getHeight(temp.left), getHeight(temp.right)) + 1;
                return temp;
            } else {
                return node.left != null ? node.left : node.right;
            }
        } else if (compareResult > 0) {
            node.right = remove(data, node.right);
            node.height = Math.max(getHeight(node.left), getHeight(node.right)) + 1;
            if (getHeight(node.left) - getHeight(node.right) == 2) {
                Node<T> leftSon = node.left;
                // getHeight() is null-safe; a bare field access here could NPE.
                if (getHeight(leftSon.left) >= getHeight(leftSon.right)) {
                    node = rotateSingleRight(node);
                } else {
                    node = rotateDoubleRight(node);
                }
            }
            return node;
        } else {
            node.left = remove(data, node.left);
            node.height = Math.max(getHeight(node.left), getHeight(node.right)) + 1;
            // After deleting on the left it is the right side that can grow too tall.
            if (getHeight(node.right) - getHeight(node.left) == 2) {
                Node<T> rightSon = node.right;
                if (getHeight(rightSon.right) >= getHeight(rightSon.left)) {
                    node = rotateSingleLeft(node);
                } else {
                    node = rotateDoubleLeft(node);
                }
            }
            return node;
        }
    }

    private Node<T> exChangeLeftData(Node<T> node, Node<T> right) {
        if (right.right != null) {
            right.right = exChangeLeftData(node, right.right);
        } else {
            node.data = right.data;
            return right.left;
        }
        right.height = Math.max(getHeight(right.left), getHeight(right.right)) + 1;
        // Removing from the right spine can leave this subtree left-heavy;
        // rebalance the local subtree rather than the distant ancestor.
        if (getHeight(right.left) - getHeight(right.right) == 2) {
            Node<T> leftSon = right.left;
            if (getHeight(leftSon.left) >= getHeight(leftSon.right)) {
                return rotateSingleRight(right);
            } else {
                return rotateDoubleRight(right);
            }
        }
        return right;
    }

    private Node<T> exChangeRightData(Node<T> node, Node<T> left) {
        if (left.left != null) {
            left.left = exChangeRightData(node, left.left);
        } else {
            node.data = left.data;
            return left.right;
        }
        left.height = Math.max(getHeight(left.left), getHeight(left.right)) + 1;
        // Mirror case: the subtree can become right-heavy after deleting on the left.
        if (getHeight(left.right) - getHeight(left.left) == 2) {
            Node<T> rightSon = left.right;
            if (getHeight(rightSon.right) >= getHeight(rightSon.left)) {
                return rotateSingleLeft(left);
            } else {
                return rotateDoubleLeft(left);
            }
        }
        return left;
    }

    public void inorderTraverse() {
        // Guard against an empty tree; the recursive helper assumes a non-null node.
        if (root != null)
            inorderTraverseData(root);
    }

    private void inorderTraverseData(Node<T> node) {
        if (node.left != null) {
            inorderTraverseData(node.left);
        }
        System.out.print(node.data + "、");
        if (node.right != null) {
            inorderTraverseData(node.right);
        }
    }

    public boolean isEmpty() {
        return root == null;
    }

    // Returns the highest-weight node and deletes it from the tree.
    public T getMaxNode() {
        if (root != null)
            root = getMaxNode(root);
        else
            return null;
        return maxNode.data;
    }

    private Node<T> maxNode;

    private Node<T> getMaxNode(Node<T> node) {
        if (node.right == null) {
            maxNode = node;
            return node.left;
        }
        node.right = getMaxNode(node.right);
        node.height = Math.max(getHeight(node.left), getHeight(node.right)) + 1;
        if (getHeight(node.left) - getHeight(node.right) == 2) {
            Node<T> leftSon = node.left;
            if (isDoubleRotate(leftSon.left, leftSon.right)) {
                node = rotateDoubleRight(node);
            } else {
                node = rotateSingleRight(node);
            }
        }
        return node;
    }

    // Decides from subtree heights whether a double rotation is needed. node01 is
    // the outer child, node02 the inner child (given that the grandparent already
    // needs rebalancing).
    private boolean isDoubleRotate(Node<T> node01, Node<T> node02) {
        // No inner child: a single rotation suffices.
        if (node02 == null)
            return false;
        // If the outer child is null the inner subtree must have height 2; and
        // whenever the outer side is shorter than the inner side, two rotations
        // are required.
        if (node01 == null || node01.height < node02.height)
            return true;
        return false;
    }

    /**
     * Looks up the given node.
     * @param data
     * @return
     */
    public T get(T data) {
        if (root == null)
            return null;
        return get(data, root);
    }

    private T get(T data, Node<T> node) {
        if (node == null)
            return null;
        int temp = data.compareTo(node.data);
        if (temp == 0) {
            return node.data;
        } else if (temp > 0) {
            return get(data, node.right);
        } else if (temp < 0) {
            return get(data, node.left);
        }
        return null;
    }

}

 

Crawler core class: WebSprider. It implements Runnable, so crawls can run on their own threads.

 The class keeps two containers: PageCollection<PageEntity> collection for nodes still to be crawled, and PageCollection<Integer> visitedCollection for nodes already crawled. visitedCollection stores only each visited node's hashCode; PageEntity overrides hashCode() so that the code is derived from the url. (Two distinct urls can still collide on one hashCode, in which case the later one is skipped; URL deduplication is listed among the improvements at the end.)

    To use it, a client subclasses WebSprider and implements the abstract method public abstract Object dealPageEntity(PageEntity entity);, which receives each crawled node and can print it, store it, and so on. WebSprider depends on HtmlParser.

package com.zpj.sprider;

import java.util.List;

import com.zpj.collection.PageCollection;
import com.zpj.entity.PageEntity;
import com.zpj.parser.HtmlParser;

/**
 * @author PerKins Zhu
 * @date 2016-09-02 21:02:39
 * @version 1.1
 * 
 */

public abstract class WebSprider implements Runnable {
    private HtmlParser parser;
    // Nodes still waiting to be visited.
    private PageCollection<PageEntity> collection = new PageCollection<PageEntity>();
    // hashCodes of the nodes already visited.
    private PageCollection<Integer> visitedCollection = new PageCollection<Integer>();

    public WebSprider(HtmlParser parser, String seed) {
        super();
        this.parser = parser;
        collection.insert(new PageEntity("", seed, 1));
    }

    @Override
    public void run() {
        PageEntity entity;
        while (!collection.isEmpty()) {
            entity = collection.getMaxNode();
            dealPageEntity(entity);
            addToCollection(parser.parser(entity));
        }
    }

    private void addToCollection(List<PageEntity> list) {
        for (int i = 0; i < list.size(); i++) {
            PageEntity pe = list.get(i);
            int hashCode = pe.hashCode();
            if (visitedCollection.get(hashCode) == null) {
                collection.insert(pe);
                visitedCollection.insert(hashCode);
            }
        }
    }

    /**
     * Implemented by subclasses to process each crawled node; the entity can be
     * stored, printed, and so on.
     * @param entity
     * @return
     */
    public abstract Object dealPageEntity(PageEntity entity);
}

 

 Page parser class: HtmlParser

  This is an abstract class; users implement a subclass and provide the two abstract methods public abstract NodeList parserUrl(Parser parser); and public abstract boolean matchesUrl(String url);.

    parserUrl(Parser parser) defines the node filter; the NodeList it returns holds the filtered document nodes.

     matchesUrl(String url) checks whether a URL is one the user wants to crawl, typically via a regular expression; returning true adds the URL to the container, false discards it.

The class depends on WeightStrategy, the weight strategy.

package com.zpj.parser;


import java.net.HttpURLConnection;
import java.net.URL;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.Set;

import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;

import com.zpj.entity.PageEntity;
import com.zpj.weightStrategy.WeightStrategy;

/**
 * @author PerKins Zhu
 * @date 2016-09-02 20:44:39
 * @version 1.1
 * 
 */
public abstract class HtmlParser {
    private WeightStrategy weightStrategy;

    public HtmlParser(WeightStrategy weightStrategy) {
        super();
        this.weightStrategy = weightStrategy;
    }

    public List<PageEntity> parser(PageEntity pageEntity) {
        // Build a fresh list on every call; keeping this as a field would make each
        // call also return all links accumulated from earlier pages.
        List<PageEntity> pageList = new ArrayList<PageEntity>();
        NodeList list = null;
        try {
            String entityUrl = pageEntity.getUrl();
            URL getUrl = new URL(entityUrl);
            HttpURLConnection connection = (HttpURLConnection) getUrl.openConnection();
            connection.connect();
            Parser parser = new Parser(entityUrl);
            // TODO detect the encoding automatically
            parser.setEncoding(getCharSet(connection));

            list = parserUrl(parser);
            setDefaultWeight();
        } catch (Exception e) {
            e.printStackTrace();
        }
        if (list == null)
            return pageList;
        for (int i = 0; i < list.size(); i++) {
            // The node text has the shape a href="..."; strip the 8-character
            // prefix and cut at the closing quote to recover the raw url.
            String url = list.elementAt(i).getText().substring(8);
            int lastIndex = url.indexOf("\"");
            url = url.substring(0, lastIndex == -1 ? url.length() : lastIndex);
            if (matchesUrl(url)) {
                String title = list.elementAt(i).toPlainTextString();
                float weight = weightStrategy.getWeight(title, url);
                // Drop links that gained nothing over the minimum weight.
                if (weight <= 1) {
                    continue;
                }
                pageList.add(new PageEntity(title, url, weight));
            }
        }
        return pageList;
    }

    private void setDefaultWeight() {
        // Still a stub: the page body text should be parsed out here.
        String text = "";
        // Ask the weight strategy for the default weight of every link on this
        // page; it has to be recomputed for each page.
        weightStrategy.setDefaultWeightByContent(text);
    }

    /**
     * Implemented by subclasses: filter the document and return the matching node set.
     * @param parser
     * @return
     */
    public abstract NodeList parserUrl(Parser parser);

    /**
     * Implemented by subclasses: decide whether a url should be stored.
     * @param url
     * @return
     */
    public abstract boolean matchesUrl(String url);

    // Reads the page encoding from the response headers; defaults to gb2312.
    private String getCharSet(HttpURLConnection connection) {
        Map<String, List<String>> map = connection.getHeaderFields();
        Set<String> keys = map.keySet();
        Iterator<String> iterator = keys.iterator();
        // Walk the headers looking for the character encoding.
        String key = null;
        String tmp = null;
        while (iterator.hasNext()) {
            key = iterator.next();
            tmp = map.get(key).toString().toLowerCase();
            // Pull the charset out of the Content-Type value.
            if (key != null && key.equals("Content-Type")) {
                int m = tmp.indexOf("charset=");
                if (m != -1) {
                    String charSet = tmp.substring(m + 8).replace("]", "");
                    return charSet;
                }
            }
        }
        return "gb2312";
    }

}
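As a side note, the header scan above could likely be shortened with HttpURLConnection.getContentType(), which returns the Content-Type value directly; a sketch under that assumption, with the same gb2312 fallback:

    // Hypothetical replacement for getCharSet; not wired into the class above.
    private String getCharSet(HttpURLConnection connection) {
        String contentType = connection.getContentType(); // e.g. "text/html; charset=utf-8"
        if (contentType != null) {
            int m = contentType.toLowerCase().indexOf("charset=");
            if (m != -1)
                return contentType.substring(m + 8).trim();
        }
        return "gb2312";
    }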

 

 Node weight computation: WeightStrategy. This is an abstract class; users can define their own weighting strategy, mainly by implementing the following two methods:

  public abstract float countWeight(String title, String url): computes the node's own weight from its title and URL; the algorithm is left to the user.

  public abstract void setDefaultWeightByContent(String content): computes the base weight shared by all links on the page.

   That is, for a node on a given page, final weight = base weight (page weight) + node weight. The base weight is derived from the content of the page that contains the link, in setDefaultWeightByContent(String content); countWeight(String title, String url) then computes an individual node's weight, and the final weight is the sum of the two.
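For example, under the WeightStrategy01 shown later: if setDefaultWeightByContent leaves the base weight at its default of 1 and a link's title contains the keyword twice, countWeight returns 2 * 2 = 4 and getWeight returns 1 + 4 = 5.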

package com.zpj.weightStrategy;
/**
 * @author PerKins Zhu
 * @date 2016-09-02 21:04:37
 * @version 1.1
 * 
 */
public abstract class WeightStrategy {
    protected String keyWord;
    protected float defaultWeight = 1;

    public WeightStrategy(String keyWord) {
        this.keyWord = keyWord;
    }

    public float getWeight(String title, String url) {
        return defaultWeight + countWeight(title, url);
    }

    /**
     * Computes the link's own weight; the final result of getWeight is
     * defaultWeight plus the value returned here.
     * @param title
     * @param url
     * @return
     */
    public abstract float countWeight(String title, String url);

    /**
     * Computes the default weight of every link on the page from the page content.
     * @param content
     */
    public abstract void setDefaultWeightByContent(String content);

}

 

 The five classes above are the core: two storage containers, plus the crawler, weight-computation, and page-parsing abstract classes. To use the framework, implement subclasses of the three abstract classes and their abstract methods. A usage example follows:

 

Page parsing: HtmlParser01

package com.zpj.test;

import java.util.regex.Pattern;

import org.htmlparser.Node;
import org.htmlparser.NodeFilter;
import org.htmlparser.Parser;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

import com.zpj.parser.HtmlParser;
import com.zpj.weightStrategy.WeightStrategy;

/**
 * @author PerKins Zhu
 * @date 2016-09-02 20:46:39
 * @version 1.1
 * 
 */
public class HtmlParser01 extends HtmlParser {

    public HtmlParser01(WeightStrategy weightStrategy) {
        super(weightStrategy);
    }

    @Override
    public NodeList parserUrl(Parser parser) {
        // Keep only nodes whose raw text starts with a href= (plain anchor tags).
        NodeFilter hrefNodeFilter = new NodeFilter() {
            @Override
            public boolean accept(Node node) {
                return node.getText().startsWith("a href=");
            }
        };
        try {
            return parser.extractAllNodesThatMatch(hrefNodeFilter);
        } catch (ParserException e) {
            e.printStackTrace();
        }
        return null;
    }

    @Override
    public boolean matchesUrl(String url) {
        Pattern p = Pattern
                .compile("(http://|ftp://|https://|www){0,1}[^\u4e00-\u9fa5\\s]*?\\.(com|net|cn|me|tw|fr)[^\u4e00-\u9fa5\\s]*");
        return p.matcher(url).matches();
    }

}
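As an aside, the startsWith("a href=") test depends on the exact attribute layout htmlparser reports. The library also ships a class-based filter; below is a sketch of an alternative parserUrl, assuming the org.htmlparser.filters.NodeClassFilter and org.htmlparser.tags.LinkTag API (both would need to be imported). Note that HtmlParser.parser() extracts urls with substring(8), which likewise assumes the a href="..." text shape, so adopting this filter would also mean reading urls via LinkTag.getLink() there.

    // Hypothetical variant: match every <a> tag regardless of attribute order.
    @Override
    public NodeList parserUrl(Parser parser) {
        try {
            return parser.extractAllNodesThatMatch(new NodeClassFilter(LinkTag.class));
        } catch (ParserException e) {
            e.printStackTrace();
        }
        return null;
    }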

 

 Weight computation: WeightStrategy01

package com.zpj.test;

import com.zpj.weightStrategy.WeightStrategy;

/**
 * @author PerKins Zhu
 * @date 2016-09-02 20:34:19
 * @version 1.1
 * 
 */
public class WeightStrategy01 extends WeightStrategy {

    public WeightStrategy01(String keyWord) {
        super(keyWord);
    }

    @Override
    public float countWeight(String title, String url) {
        // Every occurrence of the keyword in the title adds 2 to the weight.
        int temp = 0;
        while (-1 != title.indexOf(keyWord)) {
            temp++;
            title = title.substring(title.indexOf(keyWord) + keyWord.length());
        }
        return temp * 2;
    }

    @Override
    public void setDefaultWeightByContent(String content) {
        // Parse the page text to compute defaultWeight; fixed at 1 for now.
        super.defaultWeight = 1;
    }

}

 

Crawler main class: MyWebSprider01

package com.zpj.test;

import com.zpj.entity.PageEntity;
import com.zpj.parser.HtmlParser;
import com.zpj.sprider.WebSprider;

/**
 * @author PerKins Zhu
 * @date 2016-09-02 20:54:39
 * @version 1.1
 * 
 */
public class MyWebSprider01 extends WebSprider {

    public MyWebSprider01(HtmlParser parser, String seed) {
        super(parser, seed);
    }

    @Override
    public Object dealPageEntity(PageEntity entity) {
        System.out.println(entity.getTitle() + "---" + entity.getWeight() + "--" + entity.getUrl());
        return null;
    }

}

 

 Test class: RunThread

package com.zpj.test;

import com.zpj.parser.HtmlParser;
import com.zpj.sprider.WebSprider;
import com.zpj.weightStrategy.WeightStrategy;

/**
 * @author PerKins Zhu
 * @date 2016-09-02 20:34:26
 * @version 1.1
 * 
 */
public class RunThread {

    public static void main(String[] args) {

        WeightStrategy weightStrategy = new WeightStrategy01("中國");

        HtmlParser htmlParser = new HtmlParser01(weightStrategy);

        WebSprider sprider01 = new MyWebSprider01(htmlParser, "http://news.baidu.com/");

        Thread thread01 = new Thread(sprider01);

        thread01.start();

    }
}
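Since each WebSprider instance owns its own to-visit and visited containers, several spiders can run in parallel on separate threads; each should get its own HtmlParser and WeightStrategy instance, as those objects carry mutable state. A sketch with a second, hypothetical seed:

        WeightStrategy weightStrategy02 = new WeightStrategy01("中國");
        HtmlParser htmlParser02 = new HtmlParser01(weightStrategy02);
        WebSprider sprider02 = new MyWebSprider01(htmlParser02, "http://news.sina.com.cn/"); // hypothetical seed
        new Thread(sprider02).start();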

 

 

 Things that still need work: the storage efficiency of the containers; the cap on visited nodes (at most 1,000,000,000 distinct hash values, and in practice far fewer); the URL deduplication strategy; extending PageEntity; and how crawled nodes are processed. These will be improved step by step, and updates will be pushed to https://github.com/PerkinsZhu/WebSprider

   If you are interested in crawlers and have suggestions, or spot mistakes in the program, please leave a comment!

Please credit the source when reposting!
