Jsoup Source Code Analysis

Date: 2019-12-05

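1. Many of Jsoup's classes share their names with classes in Java's built-in XML parser, which easily gives the impression that Jsoup was built on top of that parser. That impression is wrong. One reason Jsoup works so well is that it uses its own DOM object model, which is not compatible with the Java XML API. Breaking free of the XML API makes the code much leaner. This article describes Jsoup's DOM structure and how the DOM is traversed.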

2. First, let's look at how Jsoup designs its DOM tree

The class diagram below gives an intuitive picture of how Jsoup's DOM classes are designed:

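As you can see, the Node class plays a central role in the design of the whole DOM tree, so let me go through it. This is the declaration of the Node class: public abstract class Node implements Cloneable. These are its fields: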

<!-- lang: java -->
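Node parentNode; // reference to the parent node
List<Node> childNodes; // a List holding references to all child nodes
Attributes attributes; // reference to this node's Attributes object (Attributes is just a wrapper around a LinkedHashMap)
String baseUri; // URL of the page itself, mainly used to turn relative URLs into absolute ones
int siblingIndex; // position among the sibling nodes, derived from the parent's childNodes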

The Node class implements all the methods a node in a DOM tree should have. A few of them are worth a closer look:

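The first method: public String attr(String attributeKey). It looks up an attribute value by attribute name; for example, passing class returns the value of the class attribute. It has one more very useful feature: it can return an absolute URL. Just put "abs:" in front of the name of the attribute whose absolute URL you want, in the form "abs:attributeName". Here is the source: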

<!-- lang: java -->
public String attr(String attributeKey) {
    Validate.notNull(attributeKey);

    if (attributes.hasKey(attributeKey))
        return attributes.get(attributeKey);
    // if attributeKey starts with "abs:" (case-insensitive), return the absolute URL
    else if (attributeKey.toLowerCase().startsWith("abs:"))
        return absUrl(attributeKey.substring("abs:".length()));
    else return "";
}
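The lookup order in attr is easy to miss: an attribute stored under exactly the requested key wins, the "abs:" prefix is only checked afterwards, and an unknown attribute gives an empty string rather than null. Below is a minimal sketch of that order, not Jsoup's code. The class AttrLookupSketch, its static attr helper, and the example.com URLs are all made up for illustration, and java.net.URI stands in for the real absUrl.

```java
import java.net.URI;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical stand-in for the lookup order in Node.attr; this is not Jsoup code.
class AttrLookupSketch {
    static String attr(Map<String, String> attributes, String baseUri, String key) {
        // 1. An attribute stored under exactly this key wins.
        if (attributes.containsKey(key)) {
            return attributes.get(key);
        }
        // 2. Otherwise an "abs:" prefix (in any letter case) asks for the absolute form.
        if (key.toLowerCase().startsWith("abs:")) {
            String name = key.substring("abs:".length());
            if (!attributes.containsKey(name)) {
                return "";
            }
            // Assumption: URI.resolve replaces the real absUrl here.
            return URI.create(baseUri).resolve(attributes.get(name)).toString();
        }
        // 3. Unknown attributes yield an empty string, never null.
        return "";
    }

    public static void main(String[] args) {
        Map<String, String> attributes = new LinkedHashMap<>();
        attributes.put("href", "../news/today.html");
        String base = "http://example.com/blog/2019/index.html";
        System.out.println(attr(attributes, base, "href"));
        System.out.println(attr(attributes, base, "ABS:href"));
        System.out.println(attr(attributes, base, "title").isEmpty());
    }
}
```

With the real library the same idea is a single call on an element, for example attr("abs:href").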

Here is the useful public String absUrl(String attributeKey) method:

<!-- lang: java -->
public String absUrl(String attributeKey) {
    Validate.notEmpty(attributeKey);
    // fetch the attribute value first; it may be a relative URL
    String relUrl = attr(attributeKey);
    // attribute not present: return an empty string
    if (!hasAttr(attributeKey)) {
        return ""; // nothing to make absolute with
    } else {
        URL base;
        try {
            try {
                // check whether the base URI supplied by the user is well-formed
                base = new URL(baseUri);
            } catch (MalformedURLException e) {
                // the base is unsuitable, but the attribute may be abs on its own, so try that
                URL abs = new URL(relUrl);
                return abs.toExternalForm();
            }
            // workaround: java resolves '//path/file + ?foo' to '//path/?foo', not '//path/file?foo' as desired
            if (relUrl.startsWith("?"))
                relUrl = base.getPath() + relUrl;
            URL abs = new URL(base, relUrl);
            return abs.toExternalForm();
        } catch (MalformedURLException e) {
            return "";
        }
    }
}

This is very easy to follow. I used to convert relative URLs to absolute ones by hand, so here is my old code as well:

// Given a relative address and a base address, return the absolute address

<!-- lang: java -->
private  String absURL(String link, String url) {
	String returnLink = "";
	if(link.indexOf("http") != 0) {
		String[] cutUrl = url.split("\\/");
		int len = cutUrl.length;
		if(link.indexOf("../") == 0) {
			String[] tempLinks = link.split("\\/");
			int j = 1;
			link = "";
			for (; j < tempLinks.length-1; j++) {
				link = link + tempLinks[j] + "/";
			}
			link += tempLinks[j];
			len = len -1;
			for (int i=0; i<len-1; i++) {
				returnLink += cutUrl[i]+"/";
			}
			returnLink += link;
		}else {
			if(link.indexOf("/") == 0 || link.indexOf("./") == 0 ) {
				String[] tempLinks = link.split("\\/");
				int j = 1;
				for (; j < tempLinks.length-1; j++) {
					link = tempLinks[j] + "/";
				}
				link += tempLinks[j];
			}
			for (int i=0; i<len-1; i++) {
				returnLink += cutUrl[i]+"/";
			}
			returnLink += link;
		}
	} else {
		returnLink = link;
	}
	return returnLink;
}

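Heh, the code I wrote back then was pretty rough: the only protocol it checks for is http, and my string handling was weak. Next to the library's version it simply does not compare. Studying more and thinking harder really matters.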
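For comparison, the cases my hand-written method tries to cover ("../", "/", "./", and links that are already absolute) can all be handed to the standard library. The following is only a sketch: the class name ResolveSketch and the example.com URLs are invented, the helper keeps the (link, url) parameter order of my old method, and it uses java.net.URI rather than the java.net.URL calls that Jsoup's absUrl makes.

```java
import java.net.URI;

// Sketch only: resolves a link against a base URL with the standard library.
class ResolveSketch {
    // Hypothetical helper; same (link, url) parameter order as the hand-written method.
    static String absUrl(String link, String url) {
        // URI.resolve applies the relative reference to the base and normalizes the path.
        return URI.create(url).resolve(link).toString();
    }

    public static void main(String[] args) {
        String base = "http://example.com/a/b/index.html";
        String[] links = {
            "../img/logo.png",        // parent directory
            "/top.html",              // site root
            "./page2.html",           // current directory, explicit
            "page3.html",             // current directory, implicit
            "https://other.example/x" // already absolute, any scheme
        };
        for (String link : links) {
            System.out.println(link + " -> " + absUrl(link, base));
        }
    }
}
```

Note that URI.create throws IllegalArgumentException on input it cannot parse (a link containing a space, for instance), so real code would need to catch that, much as absUrl catches MalformedURLException and returns an empty string.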

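The second method is tree traversal: traverse(NodeVisitor nodeVisitor). Below is the depth-first traversal that Jsoup implements with a loop instead of recursion. In this code, visitor is the NodeVisitor that was passed in; head is called when a node is first reached and tail once all of its descendants have been visited: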

<!-- lang: java -->
public void traverse(Node root) {
    Node node = root;
    int depth = 0; 
    while (node != null) {
        visitor.head(node, depth);
        if (node.childNodeSize() > 0) {
            node = node.childNode(0);
            depth++;
        } else {
            while (node.nextSibling() == null && depth > 0) {
                visitor.tail(node, depth);
                node = node.parent();
                depth--;
            }
            visitor.tail(node, depth);
            if (node == root)
                break;
            node = node.nextSibling();
        }
    }

}
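To see the order in which head and tail fire, the same loop can be run on a tiny tree. This is a sketch under my own assumptions: MiniNode and Visitor are hypothetical miniatures, not Jsoup's Node and NodeVisitor, and they keep only the parent reference, the child list and the sibling index that the loop needs.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical miniature of the tree and the visitor; these are not Jsoup's classes.
class TraverseSketch {
    static class MiniNode {
        final String name;
        MiniNode parent;
        final List<MiniNode> children = new ArrayList<>();
        int siblingIndex; // position inside parent.children

        MiniNode(String name) { this.name = name; }

        // Appends a child and returns this node so calls can be chained.
        MiniNode add(MiniNode child) {
            child.parent = this;
            child.siblingIndex = children.size();
            children.add(child);
            return this;
        }

        MiniNode nextSibling() {
            if (parent == null) return null;
            int next = siblingIndex + 1;
            return next < parent.children.size() ? parent.children.get(next) : null;
        }
    }

    interface Visitor {
        void head(MiniNode node, int depth);
        void tail(MiniNode node, int depth);
    }

    // Same loop shape as the traverse method above, applied to MiniNode.
    static void traverse(MiniNode root, Visitor visitor) {
        MiniNode node = root;
        int depth = 0;
        while (node != null) {
            visitor.head(node, depth);
            if (!node.children.isEmpty()) {
                node = node.children.get(0); // descend to the first child
                depth++;
            } else {
                // No sibling left: close this node and climb until one is found.
                while (node.nextSibling() == null && depth > 0) {
                    visitor.tail(node, depth);
                    node = node.parent;
                    depth--;
                }
                visitor.tail(node, depth);
                if (node == root) break;
                node = node.nextSibling();
            }
        }
    }

    // Records every head/tail call as "head name depth" or "tail name depth".
    static List<String> trace(MiniNode root) {
        List<String> events = new ArrayList<>();
        traverse(root, new Visitor() {
            public void head(MiniNode n, int d) { events.add("head " + n.name + " " + d); }
            public void tail(MiniNode n, int d) { events.add("tail " + n.name + " " + d); }
        });
        return events;
    }

    public static void main(String[] args) {
        MiniNode body = new MiniNode("body")
            .add(new MiniNode("div").add(new MiniNode("p")))
            .add(new MiniNode("span"));
        trace(body).forEach(System.out::println);
    }
}
```

For the tree body(div(p), span), the recorded order is head body, head div, head p, tail p, tail div, head span, tail span, tail body. Every head is matched by a tail, which is what lets a visitor emit opening and closing tags without keeping its own stack.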

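Here is my own recursive depth-first traversal. Recursion has the advantage of being easy to understand, but it needs more memory, because every level of the tree adds another frame to the call stack: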

<!-- lang: java -->
// Visits every descendant of root in document order; root itself is not visited.
public static void depthFirstVisitor(Node root, int depth) {
	depth++;
	List<Node> childs = root.childNodes();
	if(childs == null || childs.size() == 0) return;
	for(Node curNode : childs) {
		visit(curNode, depth);
		depthFirstVisitor(curNode, depth);
	}
}

Below is a breadth-first traversal implemented with a queue:

<!-- lang: java -->
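private static void breadthFirstVisitor(Node node, int depth) {
	Queue<Node> nodeQueue = new LinkedList<Node>();
	nodeQueue.offer(node);
	while (!nodeQueue.isEmpty()) {
		// Everything in the queue at this point belongs to the same level.
		int levelSize = nodeQueue.size();
		for (int i = 0; i < levelSize; i++) {
			Node curNode = nodeQueue.poll();
			visit(curNode, depth); // pass the real depth instead of a constant 0
			for (Node child : curNode.childNodes()) {
				nodeQueue.offer(child);
			}
		}
		depth++;
	}
}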

These algorithms are the most basic thing expected of a programmer, and I need to practice them more.