Reading the Jsoup Source, Part 3: Document Output

Date: 2019-12-04

One of the key features listed in the official Jsoup documentation is ***output tidy HTML***. This post looks at how Jsoup produces that output.

HTML background

Before reading the code, it is worth asking what "tidy HTML" actually involves:

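Line breaks: block-level tags conventionally sit on a line of their own
Indentation: the leading indent depends on how deeply the tag is nested
Strict tag closing: a tag that can self-close and has no content is written in self-closed form
Escaping of HTML entities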

A little background on HTML tags helps here. HTML tags fall into two categories, block and inline. For the definitions, see http://www.w3schools.com/html/html_blocks.asp; Jsoup's Tag class is also excellent study material for Java developers.

<!-- lang: java -->
// internal static initialisers:
// prepped from http://www.w3.org/TR/REC-html40/sgml/dtd.html and other sources
// block tags: these need a line break
private static final String[] blockTags = {
        "html", "head", "body", "frameset", "script", "noscript", "style", "meta", "link", "title", "frame",
        "noframes", "section", "nav", "aside", "hgroup", "header", "footer", "p", "h1", "h2", "h3", "h4", "h5", "h6",
        "ul", "ol", "pre", "div", "blockquote", "hr", "address", "figure", "figcaption", "form", "fieldset", "ins",
        "del", "s", "dl", "dt", "dd", "li", "table", "caption", "thead", "tfoot", "tbody", "colgroup", "col", "tr", "th",
        "td", "video", "audio", "canvas", "details", "menu", "plaintext"
};
// inline tags: no line break needed
private static final String[] inlineTags = {
        "object", "base", "font", "tt", "i", "b", "u", "big", "small", "em", "strong", "dfn", "code", "samp", "kbd",
        "var", "cite", "abbr", "time", "acronym", "mark", "ruby", "rt", "rp", "a", "img", "br", "wbr", "map", "q",
        "sub", "sup", "bdo", "iframe", "embed", "span", "input", "select", "textarea", "label", "button", "optgroup",
        "option", "legend", "datalist", "keygen", "output", "progress", "meter", "area", "param", "source", "track",
        "summary", "command", "device"
};
// emptyTags cannot have content; all of these tags can be self-closed
private static final String[] emptyTags = {
        "meta", "link", "base", "frame", "img", "br", "wbr", "embed", "hr", "input", "keygen", "col", "command",
        "device"
};
private static final String[] formatAsInlineTags = {
        "title", "a", "p", "h1", "h2", "h3", "h4", "h5", "h6", "pre", "address", "li", "th", "td", "script", "style",
        "ins", "del", "s"
};
// whitespace must be preserved inside these tags
private static final String[] preserveWhitespaceTags = {
        "pre", "plaintext", "title", "textarea"
};
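
To make the role of these arrays more concrete, here is a minimal sketch of how such lists could be turned into per-tag formatting flags. It is not Jsoup's Tag class: the class name `TagKindSketch`, the small subsets of tags, and the rule "format as block means block and not listed in formatAsInlineTags" are assumptions made for illustration, and unknown tags are simply treated as inline.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Jsoup.
final class TagKindSketch {
    // Small subsets of the arrays listed above, to keep the sketch short.
    private static final Set<String> BLOCK =
            new HashSet<>(Arrays.asList("div", "p", "ul", "li", "hr", "pre"));
    private static final Set<String> EMPTY =
            new HashSet<>(Arrays.asList("img", "br", "hr", "input", "meta"));
    private static final Set<String> FORMAT_AS_INLINE =
            new HashSet<>(Arrays.asList("p", "li", "pre", "a"));

    // True if the tag is in the block list.
    static boolean isBlock(String tag) {
        return BLOCK.contains(tag);
    }

    // Assumed rule: a block tag is laid out as a block unless it is
    // explicitly listed as "format as inline".
    static boolean formatAsBlock(String tag) {
        return isBlock(tag) && !FORMAT_AS_INLINE.contains(tag);
    }

    // Empty tags cannot hold content, so they can be written self-closed.
    static boolean isSelfClosing(String tag) {
        return EMPTY.contains(tag);
    }
}
```

With these subsets, `div` is both a block tag and formatted as a block, `p` is a block tag that is formatted inline, and `br` is an inline tag that can self-close.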

In addition, Jsoup's Entities class handles HTML entity escaping. The mapping data is stored in entities-full.properties and entities-base.properties.
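
As a rough illustration of what entity escaping means, the sketch below handles only four characters with a hard-coded switch. It is an assumption-laden toy, not the Entities class: the name `EscapeSketch` is hypothetical, and the real class loads a much larger table from the properties files mentioned above.

```java
// Hypothetical minimal escaper; covers only &, <, > and the double quote.
final class EscapeSketch {
    static String escape(String text) {
        StringBuilder accum = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '&': accum.append("&amp;"); break;
                case '<': accum.append("&lt;"); break;
                case '>': accum.append("&gt;"); break;
                case '"': accum.append("&quot;"); break;
                default:  accum.append(c); // everything else passes through unchanged
            }
        }
        return accum.toString();
    }
}
```

For example, `EscapeSketch.escape("a < b & c")` yields `a &lt; b &amp; c`.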

How Jsoup implements formatting

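In Jsoup, calling Document.toString() (inherited from Element) outputs the document. OutputSettings controls the output format, mainly through prettyPrint (whether to reformat), outline (whether to force every tag onto its own line), and indentAmount (the indent width).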

The inheritance and call relationships are slightly tangled. Roughly, they look like this:

Document.toString() => Document.outerHtml() => Element.html(). Element.html() in turn loops over all child nodes, calls outerHtml() on each, and concatenates the results as the output.

<!-- lang: java -->
private void html(StringBuilder accum) {
    for (Node node : childNodes)
        node.outerHtml(accum);
}

outerHtml() uses an OuterHtmlVisitor to traverse the node and all of its descendants, assembling the result as it goes.

<!-- lang: java -->
protected void outerHtml(StringBuilder accum) {
    new NodeTraversor(new OuterHtmlVisitor(accum, getOutputSettings())).traverse(this);
}

OuterHtmlVisitor calls two methods on every node it visits: node.outerHtmlHead() and node.outerHtmlTail().

<!-- lang: java -->
private static class OuterHtmlVisitor implements NodeVisitor {
    private StringBuilder accum;
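    private Document.OutputSettings out;

    // Constructor matching the two-argument call in outerHtml() above.
    OuterHtmlVisitor(StringBuilder accum, Document.OutputSettings out) {
        this.accum = accum;
        this.out = out;
    }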

    public void head(Node node, int depth) {
        node.outerHtmlHead(accum, depth, out);
    }

    public void tail(Node node, int depth) {
        if (!node.nodeName().equals("#text")) // saves a void hit.
            node.outerHtmlTail(accum, depth, out);
    }
}
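
The head/tail pattern may be easier to see in isolation. The following sketch is a stand-alone toy, not Jsoup code: `TraversalSketch`, `SimpleNode`, and `Visitor` are hypothetical names, and the traversal is written recursively for brevity, which is an assumption rather than a description of how NodeTraversor is implemented. It shows that head fires before a node's children are visited and tail fires after, with depth increasing by one per level.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-alone model of a head/tail visitor.
final class TraversalSketch {
    interface Visitor {
        void head(SimpleNode node, int depth); // called before the children
        void tail(SimpleNode node, int depth); // called after the children
    }

    static final class SimpleNode {
        final String name;
        final List<SimpleNode> children = new ArrayList<>();

        SimpleNode(String name, SimpleNode... kids) {
            this.name = name;
            for (SimpleNode kid : kids) children.add(kid);
        }
    }

    // Depth-first walk: head, then every child one level deeper, then tail.
    static void traverse(Visitor visitor, SimpleNode node, int depth) {
        visitor.head(node, depth);
        for (SimpleNode child : node.children) traverse(visitor, child, depth + 1);
        visitor.tail(node, depth);
    }

    // Records the order of the callbacks as a string.
    static String trace(SimpleNode root) {
        final StringBuilder accum = new StringBuilder();
        traverse(new Visitor() {
            public void head(SimpleNode n, int depth) {
                accum.append("<").append(n.name).append(":").append(depth).append(">");
            }
            public void tail(SimpleNode n, int depth) {
                accum.append("</").append(n.name).append(">");
            }
        }, root, 0);
        return accum.toString();
    }
}
```

Tracing a `div` that contains a `p` and a `span` gives `<div:0><p:1></p><span:1></span></div>`, which is the same open/close nesting that the real visitor relies on to emit start and end tags.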

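We have finally reached the code that does the real work: node.outerHtmlHead() and node.outerHtmlTail(). Each Node type in Jsoup is output a little differently; this post covers only the two main ones, Element and TextNode. Element is the main target of formatting, and its two methods look like this: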

<!-- lang: java -->
void outerHtmlHead(StringBuilder accum, int depth, Document.OutputSettings out) {
    if (accum.length() > 0 && out.prettyPrint()
            && (tag.formatAsBlock() || (parent() != null && parent().tag().formatAsBlock()) || out.outline()) )
        // start a new line and adjust the indentation
        indent(accum, depth, out);
    accum
            .append("<")
            .append(tagName());
    attributes.html(accum, out);

    if (childNodes.isEmpty() && tag.isSelfClosing())
        accum.append(" />");
    else
        accum.append(">");
}

void outerHtmlTail(StringBuilder accum, int depth, Document.OutputSettings out) {
    if (!(childNodes.isEmpty() && tag.isSelfClosing())) {
        if (out.prettyPrint() && (!childNodes.isEmpty() && (
                tag.formatAsBlock() || (out.outline() && (childNodes.size()>1 || (childNodes.size()==1 && !(childNodes.get(0) instanceof TextNode))))
        )))
            // start a new line and adjust the indentation
            indent(accum, depth, out);
        accum.append("</").append(tagName()).append(">");
    }
}
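
To see the effect of these two methods on actual output, here is a much simplified sketch. It assumes pretty printing is always on, ignores outline and the parent check, drops attributes and text nodes, and uses an indent width of 1; `PrettyPrintSketch` and `El` are hypothetical names, and the block and selfClosing flags are passed in by hand instead of coming from a Tag.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, heavily simplified version of the head/tail logic above.
final class PrettyPrintSketch {
    static final class El {
        final String tag;
        final boolean block;       // stands in for tag.formatAsBlock()
        final boolean selfClosing; // stands in for tag.isSelfClosing()
        final List<El> children = new ArrayList<>();

        El(String tag, boolean block, boolean selfClosing, El... kids) {
            this.tag = tag;
            this.block = block;
            this.selfClosing = selfClosing;
            for (El kid : kids) children.add(kid);
        }
    }

    // New line followed by one space per nesting level.
    static void indent(StringBuilder accum, int depth) {
        accum.append("\n");
        for (int i = 0; i < depth; i++) accum.append(" ");
    }

    static void write(El el, int depth, StringBuilder accum) {
        boolean emptySelfClosing = el.children.isEmpty() && el.selfClosing;
        // head: block elements start on a new line, except at the very start
        if (accum.length() > 0 && el.block) indent(accum, depth);
        accum.append("<").append(el.tag).append(emptySelfClosing ? " />" : ">");
        for (El child : el.children) write(child, depth + 1, accum);
        // tail: only block elements with children get their end tag on a new line
        if (!emptySelfClosing) {
            if (el.block && !el.children.isEmpty()) indent(accum, depth);
            accum.append("</").append(el.tag).append(">");
        }
    }

    static String render(El root) {
        StringBuilder accum = new StringBuilder();
        write(root, 0, accum);
        return accum.toString();
    }
}
```

Rendering a block `div` that holds an empty block `p` and a self-closing inline `br` puts `<div>` on the first line, ` <p></p><br />` on the second, and `</div>` on the third: the `p` is pushed to a new indented line, the `br` stays inline and is self-closed, and the closing `div` returns to depth 0.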

The indent method itself is a single line:

<!-- lang: java -->
protected void indent(StringBuilder accum, int depth, Document.OutputSettings out) {
    // out.indentAmount() is the indent width; the default is 1
    accum.append("\n").append(StringUtil.padding(depth * out.indentAmount()));
}

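The code is straightforward, so there is not much more to say. One detail is worth noting: to avoid creating strings repeatedly, StringUtil.padding() keeps the commonly used indent strings in an array.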
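
The caching idea can be sketched as follows. This is an illustration of the technique, not the StringUtil source: the class name `PaddingSketch` and the cache size of six entries are arbitrary assumptions.

```java
import java.util.Arrays;

// Hypothetical padding helper with a small cache of common widths.
final class PaddingSketch {
    // Pre-built strings for widths 0 to 5; the size is an arbitrary choice.
    private static final String[] CACHE = {"", " ", "  ", "   ", "    ", "     "};

    static String padding(int width) {
        if (width < 0) throw new IllegalArgumentException("width must be >= 0");
        // Common case: hand back the cached instance, no allocation.
        if (width < CACHE.length) return CACHE[width];
        // Rare case: build a fresh string of spaces.
        char[] out = new char[width];
        Arrays.fill(out, ' ');
        return new String(out);
    }
}
```

Since indentation depth is usually small, most calls hit the array and return an existing String, so the allocation only happens for unusually deep nesting.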

That's it for this rather light post. The next one covers the parser, which is more technically demanding.

One more thing: from this section we also learned to name a StringBuilder accum rather than sb.