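One of the key features listed in Jsoup's official description is ***output tidy HTML***. Here we look at how Jsoup produces its HTML output.
Before digging into the code, it is worth thinking about what "tidy HTML" actually covers.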
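Some background on HTML tags is needed here. HTML tags fall into two categories, block and inline. For the definitions of inline and block tags, see http://www.w3schools.com/html/html_blocks.asp; Jsoup's `Tag` class is also excellent study material for Java developers.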
<!-- lang: java -->
    // internal static initialisers:
    // prepped from http://www.w3.org/TR/REC-html40/sgml/dtd.html and other sources
    // block tags: need a line break
    private static final String[] blockTags = {
            "html", "head", "body", "frameset", "script", "noscript", "style", "meta", "link", "title", "frame",
            "noframes", "section", "nav", "aside", "hgroup", "header", "footer", "p", "h1", "h2", "h3", "h4", "h5", "h6",
            "ul", "ol", "pre", "div", "blockquote", "hr", "address", "figure", "figcaption", "form", "fieldset", "ins",
            "del", "s", "dl", "dt", "dd", "li", "table", "caption", "thead", "tfoot", "tbody", "colgroup", "col", "tr", "th",
            "td", "video", "audio", "canvas", "details", "menu", "plaintext"
    };
    // inline tags: no line break needed
    private static final String[] inlineTags = {
            "object", "base", "font", "tt", "i", "b", "u", "big", "small", "em", "strong", "dfn", "code", "samp", "kbd",
            "var", "cite", "abbr", "time", "acronym", "mark", "ruby", "rt", "rp", "a", "img", "br", "wbr", "map", "q",
            "sub", "sup", "bdo", "iframe", "embed", "span", "input", "select", "textarea", "label", "button", "optgroup",
            "option", "legend", "datalist", "keygen", "output", "progress", "meter", "area", "param", "source", "track",
            "summary", "command", "device"
    };
    // emptyTags cannot have content; these tags can all be self-closing
    private static final String[] emptyTags = {
            "meta", "link", "base", "frame", "img", "br", "wbr", "embed", "hr", "input", "keygen", "col", "command",
            "device"
    };
    private static final String[] formatAsInlineTags = {
            "title", "a", "p", "h1", "h2", "h3", "h4", "h5", "h6", "pre", "address", "li", "th", "td", "script", "style",
            "ins", "del", "s"
    };
    // whitespace must be preserved inside these tags
    private static final String[] preserveWhitespaceTags = {
            "pre", "plaintext", "title", "textarea"
    };
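To make the role of these arrays a bit more concrete, here is a minimal sketch of how such name lists could be folded into a lookup table of per-tag flags. `TagInfo` and `TagRegistry` are hypothetical names invented for this illustration, and the arrays are tiny subsets of the ones above; this is not Jsoup's actual `Tag` implementation.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical, simplified record of per-tag flags (not Jsoup's Tag class).
final class TagInfo {
    final String name;
    boolean isBlock = true;             // default: treated as a block tag
    boolean formatAsBlock = true;       // default: gets its own line when pretty-printing
    boolean isEmpty = false;            // true if the tag can never have content
    boolean preserveWhitespace = false; // true if inner whitespace must be kept

    TagInfo(String name) {
        this.name = name;
    }
}

final class TagRegistry {
    private static final Map<String, TagInfo> TAGS = new HashMap<>();

    // Deliberately small subsets, chosen only for this sketch.
    private static final String[] BLOCK = {"div", "p", "pre", "hr"};
    private static final String[] INLINE = {"span", "a", "br"};
    private static final String[] EMPTY = {"hr", "br"};
    private static final String[] FORMAT_AS_INLINE = {"p", "pre", "a"};
    private static final String[] PRESERVE_WHITESPACE = {"pre"};

    static {
        // Register block tags with the default flags.
        for (String n : BLOCK) TAGS.put(n, new TagInfo(n));
        // Register inline tags, which are neither block nor formatted as block.
        for (String n : INLINE) {
            TagInfo t = new TagInfo(n);
            t.isBlock = false;
            t.formatAsBlock = false;
            TAGS.put(n, t);
        }
        // Apply the modifier lists to tags that are already registered.
        for (String n : EMPTY) TAGS.get(n).isEmpty = true;
        for (String n : FORMAT_AS_INLINE) TAGS.get(n).formatAsBlock = false;
        for (String n : PRESERVE_WHITESPACE) TAGS.get(n).preserveWhitespace = true;
    }

    static TagInfo lookup(String name) {
        return TAGS.get(name);
    }
}
```

In this sketch `p` stays a block tag but is no longer formatted as a block, which is the kind of distinction the `formatAsInlineTags` list expresses.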
In addition, Jsoup's `Entities` class handles HTML entity escaping. The mapping data is stored in `entities-full.properties` and `entities-base.properties`.
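As a rough illustration of table-driven escaping, the sketch below loads a few entity mappings from properties-style text and uses them to escape a string. The `EntitySketch` class, the `name=hex code point` layout of the sample data, and the four entries are all assumptions made for this example; they are not taken from Jsoup's real properties files or its `Entities` code.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;

final class EntitySketch {
    // Hypothetical stand-in for a properties file: entity name = hex code point.
    private static final String SAMPLE = "amp=26\nlt=3C\ngt=3E\nquot=22\n";

    // Reverse lookup used when escaping: character -> entity name.
    private static final Map<Character, String> NAME_BY_CHAR = new HashMap<>();

    static {
        Properties props = new Properties();
        try {
            props.load(new StringReader(SAMPLE));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        for (String name : props.stringPropertyNames()) {
            char c = (char) Integer.parseInt(props.getProperty(name), 16);
            NAME_BY_CHAR.put(c, name);
        }
    }

    // Replaces every character that has a known entity with &name;.
    static String escape(String text) {
        StringBuilder accum = new StringBuilder(text.length() * 2);
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            String name = NAME_BY_CHAR.get(c);
            if (name != null)
                accum.append('&').append(name).append(';');
            else
                accum.append(c);
        }
        return accum.toString();
    }
}
```

Keeping the mappings in data rather than in code is what lets a library ship a small base table and a fuller one side by side.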
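In Jsoup, calling `Document.toString()` (inherited from `Element`) directly outputs the document. `OutputSettings` controls the output format, chiefly through `prettyPrint` (whether to reformat the output), `outline` (whether to force every tag onto its own line), and `indentAmount` (the indent width).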
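The inheritance and call relationships involved are somewhat tangled; roughly, they look like this:
`Document.toString()` => `Document.outerHtml()` => `Element.html()`. In the end, `Element.html()` loops over all child nodes, calls `outerHtml()` on each, and concatenates the results as the output.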
<!-- lang: java -->
    private void html(StringBuilder accum) {
        for (Node node : childNodes)
            node.outerHtml(accum);
    }
`outerHtml()` in turn uses an `OuterHtmlVisitor` to traverse the node together with all of its child nodes and assemble the result.
<!-- lang: java -->
    protected void outerHtml(StringBuilder accum) {
        new NodeTraversor(new OuterHtmlVisitor(accum, getOutputSettings())).traverse(this);
    }
`OuterHtmlVisitor` visits every node during the traversal and calls two methods on each: `node.outerHtmlHead()` and `node.outerHtmlTail()`.
<!-- lang: java -->
    private static class OuterHtmlVisitor implements NodeVisitor {
        private StringBuilder accum;
        private Document.OutputSettings out;

        OuterHtmlVisitor(StringBuilder accum, Document.OutputSettings out) {
            this.accum = accum;
            this.out = out;
        }

        public void head(Node node, int depth) {
            node.outerHtmlHead(accum, depth, out);
        }

        public void tail(Node node, int depth) {
            if (!node.nodeName().equals("#text")) // saves a void hit.
                node.outerHtmlTail(accum, depth, out);
        }
    }
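The head/tail pattern above can be sketched in isolation. In the following example, `SimpleNode`, `SimpleVisitor`, and `TraversalSketch` are hypothetical names made up for illustration, and the traversal is written recursively for brevity; Jsoup's own `NodeTraversor` may well be implemented differently.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical minimal tree node: a name plus child nodes.
final class SimpleNode {
    final String name;
    final List<SimpleNode> children = new ArrayList<>();

    SimpleNode(String name, SimpleNode... kids) {
        this.name = name;
        for (SimpleNode kid : kids) children.add(kid);
    }
}

// Callback pair: head is called on entering a node, tail on leaving it.
interface SimpleVisitor {
    void head(SimpleNode node, int depth);

    void tail(SimpleNode node, int depth);
}

final class TraversalSketch {
    // Depth-first walk: head, then all children, then tail.
    static void traverse(SimpleNode node, int depth, SimpleVisitor visitor) {
        visitor.head(node, depth);
        for (SimpleNode child : node.children)
            traverse(child, depth + 1, visitor);
        visitor.tail(node, depth);
    }

    // Renders the tree as nested tags by appending in head() and tail().
    static String render(SimpleNode root) {
        final StringBuilder accum = new StringBuilder();
        traverse(root, 0, new SimpleVisitor() {
            public void head(SimpleNode node, int depth) {
                accum.append('<').append(node.name).append('>');
            }

            public void tail(SimpleNode node, int depth) {
                accum.append("</").append(node.name).append('>');
            }
        });
        return accum.toString();
    }
}
```

Because the opening tag is written in `head` and the closing tag in `tail`, everything a node's children append naturally ends up between the two.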
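At last we have found the code that does the real work: `node.outerHtmlHead()` and `node.outerHtmlTail()`. Each kind of `Node` in Jsoup is output in its own way; here we only cover the two main ones, `Element` and `TextNode`. `Element` is the main target of formatting, and its two methods are as follows: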
<!-- lang: java -->
    void outerHtmlHead(StringBuilder accum, int depth, Document.OutputSettings out) {
        if (accum.length() > 0 && out.prettyPrint()
                && (tag.formatAsBlock() || (parent() != null && parent().tag().formatAsBlock()) || out.outline()))
            // line break and adjust indentation
            indent(accum, depth, out);
        accum
                .append("<")
                .append(tagName());
        attributes.html(accum, out);

        if (childNodes.isEmpty() && tag.isSelfClosing())
            accum.append(" />");
        else
            accum.append(">");
    }

    void outerHtmlTail(StringBuilder accum, int depth, Document.OutputSettings out) {
        if (!(childNodes.isEmpty() && tag.isSelfClosing())) {
            if (out.prettyPrint() && (!childNodes.isEmpty() && (
                    tag.formatAsBlock() || (out.outline() && (childNodes.size() > 1
                            || (childNodes.size() == 1 && !(childNodes.get(0) instanceof TextNode))))
            )))
                // line break and adjust indentation
                indent(accum, depth, out);
            accum.append("</").append(tagName()).append(">");
        }
    }
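The two nested conditions are dense, so it may help to see them restated as standalone predicates over plain booleans. `IndentRules` and its parameter names are hypothetical, introduced only for this sketch; the logic is a direct transcription of the conditions in the code above, and it covers only the line-break decision, not whether a closing tag is written at all.

```java
// Hypothetical helper restating the line-break conditions as pure functions.
final class IndentRules {
    // Line break before an opening tag: not at the very start of the output,
    // pretty-printing on, and either the tag or its parent formats as a block,
    // or outline mode is on.
    static boolean breakBeforeOpenTag(int accumLength, boolean prettyPrint,
                                      boolean tagFormatsAsBlock, boolean parentFormatsAsBlock,
                                      boolean outline) {
        return accumLength > 0 && prettyPrint
                && (tagFormatsAsBlock || parentFormatsAsBlock || outline);
    }

    // Line break before a closing tag: pretty-printing on, at least one child,
    // and either the tag formats as a block, or outline mode is on and the
    // content is more than a single text node.
    static boolean breakBeforeCloseTag(boolean prettyPrint, boolean tagFormatsAsBlock,
                                       boolean outline, int childCount,
                                       boolean onlyChildIsText) {
        if (!prettyPrint || childCount == 0)
            return false;
        boolean outlineBreak = outline
                && (childCount > 1 || (childCount == 1 && !onlyChildIsText));
        return tagFormatsAsBlock || outlineBreak;
    }
}
```

Read this way, an element whose tag does not format as a block and whose only child is a text node keeps its closing tag on the same line even in outline mode.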
The `indent` method is a single line:
<!-- lang: java -->
    protected void indent(StringBuilder accum, int depth, Document.OutputSettings out) {
        // out.indentAmount() is the indent width; the default is 1
        accum.append("\n").append(StringUtil.padding(depth * out.indentAmount()));
    }
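The code is simple and clear, so there is not much to add. One thing worth mentioning: to reduce string creation, `StringUtil.padding()` keeps commonly used indents in an array.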
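The caching idea can be sketched as follows. `PaddingSketch` is a hypothetical class written for this post, and the cache size of 11 entries is an arbitrary choice; it only illustrates the technique of precomputing common indents and is not a copy of Jsoup's `StringUtil`.

```java
import java.util.Arrays;

final class PaddingSketch {
    // Precomputed padding strings for widths 0..10 (size chosen arbitrarily here).
    private static final String[] CACHE = new String[11];

    static {
        StringBuilder accum = new StringBuilder();
        for (int i = 0; i < CACHE.length; i++) {
            CACHE[i] = accum.toString();
            accum.append(' ');
        }
    }

    // Returns a string of `width` spaces, reusing a cached instance when possible.
    static String padding(int width) {
        if (width < 0)
            throw new IllegalArgumentException("width must be >= 0");
        if (width < CACHE.length)
            return CACHE[width];
        // Uncommon, wide indents are built on demand.
        char[] out = new char[width];
        Arrays.fill(out, ' ');
        return new String(out);
    }

    // Mirrors the one-line indent above: a newline followed by the padding.
    static void indent(StringBuilder accum, int depth, int indentAmount) {
        accum.append('\n').append(padding(depth * indentAmount));
    }
}
```

Since indentation is appended once per formatted tag, returning a shared string for small widths avoids allocating a new one on nearly every call.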
That's it for this fairly light post; the next one covers the parser, which has more technical depth.
Also, this section taught us to name a `StringBuilder` `accum` rather than `sb`.