To parse an HTML document:
String html = "<html><head><title>First parse</title></head>"
  + "<body><p>Parsed HTML into a doc.</p></body></html>";
Document doc = Jsoup.parse(html);
(See parsing a document from a string for more info.)
The parser will make every attempt to create a clean parse from the HTML you provide, whether or not that HTML is well-formed. It handles:
unclosed tags (e.g. <p>Lorem <p>Ipsum parses to <p>Lorem</p> <p>Ipsum</p>)
implicit tags (e.g. a naked <td>Table data</td> is wrapped into a <table><tr><td>...)
reliably creating the document structure (html containing a head and body, and only appropriate elements within the head)
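The cleanup described above can be checked with a small experiment. The following is a minimal sketch, assuming jsoup is on the classpath; the class name LenientParseSketch and the helper countMatches are hypothetical names introduced only for this illustration. It covers the unclosed-tag case and the head/body structure, not every case in the list.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

public class LenientParseSketch {

    // Hypothetical helper: parses the HTML and counts elements matching a CSS query.
    static int countMatches(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        return doc.select(cssQuery).size();
    }

    public static void main(String[] args) {
        // Two unclosed <p> tags become two separate paragraph elements.
        System.out.println(countMatches("<p>Lorem <p>Ipsum", "p")); // prints 2

        // No <html>, <head> or <body> in the input: the parser supplies them,
        // placing <title> in the head and <p> in the body.
        String bare = "<title>T</title><p>Hi";
        System.out.println(countMatches(bare, "head > title")); // prints 1
        System.out.println(countMatches(bare, "body > p"));     // prints 1
    }
}
```

Counting matches with a selector keeps the check independent of how the parser pretty-prints its output.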
Documents consist of Elements and TextNodes (and a couple of other misc nodes: see the nodes package tree).
The inheritance chain is: Document extends Element extends Node. TextNode extends Node.
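As an illustration of the inheritance chain, here is a minimal sketch, assuming jsoup is on the classpath. The class name InheritanceSketch and the helper firstChildOfParagraph are hypothetical and exist only for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

public class InheritanceSketch {

    // Hypothetical helper: returns the first child node of the first <p> in the HTML.
    static Node firstChildOfParagraph(String html) {
        Document doc = Jsoup.parse(html);
        Element p = doc.select("p").first();
        return p.childNode(0);
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse("<p>Hello</p>");

        // These assignments compile because Document extends Element extends Node.
        Element asElement = doc;
        Node asNode = doc;
        System.out.println(asElement == asNode); // prints true (same object)

        // The text inside <p> is a TextNode, which is a Node but not an Element.
        Node text = firstChildOfParagraph("<p>Hello</p>");
        System.out.println(text instanceof TextNode); // prints true
        System.out.println(text instanceof Element);  // prints false
    }
}
```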
An Element contains a list of child Nodes, and has one parent Element. It also provides a filtered list of child Elements only.
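The difference between the two child lists can be seen in a minimal sketch, assuming jsoup is on the classpath. The class name ChildrenSketch and the helper childCounts are hypothetical names used only here. childNodes() returns every child node, text included, while children() returns only the child Elements.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class ChildrenSketch {

    // Hypothetical helper: returns {all child nodes, child elements only}
    // for the first <div> in the HTML.
    static int[] childCounts(String html) {
        Document doc = Jsoup.parse(html);
        Element div = doc.select("div").first();
        return new int[] { div.childNodes().size(), div.children().size() };
    }

    public static void main(String[] args) {
        String html = "<div>Intro <b>bold</b> and <i>italic</i></div>";
        int[] counts = childCounts(html);

        // Child nodes: "Intro ", <b>, " and ", <i>
        System.out.println(counts[0]); // prints 4
        // Child elements: <b>, <i>
        System.out.println(counts[1]); // prints 2

        // Every element has a single parent; here the div sits directly in the body.
        Element div = Jsoup.parse(html).select("div").first();
        System.out.println(div.parent().tagName()); // prints body
    }
}
```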
Extracting data: DOM navigation
Extracting data: Selector syntax