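In the previous sections, we learned some CSS selectors and how to use them, along with the rvest functions for extracting web page content.
In general, CSS selectors are sufficient for the vast majority of HTML node matching needs. However, when nodes must be selected according to more specialized conditions, a more powerful technique is needed.
The web page shown in Figure 14-5 is slightly more complex than data/products.html:
This web page is stored as a standalone HTML file at data/new-products.html. The full source code is long, so only the <body> part is shown here. Please read through the source code to get an impression of its structure: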
<body>
<h1>New Products</h1>
<p>The following is a list of products</p>
<div id = "list" class = "product-list">
<ul>
<li>
<span class = "name">Product-A</span>
<span class = "price">$199.95</span>
<div class = "info bordered">
<p>Description for Product-A</p>
<ul>
<li><span class = "info-key">Quality</span> <span class = "info-value">Good</span></li>
<li><span class = "info-key">Duration</span> <span class = "info-value">5 </span><span class = "unit">years</span></li>
</ul>
</div>
</li>
<li class = "selected">
<span class = "name">Product-B</span>
<span class = "price">$129.95</span>
<div class = "info">
<p>Description for Product-B</p>
<ul>
<li><span class = "info-key">Quality</span> <span class = "info-value">Medium</span></li>
<li><span class = "info-key">Duration</span> <span class = "info-value">2</span><span class = "unit">years</span></li>
</ul>
</div>
</li>
<li>
<span class = "name">Product-C</span>
<span class = "price">$99.95</span>
<div class = "info">
<p>Description for Product-C</p>
<ul>
<li><span class = "info-key">Quality</span> <span class = "info-value">Good</span></li>
<li><span class = "info-key">Duration</span> <span class = "info-value">4</span><span class = "unit">years</span></li>
</ul>
</div>
</li>
</ul>
</div>
<p>All products are available for sale!</p>
</body>
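The source code of the web page contains a stylesheet and a list of product details. Each product has a description and several properties. Next, just as in the previous examples, we load the web page:
page <- read_html("data/new-products.html")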
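The structure of the HTML code is simple and clear. Before digging into XPath, we need to know a little about XML. A well-written, well-organized HTML document can be regarded as a special case of an XML (eXtensible Markup Language) document. Unlike HTML, XML allows arbitrary tags and attributes. The following is a simple XML document: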
<?xml version = "1.0"?>
<root>
<product id = "1">
<name>Product-A</name>
<price>$199.95</price>
</product>
<product id = "2">
<name>Product-B</name>
<price>$129.95</price>
</product>
</root>
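As a small illustration of how XPath addresses such a document, the sketch below parses the same XML from an inline string with xml2 (the package that rvest builds on) and runs a few queries. The variable names are our own, and every closing tag is assumed to be written correctly, because the XML parser rejects documents that are not well formed.

```r
library(xml2)

# Inline copy of the sample document (hypothetical variable name)
xml_source <- '<?xml version="1.0"?>
<root>
  <product id="1">
    <name>Product-A</name>
    <price>$199.95</price>
  </product>
  <product id="2">
    <name>Product-B</name>
    <price>$129.95</price>
  </product>
</root>'

doc <- read_xml(xml_source)

# All <name> nodes that are children of a <product>
product_names <- xml_text(xml_find_all(doc, "//product/name"))

# The id attribute of every <product>
product_ids <- xml_attr(xml_find_all(doc, "//product"), "id")

# The price of the product whose id attribute equals "2"
price_b <- xml_text(xml_find_first(doc, "//product[@id = '2']/price"))

print(product_names)
print(product_ids)
print(price_b)
```

Here product_names holds the two product names, product_ids holds "1" and "2", and price_b holds "$129.95". The same expressions could be passed to html_nodes() through xpath= once a page has been loaded with read_html().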
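XPath is designed specifically for extracting data from XML documents. In this section, we compare XPath expressions with CSS selectors and look at how each is used to extract data from web pages.
The functions html_node() and html_nodes() support XPath expressions through the xpath= argument. Table 14-2 shows some important comparisons between CSS selectors and the equivalent XPath expressions.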
Table 14-2
CSS               XPath                            Match
li > *            //li/*                           All children of <li>
li[attr]          //li[@attr]                      All <li> with attr attribute
li[attr=value]    //li[@attr = 'value']            <li attr = "value">
li#item           //li[@id = 'item']               <li id = "item">
li.info           //li[contains(@class,'info')]    <li class = "info">
li:first-child    //li[1]                          First <li>
li:last-child     //li[last()]                     Last <li>
li:nth-child(n)   //li[n]                          n-th <li>
(N/A)             //p[a]                           All <p> with a child <a>
(N/A)             //p[position() <= 5]             The first five <p> nodes
(N/A)             //p[last()-2]                    The third-to-last <p>
(N/A)             //li[value>0.5]                  All <li> with a child <value> whose value > 0.5
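The positional predicates in the table can be tried on a tiny list. The following is a minimal sketch that uses an inline HTML string rather than the product page; the list content and the helper name texts() are our own assumptions made for illustration.

```r
library(rvest)

# Hypothetical inline list, used only for illustration
items <- read_html(
  "<ul><li>first</li><li>second</li><li>third</li><li>fourth</li></ul>"
)

# Helper: run an XPath query and return the text of the matched nodes
texts <- function(xpath) html_text(html_nodes(items, xpath = xpath))

first_item  <- texts("//li[1]")               # first <li> of its parent
last_item   <- texts("//li[last()]")          # last <li> of its parent
first_two   <- texts("//li[position() <= 2]") # the first two <li> nodes
second_last <- texts("//li[last() - 1]")      # the second-to-last <li>

print(first_item)
print(last_item)
print(first_two)
print(second_last)
```

Positions are counted among the siblings under each parent, so on a page with several lists an expression such as //li[1] returns the first <li> of every list rather than a single node.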
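By default, CSS selectors match nodes at all descendant levels. In XPath expressions, // and / match different sets of nodes. More specifically, //tag refers to <tag> nodes at every descendant level, whereas /tag refers only to <tag> nodes at the first child level.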
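The difference between the two separators is easiest to see side by side. The sketch below uses a hypothetical inline snippet (not the product page) with one <p> directly under a <div> and another nested deeper, and compares each XPath form with a CSS counterpart; the helper names are our own.

```r
library(rvest)

# Hypothetical snippet: one <p> directly under the <div>, one nested deeper
snippet <- read_html(
  '<div id="outer"><p>direct child</p><ul><li><p>nested</p></li></ul></div>'
)

# Helper returning the text of the nodes matched by an XPath expression
by_xpath <- function(xpath) html_text(html_nodes(snippet, xpath = xpath))
# Helper returning the text of the nodes matched by a CSS selector
by_css <- function(css) html_text(html_nodes(snippet, css))

child_only_xpath  <- by_xpath("//div[@id = 'outer']/p")   # direct children only
descendants_xpath <- by_xpath("//div[@id = 'outer']//p")  # any depth
child_only_css    <- by_css("#outer > p")                 # CSS child combinator
descendants_css   <- by_css("#outer p")                   # CSS descendant combinator

print(child_only_xpath)
print(descendants_xpath)
```

With / only "direct child" is returned, while // returns both paragraphs; the two CSS selectors give the same pair of results.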
The following examples show how they are used.
Select all <p> nodes:
page %>% html_nodes(xpath = "//p")
## {xml_nodeset (5)}
## [1] <p>The following is a list of products</p>
## [2] <p>Description for Product-A</p>
## [3] <p>Description for Product-B</p>
## [4] <p>Description for Product-C</p>
## [5] <p>All products are available for sale!</p>
Select all <li> nodes that have a class attribute:
page %>% html_nodes(xpath = "//li[@class]")
## {xml_nodeset (1)}
## [1] <li class = "selected">\n <span class = "name">Pro ...
Select all <li> children of the <ul> inside <div id = "list">:
page %>% html_nodes(xpath = "//div[@id = 'list']/ul/li")
## {xml_nodeset (3)}
## [1] <li>\n <span class = "name">Product-A</span>\n ...
## [2] <li class = "selected">\n <span class = "name">Pro ...
## [3] <li>\n <span class = "name">Product-C</span>\n ...
Select all <span class = "name"> children of <li> nodes nested anywhere inside <div id = "list">:
page %>% html_nodes(xpath = "//div[@id = 'list']//li/span[@class = 'name']")
## {xml_nodeset (3)}
## [1] <span class = "name">Product-A</span>
## [2] <span class = "name">Product-B</span>
## [3] <span class = "name">Product-C</span>
Select all <span class = "name"> children of <li class = "selected">:
page %>%
  html_nodes(xpath = "//li[@class = 'selected']/span[@class = 'name']")
## {xml_nodeset (1)}
## [1] <span class = "name">Product-B</span>
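The examples above can also be implemented with equivalent CSS selectors. The following examples, however, cannot be expressed with the CSS selectors introduced earlier.
Select all <div> nodes that have a <p> child:
page %>% html_nodes(xpath = "//div[p]")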
## {xml_nodeset (3)}
## [1] <div class = "info bordered">\n <p>Description ...
## [2] <div class = "info">\n <p>Description for Prod ...
## [3] <div class = "info">\n <p>Description for Prod ...
Select every <span class = "info-value">Good</span>:
page %>%
  html_nodes(xpath = "//span[@class = 'info-value' and text() = 'Good']")
## {xml_nodeset (2)}
## [1] <span class = "info-value">Good</span>
## [2] <span class = "info-value">Good</span>
Select the names of all products of good quality:
page %>%
  html_nodes(xpath = "//li[div/ul/li[1]/span[@class = 'info-value' and
    text() = 'Good']]/span[@class = 'name']")
## {xml_nodeset (2)}
## [1] <span class = "name">Product-A</span>
## [2] <span class = "name">Product-C</span>
Select the names of all products whose duration is longer than 3 years:
page %>%
  html_nodes(xpath = "//li[div/ul/li[2]/span[@class = 'info-value' and
    text()>3]]/span[@class = 'name']")
## {xml_nodeset (2)}
## [1] <span class = "name">Product-A</span>
## [2] <span class = "name">Product-C</span>
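The last query works because XPath converts the text of a node to a number before comparing it with 3. A minimal sketch of the same idea on a hypothetical, much simpler list (the class name years and the variable names are assumptions made for illustration):

```r
library(rvest)

# Hypothetical list: each item carries a name and a duration in years
shelf <- read_html(paste0(
  '<ul>',
  '<li><span class="name">A</span><span class="years">5</span></li>',
  '<li><span class="name">B</span><span class="years">2</span></li>',
  '<li><span class="name">C</span><span class="years">4</span></li>',
  '</ul>'
))

# Keep the <li> nodes whose years value is greater than 3,
# then step down to their name; XPath compares the text as a number
long_lived <- html_text(html_nodes(
  shelf,
  xpath = "//li[span[@class = 'years'] > 3]/span[@class = 'name']"
))

# The same filter done in R after extraction, for comparison
all_names <- html_text(html_nodes(shelf, xpath = "//span[@class = 'name']"))
all_years <- as.numeric(
  html_text(html_nodes(shelf, xpath = "//span[@class = 'years']"))
)
long_lived_r <- all_names[all_years > 3]

print(long_lived)
print(long_lived_r)
```

Both approaches return "A" and "C". Filtering inside the XPath expression keeps the query in one place, while filtering in R may be easier to debug when the page text needs cleaning before it can be read as a number.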
XPath is very flexible and is a powerful tool for matching web page nodes. To learn more, visit http://www.w3schools.com/xsl/xpath_syntax.asp.