Example: extract the content of the following tag:
<h3 class="lister index unbold text"><span>小明他很忙</span></h3>
(1) Using XPath (this works much like XPath in Python. In R, html_text() returns the content of a tag, e.g. the content of "<span>小明他很忙</span>" is "小明他很忙"; html_attr("attribute") returns the value of an attribute):
rvest::html_nodes(webPage, xpath = '//h3[@class="lister index unbold text"]/span') %>% rvest::html_text()
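To try the two functions without a live page, here is a minimal sketch that parses the example tag from a string. The names `snippet`, `page`, `spanText`, and `h3Title` are illustrative, and the `title="demo"` attribute is an assumption added only so that html_attr() has something to return.

```r
library(rvest)
library(xml2)

# Hypothetical snippet: the example tag, plus a title attribute added for illustration
snippet <- '<html><body><h3 class="lister index unbold text" title="demo"><span>小明他很忙</span></h3></body></html>'
page <- xml2::read_html(snippet, encoding = "UTF-8")

# html_text() returns the text inside the matched <span>
spanText <- rvest::html_nodes(page, xpath = '//h3[@class="lister index unbold text"]/span') %>%
  rvest::html_text()

# html_attr() returns the value of the named attribute on the matched <h3>
h3Title <- rvest::html_nodes(page, xpath = '//h3[@class="lister index unbold text"]') %>%
  rvest::html_attr("title")

print(spanText)
print(h3Title)
```

Note that the XPath test `@class="..."` compares the whole attribute string, so the class names must appear exactly as written in the page source.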
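(2) Using a CSS selector
Before using one, we need to understand the following points:
1. In CSS, "class" is written as "." and "id" is written as "#"
2. In a CSS selector, if the class attribute contains spaces, replace each space with "." (the spaces separate several class names, and the selector chains them together)
h3 class="lister index unbold text" -> h3.lister index unbold text (the class contains spaces) -> h3.lister.index.unbold.text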
rvest::html_nodes(webPage, css = "h3.lister.index.unbold.text span") %>% rvest::html_text()
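The class-to-selector rule above can be checked on the example tag itself. This is a small sketch under the assumption that the tag is parsed from a string; `snippet`, `page`, `viaCss`, `viaXpath`, and `reordered` are illustrative names.

```r
library(rvest)
library(xml2)

# The example tag, parsed from a string instead of a live page
snippet <- '<h3 class="lister index unbold text"><span>小明他很忙</span></h3>'
page <- xml2::read_html(snippet, encoding = "UTF-8")

# Each class name is chained with "."; the space before "span" means "descendant of"
viaCss <- rvest::html_nodes(page, css = "h3.lister.index.unbold.text span") %>%
  rvest::html_text()

# The same node selected via XPath, matching the full class attribute string
viaXpath <- rvest::html_nodes(page, xpath = '//h3[@class="lister index unbold text"]/span') %>%
  rvest::html_text()

# A CSS selector does not depend on class order and may list only some of the classes
reordered <- rvest::html_nodes(page, css = "h3.text.lister span") %>%
  rvest::html_text()

print(identical(viaCss, viaXpath))
print(identical(viaCss, reordered))
```

Both comparisons print TRUE. The CSS form is the more tolerant of the two: it still matches if the page reorders or adds class names, whereas the XPath attribute test needs the exact string.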
library(pacman)
pacman::p_load(rvest, xml2)

# Load the packages
library(rvest)
library(xml2)

# Set the URL to scrape
url <- "https://www.imdb.com/search/title?count=100&release_date=2016,2016&title_type=feature"

# Fetch the page content (page source)
webPage <- xml2::read_html(x = url, encoding = "UTF-8")

# ======= Method 1: XPath ==========
# Movie titles
movieName <- rvest::html_nodes(webPage, xpath = '//h3[@class="lister-item-header"]/a/text()')

# === Note ===
# To read an attribute value, use rvest::html_attr(), e.g. rvest::html_attr("alt")
# rvest::html_nodes(webPage, xpath = '//div[@class="lister-item-image float-left"]/a/img') %>% rvest::html_attr("alt")

# Release year
year <- rvest::html_nodes(webPage, xpath = '//span[@class="lister-item-year text-muted unbold"]/text()')

# ======= Method 2: CSS selector =====
# Movie rank (as written, this selector matches the same release-year span as the XPath above)
movieRank <- rvest::html_nodes(webPage, css = "span.lister-item-year.text-muted.unbold") %>% rvest::html_text()