Web Scraping with R: How to Fill Missing Values

Date: 2019-11-08

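The web holds a huge amount of information and data, and web scraping lets us tap into it.

This time I'll practice on IMDb's list of the 100 most popular movies of 2018 and summarize how to scrape with R along the way.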

>> Preparation

Thanks to Hadley Wickham, we have the rvest package. Install and load it before scraping.

#install package
install.packages('rvest')
#loading library
library('rvest')

>> Downloading and parsing HTML file

Specify the page URL and use read_html() to parse the page into an XML object.

url <- 'https://www.imdb.com/search/title?count=100&release_date=2018-01-01,2018-12-31&view=advanced'
webpage <- read_html(url)

>> Extracting Nodes

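The data I want to collect:
1) Rank: position in the list
2) Title: movie title
3) Runtime: movie length
4) Genre: movie genre
5) Rating: audience rating
6) Metascore: critics' score
7) Description: short synopsis
8) Votes: number of audience votes
9) Gross: box-office gross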

html_nodes() extracts elements from the XML object, using a CSS selector to find the matching elements.

#Using CSS selectors to extract node
rank_data_html <- html_nodes(webpage, '.text-primary')
#Converting the node to text
rank_data <- html_text(rank_data_html)
#Converting text value to numeric value
rank_data <- as.numeric(rank_data)

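Because this relies on CSS selectors, some basic HTML/CSS knowledge helps. If you are not familiar with HTML/CSS, here is a small trick:
1) Open the page in a browser (Chrome here) and press F12 to open the developer tools
2) Click the arrow in the top-left corner of the developer tools and select the data you want to scrape on the page
3) The corresponding line of markup is highlighted in the developer tools panel
4) Check it against the CSS selector documentation to work out which selector to write

Then extract the other elements with similar scripts.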
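
Most of the other fields come back as text and need a little cleaning before as.numeric() will accept them. A minimal sketch in base R, assuming raw strings shaped like "120 min" and "1,234,567"; the sample values and variable names below are invented for illustration and are not taken from the IMDb page:

```r
# Hypothetical raw strings, invented for illustration
runtime_raw <- c("120 min", "95 min", NA)
votes_raw   <- c("1,234,567", "98,765", "432")

# Drop the " min" suffix, then convert to numeric
runtime_data <- as.numeric(gsub(" min", "", runtime_raw, fixed = TRUE))

# Drop the thousands separators, then convert to numeric
votes_data <- as.numeric(gsub(",", "", votes_raw, fixed = TRUE))

print(runtime_data)
print(votes_data)
```

An NA in the input stays NA after cleaning, so gaps are not silently turned into zeros.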

>> Handling Missing Values

After scraping, if you check the length of each result carefully, you will find that some fields have missing values. Metascore is one example:

metascore_data_html <- html_nodes(webpage,'.metascore')
metascore_data <- html_text(metascore_data_html)
length(metascore_data)

How can we represent values that are absent from the page as NA?

This is where the difference between html_node and html_nodes matters. Run ?html_node to see the help page:

html_node is like [[ it always extracts exactly one element. When given a list of nodes, html_node will always return a list of the same length, the length of html_nodes might be longer or shorter.

Put simply, we first extract a set of parent nodes that has no gaps (one per movie), then extract the child node we need from each parent in turn.

In code:

metascore_data_html <- html_node(html_nodes(webpage, '.lister-item-content'), '.metascore')
metascore_data <- html_text(metascore_data_html)
length(metascore_data)
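
To see the difference without hitting the live site, here is a small sketch on an inline HTML fragment. The fragment is invented for illustration; it only borrows the .lister-item-content and .metascore class names from the code above, and its second item deliberately has no score:

```r
library(rvest)

# Hypothetical page fragment: the second item has no .metascore element
page <- read_html('
  <div class="lister-item-content"><span class="metascore">80</span></div>
  <div class="lister-item-content"></div>
  <div class="lister-item-content"><span class="metascore">65</span></div>
')

# html_nodes() on the whole page simply skips the item without a score
flat <- html_text(html_nodes(page, ".metascore"))

# html_node() on each parent returns one result per parent, NA where absent
items <- html_nodes(page, ".lister-item-content")
per_item <- html_text(html_node(items, ".metascore"))

print(length(flat))
print(length(per_item))
print(is.na(per_item))
```

The flat vector loses track of which movie each score belongs to, while the per-item vector keeps one slot per movie and lines up with the other columns.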

>> Making a Data Frame

Once all the data is scraped, combine it into a data frame for later analysis.

movies <- data.frame(
  rank = rank_data,
  title = title_data,
  description = description_data,
  runtime = runtime_data,
  genre = genre_data,
  rating = rating_data,
  metascore = metascore_data,
  votes = votes_data,
  gross = gross_data
)
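
data.frame() needs every column to have one value per movie, so it can help to compare the vector lengths first. A minimal sketch in base R, assuming three short hypothetical vectors in place of the scraped ones (the values are invented for illustration):

```r
# Hypothetical scraped vectors, invented for illustration
cols <- list(
  rank      = c(1, 2, 3),
  title     = c("Film A", "Film B", "Film C"),
  metascore = c("80", NA, "65")
)

# Every column must have one value per film
n <- lengths(cols)
if (length(unique(n)) != 1) {
  stop("Column lengths differ: ",
       paste(names(n), n, sep = "=", collapse = ", "))
}

# All lengths agree, so the list converts cleanly
movies_check <- data.frame(cols, stringsAsFactors = FALSE)
str(movies_check)
```

If a length is off, the error message names the offending column, which points to the selector that needs the per-parent html_node() treatment.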

>> Exporting CSV File

If you don't want to start the analysis right away, you can save the data as a CSV file for later.

write.csv(movies, file = file.choose(new = TRUE), row.names = FALSE)
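
file.choose() opens an interactive dialog, which does not suit scripted runs. A non-interactive sketch, assuming a small hypothetical data frame and a temporary file path in place of the real ones:

```r
# Hypothetical two-row data frame, invented for illustration
movies_demo <- data.frame(
  rank = c(1, 2),
  title = c("Film A", "Film B"),
  metascore = c(80, NA)
)

# Write to a temporary path instead of prompting with file.choose()
out_path <- tempfile(fileext = ".csv")
write.csv(movies_demo, file = out_path, row.names = FALSE)

# Read the file back to confirm the NA survived the round trip
movies_back <- read.csv(out_path)
print(movies_back)
```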

With scraping sorted out, there are plenty of data resources on the web waiting for us to use.

>> Notes

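rvest has other useful functions worth exploring:

1) html_name(): extract a node's tag name (called html_tag() in early rvest releases)
2) html_attr(): extract one attribute of a node
3) html_attrs(): extract all attributes of a node
4) guess_encoding() and repair_encoding(): detect and repair encoding (handy when scraping Chinese web pages)
5) jump_to(), follow_link(), back(), forward(): may be useful when scraping sites that span multiple pages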
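
A short sketch of the tag and attribute helpers on an inline fragment. The HTML is invented for illustration, and the tag name is read with html_name(), the name current rvest releases use for that accessor:

```r
library(rvest)

# Hypothetical fragment, invented for illustration
page <- read_html('<p><a class="title" href="/title/abc/">Film A</a></p>')
link <- html_node(page, "a")

tag_name  <- html_name(link)          # tag name of the node
link_href <- html_attr(link, "href")  # a single attribute
all_attrs <- html_attrs(link)         # all attributes, as a named character vector

print(tag_name)
print(link_href)
print(all_attrs)
```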