Python Crawler Notes: Scraping Content with PhantomJS + Selenium

Date: 2019-11-13

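Most ordinary websites can be scraped with Python's BeautifulSoup. Some sites, however, only produce the data you want after the JavaScript on the page has run, and for those you can scrape with PhantomJS + Selenium. I ran into this problem while learning, so I am writing it down here.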

The example site I used is China Weather (http://www.weather.com.cn/weather40d/101020100.shtml).

I wanted to scrape the daily maximum temperature from Shanghai's 40-day forecast. So I first tried the usual approach:

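from bs4 import BeautifulSoup
from urllib.request import urlopen

# Download the raw HTML; no JavaScript is executed at this point.
html = urlopen('http://www.weather.com.cn/weather40d/101020100.shtml')
# Name the parser explicitly to avoid BeautifulSoup's "no parser specified" warning.
html_parse = BeautifulSoup(html, "html.parser")
# Collect every <span class="max"> tag, expected to hold a day's maximum temperature.
temp = html_parse.find_all("span", {"class": "max"})
print(temp)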

However, print(temp) output only a list of empty span tags, with no temperature values inside them.
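
A quick way to confirm this kind of result is to check whether any of the matched tags actually contain text. The sketch below is only an illustration: the two HTML strings are made-up samples, not the real markup of the weather page, and the helper name max_temps is hypothetical.

```python
from bs4 import BeautifulSoup

# Made-up sample: what a server might send before any JavaScript runs.
STATIC_HTML = '<ul><li><span class="max"></span></li><li><span class="max"></span></li></ul>'
# Made-up sample: the same markup after a script has filled in the values.
RENDERED_HTML = '<ul><li><span class="max">9</span></li><li><span class="max">12</span></li></ul>'


def max_temps(html):
    """Return the text of every non-empty <span class="max"> in html (hypothetical helper)."""
    soup = BeautifulSoup(html, "html.parser")
    tags = soup.find_all("span", {"class": "max"})
    texts = (tag.get_text(strip=True) for tag in tags)
    # Keep only the tags that actually contain something.
    return [text for text in texts if text]


print(max_temps(STATIC_HTML))    # the tags exist, but none of them holds a value
print(max_temps(RENDERED_HTML))  # values appear once the markup has been filled in
```

If the helper returns an empty list even though the tags were found, the values are most likely written into the page later by a script.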

From this I concluded that the data only becomes available after the page's JavaScript has run, so I switched to PhantomJS + Selenium to fetch it. The code is as follows:

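from bs4 import BeautifulSoup
from selenium import webdriver
import time
# Note: webdriver.PhantomJS needs Selenium 3.x or earlier; it was removed in Selenium 4.
driver = webdriver.PhantomJS(executable_path='F:\\python\\phantomjs-2.1.1-windows\\phantomjs-2.1.1-windows\\bin\\phantomjs.exe')
driver.get("http://www.weather.com.cn/weather40d/101020100.shtml")
time.sleep(3)  # give the page's JavaScript time to run
pageSource = driver.page_source  # rendered HTML as a string
html_parse = BeautifulSoup(pageSource, "html.parser")
temp = html_parse.find_all("span", {"class": "max"})
print(temp)
driver.quit()  # shut down the PhantomJS process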

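This code creates a new Selenium WebDriver and uses it to load the page. Because the JavaScript needs time to run after the page loads, we give it 3 seconds (time.sleep(3)). After that, since I personally prefer BeautifulSoup, and WebDriver's page_source property returns the page's source as a string, I read page_source and pass it to BeautifulSoup (the pageSource and html_parse lines) so the content can be parsed the familiar way. The final output is a list of span tags that now contain temperatures (9, 9, ...... 12, 12), followed by a few tags that are still empty, so the data can essentially be retrieved.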
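
A fixed 3-second pause can be too short on a slow connection and wasteful on a fast one. As a possible alternative, here is a minimal sketch of a polling loop that stops as soon as a condition holds. It is not part of the original script: wait_until and fake_page_ready are hypothetical names, and the fake condition only stands in for a real check on the page.

```python
import time


def wait_until(condition, timeout=10.0, interval=0.5):
    """Call condition() until it returns a truthy value or `timeout` seconds pass.

    Returns the truthy value, or None if the timeout expires first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            return None
        time.sleep(interval)


# Stand-in for a page that becomes ready on the third check (illustration only).
checks = {"count": 0}


def fake_page_ready():
    checks["count"] += 1
    return checks["count"] >= 3


print(wait_until(fake_page_ready, timeout=2.0, interval=0.01))
```

With a real driver, the condition could be a function that parses driver.page_source and returns the non-empty temperature values. Selenium also ships its own WebDriverWait class for this purpose, which may be the better choice in practice.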

Although this example is simple, the basic idea stays the same in more complex cases; more advanced techniques are left for further study.

If you find any errors or anything inappropriate in this post, please point them out so we can learn and improve together.