Using Selenium, a Powerful Tool for Python Web Scraping

Date: 2019-11-08


Reposted from https://www.cnblogs.com/BigFishFly/p/6380024.html

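Preface

In the previous section we learned the basics of PhantomJS. At heart it is a browser without a user interface, and what it runs are JavaScript scripts. But is that enough to write a crawler? What does it have to do with Python? Where is the promised Python crawler? We have covered all the libraries and this is what you show me? Patience, dear reader: the tool introduced next will clear up all of these doubts.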

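Overview

What is Selenium? In one sentence: a tool for automated testing. It supports all kinds of browsers, including mainstream graphical browsers such as Chrome, Safari, and Firefox. Once the matching Selenium driver is set up for one of these browsers, you can easily test a web interface; in other words, Selenium supports these browser drivers. Come to think of it, isn't PhantomJS a browser too? Does Selenium support it? The answer is yes, so the two can work together seamlessly.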

And there is more good news: Selenium can be used from many programming languages, such as Java, C#, and Ruby. Is Python among them? Of course! Now that really is great news.

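So what now? Install the Selenium library for Python, install PhantomJS, and you have Python + Selenium + PhantomJS working together seamlessly. PhantomJS renders the page and executes its JavaScript, Selenium drives the browser and connects it to Python, and Python does the post-processing: a perfect trio.

Some people ask: why use the headless PhantomJS instead of a regular browser? The answer: efficiency.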

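Selenium has two versions, and the latest release at the time of writing is 2.53.1 (2016/3/22).

Selenium 2 is also known as WebDriver. Its main new feature is the integration of Selenium 1.0 and WebDriver (WebDriver used to be a competitor of Selenium). In other words, Selenium 2 is the merger of the Selenium and WebDriver projects: it is compatible with Selenium 1 and supports both the Selenium API and the WebDriver API.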

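For more details, see the introduction to WebDriver.

With the description above, we should now have a rough idea of what Selenium is. Next, let's step into the new world of scraping dynamic pages.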

This article draws on the following references:

The Selenium website and the Selenium Python documentation

Installation

First, install Selenium:

pip install selenium

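Alternatively, download the source code, unpack it, and run the following command to install it:

python setup.py install

Once it is installed, we can start exploring how to scrape with it.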

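Quick Start

First Impressions

Let's start with a small example to get a feel for Selenium. Here we test with the Chrome browser so that the effect is easy to see; when it comes to real scraping, simply switch back to PhantomJS.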

from selenium import webdriver

browser = webdriver.Chrome()

browser.get('http://www.baidu.com/')

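Running this code automatically opens the browser and then visits Baidu.

If the program fails and no browser opens, then either Chrome is not installed or the Chrome driver is not on your PATH. Download the browser driver and add the directory that contains it to the PATH environment variable.

For example, I am on Mac OS, so I simply put the downloaded file in the /usr/bin directory; any directory that is already on your PATH works.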

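Simulating a Submission

The code below simulates submitting a search: it waits for the page to finish loading, types text into the search box, and submits it by pressing Enter.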

from selenium import webdriver

from selenium.webdriver.common.keys import Keys

driver = webdriver.Chrome()

driver.get("http://www.python.org")

assert "Python" in driver.title

elem = driver.find_element_by_name("q")

elem.send_keys("pycon")

elem.send_keys(Keys.RETURN)

print(driver.page_source)

Again, run it in Chrome to get a feel for it.

The driver.get method navigates to the page given by the URL. WebDriver waits until the page has fully loaded (that is, the "onload" event has fired) before returning control to your script, so the program continues only after the page content has loaded and its JavaScript has run. Note that if the page uses a lot of Ajax on load, WebDriver may not know when it has completely finished loading.

WebDriver offers a number of ways to find elements through the find_element_by_* methods. For example, a text input can be located by its name attribute using find_element_by_name.

Next we send keys, which is similar to typing on your keyboard: we enter the text and then simulate pressing Enter. Special keys can be sent using the Keys class imported from selenium.webdriver.common.keys.

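Finally, and most importantly, we get the source of the page after it has been rendered: simply read the page_source attribute.

With that, we can scrape dynamically rendered pages.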
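The open, load, read page_source sequence is easy to wrap in a small helper that also makes sure the browser is shut down afterwards. The following is a minimal sketch, not part of Selenium itself: the names fetch_rendered_html and make_driver are made up for illustration, and the function relies only on the get, page_source, and quit members that this article uses.

```python
def fetch_rendered_html(make_driver, url):
    """Open `url` in a fresh browser and return the HTML after rendering.

    `make_driver` is a hypothetical parameter: any zero-argument callable
    that returns a WebDriver, for example `webdriver.Chrome`.
    """
    driver = make_driver()
    try:
        driver.get(url)            # returns once the page's onload event has fired
        return driver.page_source  # HTML of the page as currently rendered
    finally:
        driver.quit()              # always shut the browser down, even on errors
```

For example, after from selenium import webdriver, calling fetch_rendered_html(webdriver.Chrome, "http://www.baidu.com/") should return the rendered HTML of the Baidu home page and close the browser again.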

Test Cases

With the features above, we can of course use Selenium to write test cases.

import unittest
from selenium import webdriver
from selenium.webdriver.common.keys import Keys


class PythonOrgSearch(unittest.TestCase):

    def setUp(self):
        self.driver = webdriver.Chrome()

    def test_search_in_python_org(self):
        driver = self.driver
        driver.get("http://www.python.org")
        self.assertIn("Python", driver.title)
        elem = driver.find_element_by_name("q")
        elem.send_keys("pycon")
        elem.send_keys(Keys.RETURN)
        assert "No results found." not in driver.page_source

    def tearDown(self):
        self.driver.close()


if __name__ == "__main__":
    unittest.main()

Run the program: it does the same thing as before, but wrapped in the form of a standard test class.

The test case class inherits from unittest.TestCase, which is how you tell the unittest module that this is a test case. setUp is the initialization step: it is called before every test method in the class. The name of a test method must start with test so that it is discovered and run automatically. tearDown is called after every test method and is the place for all cleanup, much like a destructor. The example calls close there, but you can call quit instead. quit exits the entire browser, whereas close closes a single tab; if that is the only tab open, most browsers exit entirely by default.
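The order in which unittest calls setUp, the test methods, and tearDown can be observed without opening a browser at all. The sketch below is purely illustrative: the LifecycleDemo class, the events list, and run_lifecycle_demo are names invented for this example, and each method does nothing but record that it was called.

```python
import unittest

events = []  # records the order in which unittest calls our methods


class LifecycleDemo(unittest.TestCase):
    def setUp(self):
        events.append("setUp")        # runs before every test method

    def tearDown(self):
        events.append("tearDown")     # runs after every test method

    def test_first(self):
        events.append("test_first")

    def test_second(self):
        events.append("test_second")


def run_lifecycle_demo():
    """Run both tests and return the recorded call order."""
    del events[:]  # start from an empty log on every run
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(LifecycleDemo)
    suite.run(unittest.TestResult())
    return list(events)
```

Calling run_lifecycle_demo() shows setUp and tearDown wrapped around each of the two tests, which is why a browser created in setUp gets a fresh instance for every test method.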

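Page Operations

Interacting with the Page

Merely fetching a page is not much use; what we really want is to interact with it: click, type, and so on. The prerequisite is finding the elements in the page, and WebDriver provides various ways to do that. For example, take the following form input:

<input type="text" name="passwd" id="passwd-id" />

We can get hold of it like this: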

element = driver.find_element_by_id("passwd-id")

element = driver.find_element_by_name("passwd")

elements = driver.find_elements_by_tag_name("input")  # find_elements_* returns a list

element = driver.find_element_by_xpath("//input[@id='passwd-id']")

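You can also look up a link by its link text, but be careful: the text must match exactly, so this is not a very robust way to match.

Also note that when you use XPath and several elements match the expression, only the first match is returned. If nothing matches, a NoSuchElementException is raised.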

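Once you have the element, the next step is of course to type something into the text field, which you can do as follows:

element.send_keys("some text")

You can likewise use the Keys class to simulate pressing a particular key:

element.send_keys("and some", Keys.ARROW_DOWN)

You can call send_keys on any element you have obtained, which makes it possible to test keyboard shortcuts such as those used in Gmail. A side effect is that typing into a text field does not clear it first, so new text is appended to whatever is already there. You can clear the contents with:

element.clear()

This clears the text in the field.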
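Because send_keys appends to the existing contents, a tiny helper that always clears the field first is often convenient. This is only a sketch; replace_text is a hypothetical name, and element is assumed to be any element object that offers the clear and send_keys methods shown above.

```python
def replace_text(element, text):
    """Overwrite an input's contents instead of appending to them."""
    element.clear()          # remove whatever is already in the field
    element.send_keys(text)  # then type the new value
```

With the passwd input from earlier, replace_text(driver.find_element_by_id("passwd-id"), "secret") would leave exactly secret in the field, whatever it contained before.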

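Filling in Forms

We already know how to type into a text box, but what about other form elements? A drop-down list, for example, can be handled like this: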

element = driver.find_element_by_xpath("//select[@name='name']")
all_options = element.find_elements_by_tag_name("option")
for option in all_options:
    print("Value is: %s" % option.get_attribute("value"))
    option.click()

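This first finds the first select element, that is, the drop-down list, and then selects each of its option elements in turn. As you can see, this is not a very efficient approach.

WebDriver actually provides a class called Select that does this work for us.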

from selenium.webdriver.support.ui import Select

select = Select(driver.find_element_by_name('name'))

select.select_by_index(index)

select.select_by_visible_text("text")

select.select_by_value(value)

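As you can see, it can select by index, by value, or by visible text, which is very convenient.

What about deselecting everything? Easy:

select = Select(driver.find_element_by_id('id'))
select.deselect_all()

This clears all selections. Note that it only works on a select element that allows multiple selections.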

We can also get all currently selected options as follows:

select = Select(driver.find_element_by_xpath("xpath"))
all_selected_options = select.all_selected_options

And all available options:

options = select.options

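Once the form is filled in, you will want to submit it. How? Easy:

driver.find_element_by_id("submit").click()

This simulates a click on the submit button and submits the form.

Alternatively, you can call submit on an individual element:

element.submit()

WebDriver then walks up the DOM from that element to find the form that encloses it and submits that form. If the element is not inside a form, a NoSuchElementException is raised.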

Dragging Elements

To drag and drop, first identify the element to drag and the target element to drop it on, then use the ActionChains class:

from selenium.webdriver import ActionChains

element = driver.find_element_by_name("source")
target = driver.find_element_by_name("target")

action_chains = ActionChains(driver)
action_chains.drag_and_drop(element, target).perform()

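This drags the element from source and drops it on target.

Switching Pages

A browser will often have several windows open, so we need a way to switch between them. To switch windows:

driver.switch_to_window("windowName")

You can also use the window_handles property to get a handle for every open window. For example:

for handle in driver.window_handles:
    driver.switch_to_window(handle)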
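A common use of window_handles is switching to a window that a click has just opened. The sketch below assumes you saved the list of handles before the click; switch_to_new_window and handles_before are names invented for this example. It uses the driver.switch_to.window(...) spelling, which newer Selenium releases use in place of driver.switch_to_window(...).

```python
def switch_to_new_window(driver, handles_before):
    """Switch to the window that appeared after `handles_before` was taken.

    Returns the new window's handle, or None if no new window was opened.
    """
    for handle in driver.window_handles:
        if handle not in handles_before:
            # Older Selenium 2 code spells this driver.switch_to_window(handle).
            driver.switch_to.window(handle)
            return handle
    return None
```

In practice you would store handles_before = driver.window_handles, trigger the click that opens the new window, and then call switch_to_new_window(driver, handles_before).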

To switch to a frame:

driver.switch_to_frame("frameName.0.child")

This moves the focus to the frame named child inside the first subframe of the frame called frameName.

Handling Pop-ups

When an event you triggered makes the page show an alert, how do you handle it or read its message?

alert = driver.switch_to_alert()

This returns the alert object.
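Once you have the alert object, you can read its message through the text attribute and close it with accept or dismiss. Here is a small sketch; accept_alert is a hypothetical helper name, and it uses the driver.switch_to.alert spelling found in newer Selenium releases, which yields the same kind of object as switch_to_alert().

```python
def accept_alert(driver):
    """Return the message of the open alert and press its OK button."""
    alert = driver.switch_to.alert  # older code: driver.switch_to_alert()
    message = alert.text            # the text shown in the dialog
    alert.accept()                  # click OK; alert.dismiss() clicks Cancel
    return message
```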

History

How do we move forward and back through the browser history?

driver.forward()
driver.back()

Simple and clear.

Handling Cookies

To add a cookie for the page:

# Go to the correct domain

driver.get("http://www.example.com")

# Now set the cookie. This one's valid for the entire domain

cookie = {'name': 'foo', 'value': 'bar'}

driver.add_cookie(cookie)

To get the page's cookies:

# Go to the correct domain

driver.get("http://www.example.com")

# And now output all the available cookies for the current URL

driver.get_cookies()

That is all there is to handling cookies; again, very simple.
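get_cookies returns a list of dictionaries, one per cookie, each with at least a name and a value key. If you want to reuse those cookies elsewhere, for instance in a plain HTTP client, it helps to flatten that list. The two helpers below are a sketch with made-up names (cookies_to_dict and cookie_header); they operate on the returned list only and need no browser.

```python
def cookies_to_dict(cookies):
    """Collapse WebDriver's list of cookie dicts into a {name: value} mapping."""
    return {cookie["name"]: cookie["value"] for cookie in cookies}


def cookie_header(cookies):
    """Format the same cookies as the value of an HTTP Cookie header."""
    return "; ".join("%s=%s" % (c["name"], c["value"]) for c in cookies)
```

For example, cookies_to_dict(driver.get_cookies()) gives a dictionary that many HTTP libraries accept directly as their cookies argument.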

Selecting Elements

The following APIs are available for selecting elements.

To select a single element:

find_element_by_id

find_element_by_name

find_element_by_xpath

find_element_by_link_text

find_element_by_partial_link_text

find_element_by_tag_name

find_element_by_class_name

find_element_by_css_selector

To select multiple elements:

find_elements_by_name

find_elements_by_xpath

find_elements_by_link_text

find_elements_by_partial_link_text

find_elements_by_tag_name

find_elements_by_class_name

find_elements_by_css_selector

You can also use the By class to specify the selection strategy:

from selenium.webdriver.common.by import By

driver.find_element(By.XPATH, '//button[text()="Some text"]')

driver.find_elements(By.XPATH, '//button')

Some attributes of the By class:

ID = "id"

XPATH = "xpath"

LINK_TEXT = "link text"

PARTIAL_LINK_TEXT = "partial link text"

NAME = "name"

TAG_NAME = "tag name"

CLASS_NAME = "class name"

CSS_SELECTOR = "css selector"
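Since the By attributes are plain strings, a locator can be stored and passed around as an ordinary (strategy, value) pair. The sketch below builds on that; STRATEGIES and locate are names made up for this example, and the check exists only to catch a misspelled strategy before it reaches the browser.

```python
# Strategy names copied from the By attributes listed above.
STRATEGIES = {
    "id", "xpath", "link text", "partial link text",
    "name", "tag name", "class name", "css selector",
}


def locate(driver, how, value):
    """Find one element, rejecting unknown strategy names early."""
    if how not in STRATEGIES:
        raise ValueError("unknown locator strategy: %r" % (how,))
    return driver.find_element(how, value)  # same call as find_element(By.X, value)
```

For instance, locate(driver, "name", "passwd") is equivalent to driver.find_element(By.NAME, "passwd").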

For more detail on locating elements, see the official documentation.

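Waiting for Pages

This part is very important. More and more web pages use Ajax, so the program cannot be sure when a given element has finished loading. This makes locating elements harder and raises the chance of an ElementNotVisibleException.

Selenium therefore offers two kinds of waits: implicit and explicit.

An implicit wait tells WebDriver to keep polling the DOM for up to a set amount of time whenever it looks for an element; an explicit wait names a condition and lets execution continue once that condition holds.

Explicit Waits

An explicit wait specifies a condition and a maximum wait time. If the element has still not been found when the time is up, a TimeoutException is raised.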

from selenium import webdriver

from selenium.webdriver.common.by import By

from selenium.webdriver.support.ui import WebDriverWait

from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()

driver.get("http://somedomain/url_that_delays_loading")

try:
    element = WebDriverWait(driver, 10).until(
        EC.presence_of_element_located((By.ID, "myDynamicElement"))
    )
finally:
    driver.quit()

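By default, WebDriverWait calls the condition every 500 ms to check whether the element has appeared; if the element is already present, it returns immediately.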
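The polling behaviour can be pictured with a few lines of plain Python. This is a simplified sketch of the idea, not Selenium's actual implementation: wait_until, condition, and poll_interval are names invented for illustration, and it raises Python's built-in TimeoutError where Selenium raises its own TimeoutException.

```python
import time


def wait_until(condition, timeout, poll_interval=0.5):
    """Call `condition()` repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds
    have passed without the condition being met.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result          # condition met: return straight away
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %s seconds" % timeout)
        time.sleep(poll_interval)  # pause before checking again
```

The condition is checked once before any sleeping happens, which mirrors the point above: when the element already exists, the wait costs no extra time.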

Below are some built-in wait conditions; you can call them directly instead of writing your own.

title_is

title_contains

presence_of_element_located

visibility_of_element_located

visibility_of

presence_of_all_elements_located

text_to_be_present_in_element

text_to_be_present_in_element_value

frame_to_be_available_and_switch_to_it

invisibility_of_element_located

element_to_be_clickable – it is Displayed and Enabled.

staleness_of

element_to_be_selected

element_located_to_be_selected

element_selection_state_to_be

element_located_selection_state_to_be

alert_is_present