Automated Testing: Log In and Authorize with Selenium, Then Fetch Content with Requests

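This post uses Selenium to log in to a website automatically and take a screenshot, then uses Requests to fetch page content that is only available after login. Let's walk through it.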

  • Selenium: an umbrella project for a range of tools and libraries that support web browser automation.
  • Requests: the only Non-GMO HTTP library for Python, safe for human consumption.

Why use Selenium for the automated login?

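Using Selenium is equivalent to simulating a user who opens the browser and logs in by hand.

Compared with logging in through direct HTTP requests, this has several advantages: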

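  1. It sidesteps the complexities of the login form (iframes, AJAX, and so on), so you don't have to analyze the details.

    • With Selenium, you simply follow the steps a user would take.
  2. It avoids the HTTP-level details of logging in, such as forging headers and tracking cookies.

    • With Selenium, you rely on the browser's own behavior.
  3. It makes it easier to wait for pages to load, detect special cases (login verification and the like), and add further logic.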

As a bonus, watching the automated login happen on screen tends to impress non-technical onlookers.

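Why use Requests to fetch the page content?

The goal is to fetch specific content after login, not to crawl the whole site, and for that Requests is sufficient and pleasant to use.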

1) Prepare Selenium

Base environment: Python 3.7.4 (anaconda3-2019.10)

Install Selenium with pip:

pip install selenium

Check the Selenium version:

$ python
Python 3.7.4 (default, Aug 13 2019, 15:17:50)
[Clang 4.0.1 (tags/RELEASE_401/final)] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import selenium
>>> print('Selenium version is {}'.format(selenium.__version__))
Selenium version is 3.141.0

2) Prepare the Browser and Its Driver

Download and install the Google Chrome browser:
https://www.google.com/chrome/

Download the Chromium/Chrome WebDriver:
https://chromedriver.storage....

Then add the WebDriver path to PATH, for example:

# macOS, Linux
echo 'export PATH=$PATH:/opt/WebDriver/bin' >> ~/.profile

# Windows
setx /m path "%path%;C:\WebDriver\bin\"
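
Before launching the browser, it can help to confirm that the driver is actually discoverable on PATH. The following is a minimal sketch using only the standard library; the helper name `find_driver` is hypothetical, and `chromedriver` is assumed to be the name of the driver executable.

```python
import shutil


def find_driver(name="chromedriver"):
    """Return the full path of the driver executable found on PATH, or None."""
    return shutil.which(name)


if __name__ == "__main__":
    path = find_driver()
    if path is None:
        print("chromedriver not found on PATH; check the PATH setup")
    else:
        print("using driver at {}".format(path))
```

If the check prints a path, `webdriver.Chrome()` should be able to locate the same executable without further configuration.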

3) Go coding!

Read the login configuration

Login credentials are private, so we read them from a JSON config file:

# load config
import json
from types import SimpleNamespace as Namespace

secret_file = 'secrets/douban.json'
# {
#   "url": {
#     "login": "https://www.douban.com/",
#     "target": "https://www.douban.com/mine/"
#   },
#   "account": {
#     "username": "username",
#     "password": "password"
#   }
# }
with open(secret_file, 'r', encoding='utf-8') as f:
  config = json.load(f, object_hook=lambda d: Namespace(**d))

login_url = config.url.login
target_url = config.url.target
username = config.account.username
password = config.account.password
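
The `object_hook` turns every JSON object into a namespace, which is why `config.url.login` works with attribute access. A missing key then only surfaces later as an `AttributeError`. The sketch below shows one possible way to fail early; the helper `parse_config` and the `REQUIRED` table are hypothetical and not part of the original script, and the sample values simply mirror the commented example above.

```python
import json
from types import SimpleNamespace as Namespace

# Hypothetical table of the sections and keys the script relies on.
REQUIRED = {"url": ("login", "target"), "account": ("username", "password")}


def parse_config(text):
    """Parse JSON text into nested namespaces and check the required keys."""
    config = json.loads(text, object_hook=lambda d: Namespace(**d))
    for section, keys in REQUIRED.items():
        node = getattr(config, section, None)
        if node is None:
            raise ValueError("missing section: {}".format(section))
        for key in keys:
            if not hasattr(node, key):
                raise ValueError("missing key: {}.{}".format(section, key))
    return config


sample = """
{
  "url": {
    "login": "https://www.douban.com/",
    "target": "https://www.douban.com/mine/"
  },
  "account": {"username": "username", "password": "password"}
}
"""
config = parse_config(sample)
print(config.url.login)
```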

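Automated login with Selenium

This uses the Chrome WebDriver, with Douban as the test site for logging in.

Open the login page, enter the username and password automatically, and log in: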

# automated testing
import sys  # needed for sys.exit() in the timeout handlers

from selenium import webdriver

# Chrome Start
opt = webdriver.ChromeOptions()
driver = webdriver.Chrome(options=opt)
# Chrome opens with 「Data;」 with selenium
#   https://stackoverflow.com/questions/37159684/chrome-opens-with-data-with-selenium
# Chrome End

# driver.implicitly_wait(5)

from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
wait = WebDriverWait(driver, 5)

print('open login page ...')
driver.get(login_url)
driver.switch_to.frame(driver.find_elements_by_tag_name("iframe")[0])

driver.find_element_by_css_selector('li.account-tab-account').click()
driver.find_element_by_name('username').send_keys(username)
driver.find_element_by_name('password').send_keys(password)
driver.find_element_by_css_selector('.account-form .btn').click()
try:
  wait.until(EC.presence_of_element_located((By.ID, "content")))
except TimeoutException:
  driver.quit()
  sys.exit('open login page timeout')
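
The `find_element_by_*` helpers used above exist in Selenium 3.141.0, the version this post targets, but they were removed in Selenium 4.3.0. As a rough sketch of how the same steps could be written to run on both major versions, the hypothetical helper `login` below uses the generic `find_element(by, value)` call. The locator strings are the values of `By.TAG_NAME`, `By.CSS_SELECTOR` and `By.NAME`; the selectors are the ones from the script above and may no longer match the live site.

```python
def login(driver, username, password):
    """Fill in and submit the account login form inside the first iframe.

    Relies only on driver.find_element(by, value) and
    driver.find_elements(by, value), which exist in Selenium 3 and 4.
    """
    frames = driver.find_elements("tag name", "iframe")
    if not frames:
        raise RuntimeError("login iframe not found")
    driver.switch_to.frame(frames[0])

    # Switch to the username/password tab, fill in the form, and submit it.
    driver.find_element("css selector", "li.account-tab-account").click()
    driver.find_element("name", "username").send_keys(username)
    driver.find_element("name", "password").send_keys(password)
    driver.find_element("css selector", ".account-form .btn").click()
```

Calling `login(driver, username, password)` right after `driver.get(login_url)` would replace the five lookup lines above; the `wait.until(...)` check stays unchanged.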

If you use Internet Explorer instead, set it up as follows:

# Ie Start
# Selenium Click is not working with IE11 in Windows 10
#   https://github.com/SeleniumHQ/selenium/issues/4292
opt = webdriver.IeOptions()
opt.ensure_clean_session = True
opt.ignore_protected_mode_settings = True
opt.ignore_zoom_level = True
opt.initial_browser_url = login_url
opt.native_events = False
opt.persistent_hover = True
opt.require_window_focus = True
driver = webdriver.Ie(options = opt)
# Ie End

To set further capabilities, you can do this:

cap = opt.to_capabilities()
cap['acceptInsecureCerts'] = True
cap['javascriptEnabled'] = True

Open the target page and take a screenshot

print('open target page ...')
driver.get(target_url)
try:
  wait.until(EC.presence_of_element_located((By.ID, "board")))
except TimeoutException:
  driver.quit()
  sys.exit('open target page timeout')

# save screenshot
driver.save_screenshot('target.png')
print('saved to target.png')

Replicate the cookies in Requests and request the HTML

# save html
import requests

requests_session = requests.Session()
selenium_user_agent = driver.execute_script("return navigator.userAgent;")
requests_session.headers.update({"user-agent": selenium_user_agent})
for cookie in driver.get_cookies():
  requests_session.cookies.set(cookie['name'], cookie['value'], domain=cookie['domain'])

# driver.delete_all_cookies()
driver.quit()

resp = requests_session.get(target_url)
resp.encoding = resp.apparent_encoding
# resp.encoding = 'utf-8'
print('status_code = {0}'.format(resp.status_code))
with open('target.html', 'w+', encoding='utf-8') as fout:
  fout.write(resp.text)

print('saved to target.html')
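
The loop above copies only each cookie's name, value, and domain. If the target site depends on path-scoped, secure, or expiring cookies, the remaining attributes may matter as well. Here is a minimal sketch, assuming cookie dicts shaped like those returned by `driver.get_cookies()`; the helper `cookie_kwargs` and the sample cookie are hypothetical. The keys it emits (`domain`, `path`, `secure`, `expires`) are keyword arguments accepted by `requests_session.cookies.set()`.

```python
def cookie_kwargs(cookie):
    """Map a Selenium cookie dict to keyword arguments for cookies.set()."""
    kwargs = {}
    if cookie.get("domain"):
        kwargs["domain"] = cookie["domain"]
    if cookie.get("path"):
        kwargs["path"] = cookie["path"]
    if "secure" in cookie:
        kwargs["secure"] = bool(cookie["secure"])
    if cookie.get("expiry") is not None:
        # Selenium reports expiry as seconds since the epoch.
        kwargs["expires"] = int(cookie["expiry"])
    return kwargs


# Intended use with the session from the script (not executed here):
#   for c in driver.get_cookies():
#       requests_session.cookies.set(c["name"], c["value"], **cookie_kwargs(c))

# Hypothetical sample cookie, for illustration only.
example = {"name": "sid", "value": "abc", "domain": ".douban.com",
           "path": "/", "secure": True, "httpOnly": True, "expiry": 1700000000}
print(cookie_kwargs(example))
```

Keys that Selenium reports but the helper does not map, such as `httpOnly` and `sameSite`, are deliberately dropped, since passing unknown keyword arguments to `cookies.set()` raises an error.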

4) Run the Test

You can also add the WebDriver path to PATH temporarily:

# macOS, Linux
export PATH=$(pwd)/drivers:$PATH

# Windows
set PATH=%cd%\drivers;%PATH%

Run the Python script; the output looks like this:

$ python douban.py
Selenium version is 3.141.0
--------------------------------------------------------------------------------
open login page ...
open target page ...
saved to target.png
status_code = 200
saved to target.html

The screenshot target.png and the HTML content target.html look like this:

douban_result.png

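Conclusion

What if the login process runs into a verification step?

  1. Slider verification can be simulated with Selenium

    • The slide distance can be determined with an image-gradient algorithm
  2. Image or text verification can be recognized with a Python AI library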
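
To give a feel for the image-gradient idea, here is a toy sketch that finds the column where brightness changes most sharply from left to right. It is an illustration under strong assumptions, not the method used for any real site: the image is assumed to be already loaded as a list of rows of grayscale values, and the function name `find_gap_x` is hypothetical.

```python
def find_gap_x(gray):
    """Return the column index with the strongest horizontal intensity change.

    gray is a list of rows, each a list of grayscale values (0-255).
    """
    if not gray or len(gray[0]) < 2:
        raise ValueError("image must have at least two columns")
    width = len(gray[0])
    # For each column x >= 1, sum |I(x) - I(x-1)| over all rows.
    scores = [
        sum(abs(row[x] - row[x - 1]) for row in gray)
        for x in range(1, width)
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best + 1  # scores[0] corresponds to column 1


# Tiny synthetic image: bright on the left, dark from column 5 onward.
image = [[200] * 5 + [50] * 3 for _ in range(3)]
print(find_gap_x(image))
```

A real implementation would typically load the image with an imaging library and smooth it first; this sketch only shows the column-wise gradient step.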

References

Gist for the code in this post:
https://gist.github.com/ikuok...


