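Automatic website login and screenshots with Selenium, plus fetching post-login page content with Requests. Let's take a look.
Logging in with Selenium amounts to simulating a user who opens a browser by hand and logs in.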
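Compared with logging in through direct HTTP requests, this has several advantages:
It sidesteps the complexities of the login form (iframes, AJAX, and so on), so there is no need to analyze how they work.
It avoids the HTTP-level details of a login, such as simulating headers and tracking cookies.
In addition, a visible, automated login looks rather impressive to non-specialists.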
For fetching a few specific pages after login, as opposed to crawling the whole site, Requests is sufficient and easy to use.
Base environment: Python 3.7.4 (anaconda3-2019.10)
Install Selenium with pip:
pip install selenium
Check the Selenium version:
$ python
Python 3.7.4 (default, Aug 13 2019, 15:17:50)
[Clang 4.0.1 (tags/RELEASE_401/final)] :: Anaconda, Inc. on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import selenium
>>> print('Selenium version is {}'.format(selenium.__version__))
Selenium version is 3.141.0
Download and install the Google Chrome browser:
https://www.google.com/chrome/
Download the Chromium/Chrome WebDriver:
https://chromedriver.storage....
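Then add the WebDriver directory to PATH, for example:
# macOS, Linux: append the export line to ~/.profile so new shells pick it up
echo 'export PATH=$PATH:/opt/WebDriver/bin' >> ~/.profile
# Windows: no trailing backslash before the closing quote, or the quote gets escaped
setx /m path "%path%;C:\WebDriver\bin"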
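Before launching the browser, it can help to confirm that the driver is actually discoverable on PATH. The following is a minimal sketch using only the standard library; the helper name `find_driver` is an assumption made for illustration and is not part of Selenium.

```python
import shutil


def find_driver(name="chromedriver"):
    """Return the full path of the driver executable found on PATH, or None."""
    # shutil.which searches the directories listed in PATH, much like a shell.
    return shutil.which(name)


if __name__ == "__main__":
    path = find_driver()
    if path is None:
        print("chromedriver not found on PATH")
    else:
        print("chromedriver found at {}".format(path))
```

If the check reports that the driver is missing, open a new terminal first: changes written to `~/.profile` or made with `setx` only apply to sessions started afterwards.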
Login credentials are private, so we read them from a JSON config file:
# load config
import json
from types import SimpleNamespace as Namespace

secret_file = 'secrets/douban.json'
# {
#   "url": {
#     "login": "https://www.douban.com/",
#     "target": "https://www.douban.com/mine/"
#   },
#   "account": {
#     "username": "username",
#     "password": "password"
#   }
# }
with open(secret_file, 'r', encoding='utf-8') as f:
    config = json.load(f, object_hook=lambda d: Namespace(**d))

login_url = config.url.login
target_url = config.url.target
username = config.account.username
password = config.account.password
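With this approach, a missing key in the JSON file only surfaces later as an `AttributeError`. As a minimal sketch, the loader below checks the expected fields up front. The function name `load_config` and the `REQUIRED` tuple are assumptions made for illustration; the field names follow the JSON layout shown in the comment above.

```python
import json
from types import SimpleNamespace

# Dotted paths that the script reads later (hypothetical helper constant).
REQUIRED = ("url.login", "url.target", "account.username", "account.password")


def load_config(path):
    """Load a JSON file into nested namespaces and check the required fields."""
    with open(path, "r", encoding="utf-8") as f:
        config = json.load(f, object_hook=lambda d: SimpleNamespace(**d))
    for dotted in REQUIRED:
        node = config
        # Walk down the namespace one attribute at a time.
        for part in dotted.split("."):
            if not hasattr(node, part):
                raise KeyError("missing config field: {}".format(dotted))
            node = getattr(node, part)
    return config
```

Calling `load_config('secrets/douban.json')` would then either return the same namespace object as before or fail immediately with the name of the missing field.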
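The implementation uses Chrome WebDriver, and the test site for the login is Douban.
Open the login page, fill in the username and password automatically, and log in: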
# automated testing
import sys

from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# Chrome Start
opt = webdriver.ChromeOptions()
driver = webdriver.Chrome(options=opt)
# Chrome opens with "Data;" with selenium
# https://stackoverflow.com/questions/37159684/chrome-opens-with-data-with-selenium
# Chrome End

# driver.implicitly_wait(5)
wait = WebDriverWait(driver, 5)

print('open login page ...')
driver.get(login_url)
# The login form lives in the first iframe on the page
driver.switch_to.frame(driver.find_elements(By.TAG_NAME, 'iframe')[0])

# find_element(By..., ...) works in Selenium 3.141.0 and in 4.x,
# where the find_element_by_* helpers were later removed
driver.find_element(By.CSS_SELECTOR, 'li.account-tab-account').click()
driver.find_element(By.NAME, 'username').send_keys(username)
driver.find_element(By.NAME, 'password').send_keys(password)
driver.find_element(By.CSS_SELECTOR, '.account-form .btn').click()

try:
    wait.until(EC.presence_of_element_located((By.ID, "content")))
except TimeoutException:
    driver.quit()
    sys.exit('open login page timeout')
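The explicit wait used above works by polling: the condition is called repeatedly until it returns a truthy value or the timeout expires. The sketch below reproduces that idea in plain Python to show the control flow. `wait_until` and `WaitTimeout` are hypothetical names, and this is an illustration of the technique, not Selenium's actual implementation.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never became truthy (hypothetical name)."""


def wait_until(condition, timeout=5.0, poll=0.5):
    """Call condition() every `poll` seconds until it returns a truthy value."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise WaitTimeout("condition not met within {} s".format(timeout))
        time.sleep(poll)


if __name__ == "__main__":
    # Simulate an element that only "appears" on the third check.
    calls = {"n": 0}

    def element_present():
        calls["n"] += 1
        return "element" if calls["n"] >= 3 else None

    print(wait_until(element_present, timeout=2.0, poll=0.1))
```

In the login script, `EC.presence_of_element_located((By.ID, "content"))` plays the role of `condition`; Selenium additionally passes the driver to it on every call.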
If you use Internet Explorer instead, set it up as follows:
# Ie Start
# Selenium Click is not working with IE11 in Windows 10
# https://github.com/SeleniumHQ/selenium/issues/4292
opt = webdriver.IeOptions()
opt.ensure_clean_session = True
opt.ignore_protected_mode_settings = True
opt.ignore_zoom_level = True
opt.initial_browser_url = login_url
opt.native_events = False
opt.persistent_hover = True
opt.require_window_focus = True
driver = webdriver.Ie(options=opt)
# Ie End
To set additional capabilities, you can do the following:
cap = opt.to_capabilities()
cap['acceptInsecureCerts'] = True
cap['javascriptEnabled'] = True
After logging in, open the target page and save a screenshot:
print('open target page ...')
driver.get(target_url)
try:
    wait.until(EC.presence_of_element_located((By.ID, "board")))
except TimeoutException:
    driver.quit()
    sys.exit('open target page timeout')

# save screenshot
driver.save_screenshot('target.png')
print('saved to target.png')
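Then hand the browser's User-Agent and cookies over to a Requests session and save the HTML:
# save html
import requests

requests_session = requests.Session()
# Reuse the browser's User-Agent so the requests look like the same client
selenium_user_agent = driver.execute_script("return navigator.userAgent;")
requests_session.headers.update({"user-agent": selenium_user_agent})
# Copy the logged-in cookies from the browser into the Requests session
for cookie in driver.get_cookies():
    requests_session.cookies.set(cookie['name'], cookie['value'], domain=cookie['domain'])
# driver.delete_all_cookies()
driver.quit()

resp = requests_session.get(target_url)
resp.encoding = resp.apparent_encoding
# resp.encoding = 'utf-8'
print('status_code = {0}'.format(resp.status_code))
# Write with an explicit encoding so the platform default does not break non-ASCII text
with open('target.html', 'w', encoding='utf-8') as fout:
    fout.write(resp.text)
print('saved to target.html')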
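The cookie loop in this step copies only the name, value, and domain of each cookie. If the path, secure flag, or expiry also matter for the target site, they can be carried over as well. The sketch below assumes the cookie dictionaries have the shape returned by `driver.get_cookies()` and maps them to keyword arguments for `requests_session.cookies.set`; the helper name `to_requests_kwargs` is hypothetical.

```python
def to_requests_kwargs(cookie):
    """Map one Selenium cookie dict to keyword arguments for cookies.set()."""
    kwargs = {
        "name": cookie["name"],
        "value": cookie["value"],
        "domain": cookie.get("domain", ""),
        "path": cookie.get("path", "/"),
        "secure": bool(cookie.get("secure", False)),
    }
    # Selenium calls the field "expiry"; Requests expects "expires".
    # Session cookies carry no expiry, so the key is left out for them.
    if "expiry" in cookie:
        kwargs["expires"] = int(cookie["expiry"])
    return kwargs


if __name__ == "__main__":
    sample = {
        "name": "sid",
        "value": "abc",
        "domain": ".example.com",
        "path": "/",
        "secure": True,
        "httpOnly": True,
        "expiry": 1700000000,
    }
    print(to_requests_kwargs(sample))
```

Inside the loop, the call would then become `requests_session.cookies.set(**to_requests_kwargs(cookie))`. The `httpOnly` field is dropped on purpose, since it only restricts script access in a browser.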
Alternatively, add the WebDriver directory to PATH for the current session only:
# macOS, Linux
export PATH=$(pwd)/drivers:$PATH
# Windows
set PATH=%cd%\drivers;%PATH%
Run the Python script; the output looks like this:
$ python douban.py
Selenium version is 3.141.0
--------------------------------------------------------------------------------
open login page ...
open target page ...
saved to target.png
status_code = 200
saved to target.html
The results are the screenshot target.png and the HTML content target.html.
What if the login process runs into a verification challenge?
Slider verification can be simulated with Selenium.
The code for this article is available as a Gist:
https://gist.github.com/ikuok...