Getting Bilibili's slider CAPTCHA images with Python

Date: 2019-12-18


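This article only covers obtaining the CAPTCHA images, implemented with Python + Selenium.

Processing the images and computing the slider offset are already covered by ready-made solutions online. Since a Bilibili update, however, obtaining the images works completely differently from before: they can no longer be read directly from the HTML.

The process was rather tortuous, so I am writing it down; this may be a long post.

I will go through it in the order of my analysis. The initial analysis turned out to be flawed: it did produce images, but they did not match the CAPTCHA currently shown on the page.

That experience led to the later method, which does successfully obtain the current images.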

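Part 1 (you can skip straight to Part 2, which is the approach that tested OK)

Result of the analysis: there are two parameters, challenge and gt. gt is fixed at b6cc0fc51ec7995d8fd3c637af690de3, while challenge differs on every request, so the key is challenge.

1. The story begins with the combine endpoint: it takes no request parameters and returns a challenge string.

2. The get.php endpoint takes many request parameters, but the only useful one is challenge; it returns the all-important CAPTCHA image URLs.

3. However, although it looks as if combine → get is enough to obtain the image URLs,

in practice a second request to get returns no data, only an error message saying the parameter is old. Does it need to be requested with a new challenge? But where would the new one come from?

4. Just as I was stuck, the CAPTCHA refreshed and issued reset and refresh requests. refresh, like get, returns image URLs, so things started to look up.

The request parameter of refresh is exactly the challenge returned by get, and most importantly the refresh endpoint can be requested repeatedly, returning image URLs each time (this is where the story plants its foreshadowing).

Since a new challenge seemed to be required, I tried the one returned by refresh; it did not work and still reported old_challenge.

So I changed tack: use the challenge from get, but call the refresh endpoint. Would that return the image URLs of the get endpoint? It actually did return image URLs.

So I concluded that the real image endpoint was refresh, and that all I needed was the challenge from get, which I had already started implementing.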

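Then it was time to write the code...

coding...

Done.

Time to test: the images were indeed saved locally, and the whole process ran without problems.

Then I opened the images to take a look.

Huh!? That's wrong: the images differ from the ones on the page.

Only then did I realise that the logic I had worked out earlier was flawed.

Analysing...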

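Testing showed that refresh, called with the same challenge, returns a different image URL every time. So the backend presumably returns images at random and does not keep the images it generates; they are all temporary, so naturally the URL differs each time. In that case refresh is not the image endpoint at all: get and refresh are essentially the same, and both return random images.

(As for why get can only be requested once and why the error message is old_challenge, I don't know. There is actually another thread here, the reset endpoint, which appears together with refresh: its first request carries the parameter returned by combine and returns a new challenge and s. My guess is that the JS derives a new challenge from these two values, and only a request carrying that new challenge gets data.) (c is also suspicious: the CAPTCHA image is split into 52 pieces; could these values be related to the ordering?)

So this road was a dead end, and I had to try other approaches.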

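Part 2: Getting the responses to requests directly through Selenium

0. browsermob-proxy

I first tried browsermob-proxy, i.e. capturing the browser's request information through a proxy, but HTTPS requests could not succeed. From what I found, since it is written in Java, the Java side is better supported and the problem can probably be solved there, but after going through a lot of material I found no solution for Python.

Without the HTTPS problem, browsermob-proxy is actually quite pleasant to use: proxy.new_har("") creates a HAR (a JSON-format file), after which all request information is stored in proxy.har, which can be written straight to a file. The data is also easy to read.

Capturing HTTP request information: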

from browsermobproxy import Server
from selenium import webdriver
import json

# Path to browsermob-proxy.bat
server = Server(r"xxx.\browsermob-proxy\bin\browsermob-proxy.bat")
server.start()
proxy = server.create_proxy()

# Create the HAR
proxy.new_har("google")
chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument("--proxy-server={0}".format(proxy.proxy))

driver = webdriver.Chrome(options=chrome_options)
driver.get("http://www.xxx.com/")
proxy.wait_for_traffic_to_stop(1, 60)

# Save the HAR contents locally (UTF-8, because ensure_ascii=False keeps non-ASCII text)
with open('1.har', 'w', encoding='utf-8') as outfile:
    json.dump(proxy.har, outfile, indent=2, ensure_ascii=False)

All request information is in the entries list. Each request is stored as a dict whose main keys are request, response and timings:

{
  "log": {
    "entries": [
      {
        "cache": {},
        "time": 569,
        "startedDateTime": "2019-09-10T16:32:33.342+0000",
        "request": {
          "method": "GET",
          "url": "http://www.yiguo.com/",
          "httpVersion": "HTTP",
          "headersSize": 0,
          "headers": [],
          "queryString": [],
          "cookies": [],
          "bodySize": 0
        },
        "response": {
          "content": {
            "size": 10693,
            "mimeType": "text/html; charset=utf-8"
          },
          "httpVersion": "HTTP",
          "headersSize": 0,
          "redirectURL": "",
          "statusText": "OK",
          "headers": [],
          "status": 200,
          "cookies": [
            {
              "name": "CityCSS",
              "value": "UnitId=1&AreaId=7bc089fd-9d27-4e5f-a2e1-65907c5a5399&UnitName=%e4%b8%8a%e6%b5%b7",
              "path": "/",
              "domain": "yiguo.com",
              "expires": "2020-09-10T16:32:35.000+0000"
            }
          ],
          "bodySize": 10693
        },
        "timings": {
          "dns": 211,
          "receive": 3,
          "connect": 82,
          "send": 0,
          "blocked": 0,
          "wait": 273
        },
        "serverIPAddress": "150.242.239.211",
        "pageref": "baidu"
      },
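To make the structure above concrete, here is a minimal sketch of walking the entries list and pulling out the method, URL and status of each request. The function name summarize_har and the trimmed sample_har dict are my own illustration (the sample reuses a few values from the excerpt above); a real proxy.har has many more fields.

```python
def summarize_har(har):
    """Return a (method, url, status) tuple for every entry in a HAR dict."""
    rows = []
    for entry in har.get("log", {}).get("entries", []):
        request = entry.get("request", {})
        response = entry.get("response", {})
        rows.append((request.get("method"), request.get("url"), response.get("status")))
    return rows


# Trimmed-down sample shaped like the excerpt above (illustrative only)
sample_har = {
    "log": {
        "entries": [
            {
                "request": {"method": "GET", "url": "http://www.yiguo.com/"},
                "response": {"status": 200, "bodySize": 10693},
                "timings": {"dns": 211, "connect": 82, "wait": 273},
            }
        ]
    }
}

for row in summarize_har(sample_har):
    print(row)
```

With a live proxy the same function could be fed proxy.har directly instead of the sample dict.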

Final approach: the API built into webdriver

1. Getting the request information

Reference: "Browser performance tests through selenium" (Stack Overflow)

First version:

import json

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

caps = DesiredCapabilities.CHROME

# This line is required; without it the performance log cannot be fetched later
caps['loggingPrefs'] = {'performance': 'ALL'}
driver = webdriver.Chrome(desired_capabilities=caps)

driver.get('https://stackoverflow.com')

# Important: fetch the browser's request information, including each request's method, headers, requestId, etc.
logs = [json.loads(log['message'])['message'] for log in driver.get_log('performance')]

# json.dump writes str, so the file must be opened in text mode
with open('devtools.json', 'w') as f:
    json.dump(logs, f)

driver.close()

In practice, however, this raises an error: invalid argument: log type 'performance' not found, i.e. the get_log('performance') call fails.

Solution: "Selenium Chrome can't see browser logs InvalidArgumentException"

Second version: add chrome_options.add_experimental_option('w3c', False)

import json

from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities

chrome_options = webdriver.ChromeOptions()
chrome_options.add_experimental_option('w3c', False)
caps = DesiredCapabilities.CHROME
caps['loggingPrefs'] = {'performance': 'ALL'}
driver = webdriver.Chrome(desired_capabilities=caps, options=chrome_options)
driver.get('https://stackoverflow.com')

# Important: fetch the browser's request information, including each request's method, headers, requestId, etc.
logs = [json.loads(log['message'])['message'] for log in driver.get_log('performance')]

# json.dump writes str, so the file must be opened in text mode
with open('devtools.json', 'w') as f:
    json.dump(logs, f)

driver.close()

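At this point we have the request information, which contains the URL of every request as well as the requestId used later.

(For the data returned by get_log('performance'), see: "How to capture request information with selenium")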
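As a rough illustration of what that log contains, the sketch below filters the entries for Network.responseReceived events and collects the requestId of every response whose URL starts with a given prefix. The helper names and the two sample entries are made up for illustration; real entries come from driver.get_log('performance') and carry many more fields.

```python
import json


def find_request_ids(performance_entries, url_prefix):
    """Return the requestIds of responses whose URL starts with url_prefix."""
    request_ids = []
    for entry in performance_entries:
        # Each entry's 'message' is a JSON string wrapping one DevTools event
        event = json.loads(entry["message"])["message"]
        if event.get("method") != "Network.responseReceived":
            continue
        params = event.get("params", {})
        url = params.get("response", {}).get("url", "")
        if url.startswith(url_prefix) and "requestId" in params:
            request_ids.append(params["requestId"])
    return request_ids


def make_entry(method, request_id, url):
    """Build a fake log entry in the same shape as a real one (illustration only)."""
    event = {"method": method, "params": {"requestId": request_id, "response": {"url": url}}}
    return {"level": "INFO", "message": json.dumps({"message": event})}


sample_entries = [
    make_entry("Network.responseReceived", "1000.41", "https://passport.bilibili.com/login"),
    make_entry("Network.responseReceived", "1000.42", "https://api.geetest.com/get.php?is_next=true"),
]

print(find_request_ids(sample_entries, "https://api.geetest.com/get.php"))
```

Working on the parsed events avoids depending on how the JSON happens to be formatted on disk, which is what a regex over the dumped file has to assume.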

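2. Getting the response from the requestId in the request information

Inspiration: "A solution for getting the response content of requests in selenium"

The author of that article mentions:

// Get the response content of a request
session.getCommand().getNetwork().getResponseBody("requestIdxxxxx");

but that did not return the response content. He eventually found that it can be done through ExecuteSendCommandAndGetResult: passing cmd and params is enough to call this interface, and he finally implemented it in his own code. It is in Java, though, and I could not really follow it.

But I could look directly in the Python selenium source for a similar interface, and I actually found one.

The WebDriver class in selenium/webdriver/chrome/webdriver.py has a method, execute_cdp_cmd(), that does exactly this;

and the webdriver.Chrome() we normally use returns an instance of this WebDriver class.

The source is as follows: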

def execute_cdp_cmd(self, cmd, cmd_args):

    """
    Execute Chrome Devtools Protocol command and get returned result

    The command and command args should follow chrome devtools protocol domains/commands, refer to link
    https://chromedevtools.github.io/devtools-protocol/

    :Args:
     - cmd: A str, command name
     - cmd_args: A dict, command args. empty dict {} if there is no command args

    :Usage:
        driver.execute_cdp_cmd('Network.getResponseBody', {'requestId': requestId})

    :Returns:
        A dict, empty dict {} if there is no result to return.
        For example to getResponseBody:

        {'base64Encoded': False, 'body': 'response body string'}

    """

    return self.execute("executeCdpCommand", {'cmd': cmd, 'params': cmd_args})['value']

def execute(self, driver_command, params=None):

    """
    Sends a command to be executed by a command.CommandExecutor.

    :Returns:
     The command's JSON response loaded into a dictionary object.
    """

    response = self.command_executor.execute(driver_command, params)

    return response

# In RemoteConnection (the command_executor used above), abridged:
def execute(self, command, params):

    return self._request(command_info[0], url, body=data)

def _request(self, method, url, body=None):

    """
    Send an HTTP request to the remote server.

    :Returns:
      A dictionary with the server's parsed JSON response.
    """

    # Too long to quote in full; only the core logic is shown

    resp = self._conn.request(method, url, body=body, headers=headers)

    data = resp.data.decode('UTF-8')

    return data
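The docstring above shows that Network.getResponseBody returns a dict with a base64Encoded flag next to body. For a text response such as the JSON from get.php the flag is False, but binary responses come back base64-encoded. A small sketch of handling both cases follows; the helper name response_body_bytes is my own and not part of selenium.

```python
import base64


def response_body_bytes(cdp_result):
    """Turn the dict returned by Network.getResponseBody into raw bytes."""
    body = cdp_result.get("body", "")
    if cdp_result.get("base64Encoded"):
        # Binary payloads (e.g. images) are delivered base64-encoded
        return base64.b64decode(body)
    # Text payloads are delivered as a plain string
    return body.encode("utf-8")


text_result = {"base64Encoded": False, "body": '{"status": "ok"}'}
binary_result = {"base64Encoded": True, "body": base64.b64encode(b"\x89PNG").decode("ascii")}

print(response_body_bytes(text_result))
print(response_body_bytes(binary_result))
```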

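Approach:

1. Use a regular expression on the request information obtained earlier to find the requestId of the request we want.

2. Then call execute_cdp_cmd() directly, passing in the requestId.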

pat = r"""https://api\.geetest\.com/get\.php\?is_next.*?\".*?\"requestId\": \"(\d+?\.\d+?)\","""


requestId = re.findall(pat, browser_log, re.S)[0]

response_dict = driver.execute_cdp_cmd('Network.getResponseBody', {'requestId': requestId})

# body is the JSON data returned by the get endpoint mentioned earlier; it contains the CAPTCHA image URLs

body = response_dict["body"]

3. With the CAPTCHA image URLs in hand, request them with the requests module and save the images locally.
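As an alternative to pulling fullbg and bg out of the body with regular expressions, the body can be parsed as JSON. The sketch below is only that, a sketch: it assumes the body is either plain JSON or JSON wrapped in a JSONP-style callback, and it searches for the two keys at any depth because I have not pinned down the exact nesting. The sample body, the callback name and the image paths in it are invented for illustration.

```python
import json
import re
from urllib.parse import urljoin

STATIC_BASE = "https://static.geetest.com/"


def parse_body(body):
    """Parse a response body that is plain JSON or JSON wrapped in callback(...)."""
    text = body.strip()
    wrapped = re.match(r"^[\w$.]+\((.*)\)\s*;?$", text, re.S)
    if wrapped:
        text = wrapped.group(1)
    return json.loads(text)


def find_key(obj, key):
    """Depth-first search for the first value stored under key, or None."""
    if isinstance(obj, dict):
        if key in obj:
            return obj[key]
        children = list(obj.values())
    elif isinstance(obj, list):
        children = obj
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def image_urls(body):
    """Return (fullbg_url, bg_url) built from the relative paths in the body."""
    data = parse_body(body)
    urls = []
    for key in ("fullbg", "bg"):
        path = find_key(data, key)
        if path is None:
            raise KeyError(key)
        urls.append(urljoin(STATIC_BASE, path))
    return tuple(urls)


# Invented sample body, shaped only loosely like a real response
sample_body = 'callback_1({"status": "success", "data": {"fullbg": "pictures/full.jpg", "bg": "pictures/bg.jpg"}})'
print(image_urls(sample_body))
```

If either key is missing, image_urls raises KeyError instead of failing later with an IndexError on an empty regex result.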

Finally, the complete code:

import json
import requests
from selenium import webdriver
from selenium.webdriver.common.desired_capabilities import DesiredCapabilities
import time
import re

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36"
}


chrome_options = webdriver.ChromeOptions()
chrome_options.add_experimental_option('w3c', False)
caps = DesiredCapabilities.CHROME
caps['loggingPrefs'] = {'performance': 'ALL'}
driver = webdriver.Chrome(desired_capabilities=caps,options=chrome_options)
driver.get('https://passport.bilibili.com/login')


def input_click_01():
    input_name = driver.find_element_by_xpath("//input[@id='login-username']")
    input_pwd = driver.find_element_by_xpath("//input[@id='login-passwd']")

    input_name.send_keys("username")
    input_pwd.send_keys("password")

    time.sleep(3)
    login_btn = driver.find_element_by_class_name("btn-login")
    login_btn.click()
    time.sleep(5)


def browser_log_02():
    browser_log_list = driver.get_log("performance")

    # Save to a file first; handy for testing and for the regex matching later
    logs = [json.loads(log['message'])['message'] for log in browser_log_list]
    with open('devtools.json', 'w', encoding='utf-8') as f:
        json.dump(logs, f, indent=4, ensure_ascii=False)

    with open('devtools.json', 'r', encoding='utf-8') as f:
        browser_log = f.read()
    print("Browser log captured")
    return browser_log


def get_response_img_url_03(browser_log):
    # Get the requestId
    # Two kinds of match are found; take the first. No errors so far; filter further if exceptions appear
    pat = r"""https://api\.geetest\.com/get\.php\?is_next.*?\".*?\"requestId\": \"(\d+?\.\d+?)\","""
    requestId = re.findall(pat, browser_log, re.S)[0]
    # print(requestId)

    # The most important step: call the interface to get the request's response by requestId
    response_dict = driver.execute_cdp_cmd('Network.getResponseBody', {'requestId': requestId})
    body = response_dict["body"]
    # print(body)

    # Extract the image links from the response
    fullbg = re.findall(r"fullbg\":.\"(.*?)\",",body)
    bg = re.findall(r"\"bg\":.\"(.*?)\",",body)
    fullbg_url = "https://static.geetest.com/" + fullbg[0]
    bg_url = "https://static.geetest.com/" + bg[0]

    return fullbg_url,bg_url


def get_img_04(fullbg_url,bg_url):
    # Request the images
    origin_img_data = requests.get(fullbg_url, headers=headers).content
    fix_img_data = requests.get(bg_url, headers=headers).content

    # Save the images first
    with open("fullbg.jpg", "wb") as f:
        f.write(origin_img_data)
    with open("bg.png", "wb") as f:
        f.write(fix_img_data)
    print("Images saved")


def main():
    input_click_01()
    log_data = browser_log_02()
    url_tuple = get_response_img_url_03(log_data)
    get_img_04(*url_tuple)
    driver.close()

if __name__ == '__main__':
    main()