RPA: Cracking Slider CAPTCHAs of All Versions (2.0-3.0) with pyautogui + PIL + pandas and Extracting Table Data


My WeChat official account: 螺旋編程極客 >> I look forward to your follow


Introduction

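  Our company recently got a new requirement. The rough workflow is: open the Tianjin Market Entity Credit Information Publicity System (天津市市場主體信用信息公示系統), look up each company's shareholder information by the company name or tax number listed in an Excel sheet, take the tax numbers of those shareholders, query the shareholders of each shareholder in turn, and finally write the results back to Excel.

  Read Excel --> query the company's shareholders --> get the shareholders' tax numbers --> query each shareholder by tax number --> write the results to Excel. Rather tedious, isn't it? In one sentence: collect information about the shareholders of the shareholders and record it in Excel. When I first heard the requirement it seemed feasible in principle; the only real difficulty was the slider CAPTCHA. Once that is cracked, the rest is ordinary web data extraction.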

Cracking the slider CAPTCHA

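  Enough talk, let's write a crawler. Because of the slider CAPTCHA, a browser-driven crawler is the only practical choice. It is slower, but it can scrape almost anything, and capturing packets to analyze the request data is genuinely unpleasant work. Here I use the 藝賽旗 RPA Designer to help with the job. I have to say it is really handy: its Python library is powerful, and once the flow is designed it generates the Python code automatically, so I only need to care about the core algorithm and the business logic.

  First, a look at the CAPTCHA image: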

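  This is the fairly common Geetest (極驗) CAPTCHA, which many sites use. The government site is clearly a little behind: Geetest 3.0 has already been released, but this site is still on 2.0. The difference is that 2.0 first shows the complete image and only shows the image with the gap after you click the slider button, whereas 3.0 shows the gapped image from the start. Both can be cracked.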

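  • Cracking 2.0: take a screenshot of the original image and of the gapped image, compare the two pixel by pixel, and compute the distance the slider has to move. The success rate can reach 80%.
  • Cracking 3.0: (1) Some sites only hide the slider with a CSS style. Find the tag, run a JS snippet from Python to remove display:none, and the rest is the same as 2.0. (2) Take a screenshot of the gapped image directly. The RGB values inside the gap are usually all below 150, which is enough to locate the gap.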
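
The 2.0 comparison can be tried without any screenshots. The following is only a minimal sketch under stated assumptions: images are plain nested lists of RGB tuples (with PIL the same values would come from the pixel accessor), and `make_image`, `first_diff_column`, and the threshold of 60 are hypothetical names and values chosen for illustration, not the code used in this article.

```python
def make_image(width, height, color):
    """Build a width x height image as rows of RGB tuples (hypothetical helper)."""
    return [[color for _ in range(width)] for _ in range(height)]


def first_diff_column(full, gapped, start_x=0, threshold=60):
    """Return the first column where some pixel differs by >= threshold
    in at least one RGB channel, or None if the images match."""
    height, width = len(full), len(full[0])
    for x in range(start_x, width):
        for y in range(height):
            diffs = (abs(a - b) for a, b in zip(full[y][x], gapped[y][x]))
            if any(d >= threshold for d in diffs):
                return x
    return None


# Synthetic 100x40 background with a darker 20x20 "gap" starting at x=63.
full = make_image(100, 40, (200, 180, 160))
gapped = [row[:] for row in full]
for y in range(10, 30):
    for x in range(63, 83):
        gapped[y][x] = (60, 50, 40)

gap_x = first_diff_column(full, gapped, start_x=10)
print(gap_x)  # 63
```

Scanning column by column from the left means the first mismatch is the left edge of the gap, which is the quantity the slider distance is derived from.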

  Here we take 2.0 as the example; I will also post the core code for 3.0. First, the steps for cracking 2.0:

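Click search --> screenshot the CAPTCHA image --> click the slider button --> screenshot the gapped image --> compare pixels to compute the offset --> drag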

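  Because we use the RPA designer, we do not have to write the code for mouse clicks, screenshots, and so on. Pick the matching component, click the element on the page, and the designer generates the Python code. The generated code is highly encapsulated, but the source can be inspected at any time, and underneath it is the usual machinery. The only parts we have to write by hand are the offset calculation and the mouse movement. The designer does have its own mouse-drag component, but its drag is too straight and uniform, so it gets detected and the CAPTCHA reports "eaten by the monster" (被怪物吃掉). I therefore modified its source slightly and wrapped my own method. First, the flow chart of the CAPTCHA recognition: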

  Once the flow chart is designed, the designer generates the code automatically. The code is as follows:

# coding=utf-8
# Compile date: 2019-08-14 10:09:34
import time
import pdb
from ubpa.ilog import ILog
from ubpa.base_img import *
import getopt
from sys import argv
import sys
from ubpa.itools import rpa_import
GlobalFun = rpa_import.import_global_fun(__file__)
import ubpa.ibox as ibox
import ubpa.iexcel as iexcel
import ubpa.ifile as ifile
import ubpa.iie as iie
import ubpa.iimg as iimg
import ubpa.ikeyboard as ikeyboard

class getTjInfo:
     
    def __init__(self,**kwargs):
        self.__logger = ILog(__file__)
        self.path = set_img_res_path(__file__)
        self.robot_no = ''
        self.proc_no = ''
        self.job_no = ''
        if('robot_no' in kwargs.keys()):
            self.robot_no = kwargs['robot_no']
        if('proc_no' in kwargs.keys()):
            self.proc_no = kwargs['proc_no']
        if('job_no' in kwargs.keys()):
            self.job_no = kwargs['job_no']
    # CAPTCHA recognition
    def checkCode(self):
        existFlg=None
        distance=None
        xy=None
        imageTwo=None
        imageOne=None
        # Screenshot
        self.__logger.debug('Flow:checkCode,StepNodeTag:1313475530184,Note:')
        imageOne = iimg.capture_image(win_title=r'天津市市場主體',win_text=r'',in_img_path=r'C:/Users/Administrator/Desktop/',left_indent=823,top_indent=521,width=266,height=121,waitfor=30)
        # Mouse click
        self.__logger.debug('Flow:checkCode,StepNodeTag:1313475530183,Note:')
        iie.do_click_pos(win_title=r'天津市市場主體',url=r'http://credit.scjg.tj.gov.cn/gsxt/',selector=r'.gt_holder gt_popup gt_show > DIV:nth-of-type(2) > DIV:nth-of-type(2) > DIV:nth-of-type(2) > DIV:nth-of-type(2)',button=r'left',curson=r'center',times=1,run_mode=r'unctrl',waitfor=10,scroll_view='no')
        time.sleep(4)
        # Screenshot
        self.__logger.debug('Flow:checkCode,StepNodeTag:1313475530186,Note:')
        imageTwo = iimg.capture_image(win_title=r'天津市市場主體',win_text=r'',in_img_path=r'C:/Users/Administrator/Desktop/',left_indent=823,top_indent=521,width=266,height=121,waitfor=30)
        # Custom function
        self.__logger.debug('Flow:checkCode,StepNodeTag:1313475530182,Note:')
        distance = GlobalFun.get_distance(imageOne,imageTwo)
        # Get element position
        self.__logger.debug('Flow:checkCode,StepNodeTag:1313475530181,Note:')
        xy = iie.get_element_rect(win_title=r'天津市市場主體',url=r'http://credit.scjg.tj.gov.cn/gsxt*',selector=r'.gt_holder gt_popup gt_show > DIV:nth-of-type(2) > DIV:nth-of-type(2) > DIV:nth-of-type(2) > DIV:nth-of-type(2)',curson=r'center',waitfor=10)
        # Code block
        self.__logger.debug('Flow:checkCode,StepNodeTag:1313475530180,Note:')
        print(xy)
        lastxy=(xy[0]+distance,xy[1],xy[2],xy[3])
        print(lastxy)
        # Hard-coded corrections for two specific target positions
        if(xy==(847.0, 682.0, 44, 44) and lastxy==(900.0, 682.0, 44, 44)):
          print('corrected')
          lastxy=(895.0, 682.0, 44, 44)
        if(xy==(847.0, 682.0, 44, 44) and lastxy==(976.0, 682.0, 44, 44)):
          print('corrected')
          lastxy=(868.0, 682.0, 44, 44)
        # Custom function
        self.__logger.debug('Flow:checkCode,StepNodeTag:1313475530185,Note:')
        GlobalFun.myDo_drag_to(win_title=r'天津市市場主體', srcpos=xy,distpos=lastxy)
        # Delete file
        self.__logger.debug('Flow:checkCode,StepNodeTag:1313475530187,Note:')
        ifile.del_file(file=imageOne)
        # Delete file
        self.__logger.debug('Flow:checkCode,StepNodeTag:1313475530188,Note:')
        ifile.del_file(file=imageTwo)
        # Image detection
        self.__logger.debug('Flow:checkCode,StepNodeTag:13140204311151,Note:')
        time.sleep(3.5)
        existFlg = iimg.img_exists(win_title=r'天津市市場主體',img_res_path=self.path,image=r'snapshot_20190813135330024.png',fuzzy=True,confidence=0.85,waitfor=3)
        # IF branch
        self.__logger.debug('Flow:checkCode,StepNodeTag:13140549531176,Note:')
        if existFlg:
            # Message box
            self.__logger.debug('Flow:checkCode,StepNodeTag:13143951406201,Note:')
            ibox.msg_box(msg='Verification failed, retrying!',timeout=1.5)
            time.sleep(1)
            # Mouse click
            self.__logger.debug('Flow:checkCode,StepNodeTag:13140738964184,Note:')
            iie.do_click_pos(win_title=r'天津市市場主體',url=r'http://credit.scjg.tj.gov.cn/gsxt*',selector=r'.gt_holder gt_popup gt_show > DIV:nth-of-type(2) > DIV:nth-of-type(2) > DIV:nth-of-type(1) > DIV:nth-of-type(3) > A:nth-of-type(1)',button=r'left',curson=r'center',times=1,run_mode=r'unctrl',waitfor=100,scroll_view='no')
            time.sleep(1.5)
            # Return
            self.__logger.debug('Flow:checkCode,StepNodeTag:13140556594179,Note:')
            return True
        else:
            # Return
            self.__logger.debug('Flow:checkCode,StepNodeTag:13140620186183,Note:')
            return False
        # Code block (unreachable: both branches above return)
        self.__logger.debug('Flow:checkCode,StepNodeTag:13141700326199,Note:')
        print(existFlg)
    # Process table data
    def dealTableData(self,tableData=None):
        currentCom=None
        currentTableData=None
        currentComName=None
        # Code block
        self.__logger.debug('Flow:dealTableData,StepNodeTag:13161341316275,Note:')
        columns=tableData.columns
        realDataList=tableData.values.tolist()
        # Hotkey input
        self.__logger.debug('Flow:dealTableData,StepNodeTag:13161638010281,Note:')
        ikeyboard.key_send_cs(win_title=r'天津市市場主體',text='^{F4}',waitfor=10)
        # Hotkey input
        self.__logger.debug('Flow:dealTableData,StepNodeTag:13164935964357,Note:')
        ikeyboard.key_send_cs(text='^{F4}',waitfor=10)
        # Message box
        self.__logger.debug('Flow:dealTableData,StepNodeTag:13161731539284,Note:')
        ibox.msg_box(msg='Start processing second-level company data',timeout=2)
        time.sleep(0.002)
        # For loop
        self.__logger.debug('Flow:dealTableData,StepNodeTag:13161910442289,Note:')
        for i in range(len(realDataList)):
            # Code block
            self.__logger.debug('Flow:dealTableData,StepNodeTag:13162051185291,Note:')
            currentList=realDataList[i]
            if(columns[0]=='有限責任公司本年度是否有股權轉讓 '):
                currentCom=currentList[0]
                currentComName=currentList[0]
            if(columns[0]=='企業是否有股權信息或購買其它公司股權'):
                currentCom=currentList[0]
                currentComName=currentList[1]
            if("天津" not in currentCom):
                continue
            time.sleep(1)
            # Subflow: finishCheckCode
            self.__logger.debug('Flow:dealTableData,StepNodeTag:13161503452279,Note:')
            self.finishCheckCode(comName=currentComName)
            # Subflow: goToDetail
            self.__logger.debug('Flow:dealTableData,StepNodeTag:13163507635351,Note:')
            currentTableData=self.goToDetail()
            # Code block
            self.__logger.debug('Flow:dealTableData,StepNodeTag:13170553111368,Note:')
            # DataFrame.drop returns a new frame, so keep the results
            table0=currentTableData[0].drop(['變動後股權比例','股權變動日期'], axis=1)
            table1=currentTableData[1].drop(['投資設立企業後購買股權企業名稱',r'統一社會信用代碼/註冊號'], axis=1)
            lastTableData0=table0.values.tolist()
            lastTableData1=table1.values.tolist()
            # Insert rows
            self.__logger.debug('Flow:dealTableData,StepNodeTag:13171433395371,Note:')
            iexcel.ins_row(path='C:/Users/Administrator/Desktop/testData.xlsx',data=lastTableData1)
            # Hotkey input
            self.__logger.debug('Flow:dealTableData,StepNodeTag:13165826952364,Note:')
            ikeyboard.key_send_cs(win_title=r'天津市市場主體',text='^{F4}',waitfor=10)
            # Hotkey input
            self.__logger.debug('Flow:dealTableData,StepNodeTag:13165850702366,Note:')
            ikeyboard.key_send_cs(text='^{F4}',waitfor=10)
    # Complete verification
    def finishCheckCode(self,comName='911200006630613577'):
        # Open website
        self.__logger.debug('Flow:finishCheckCode,StepNodeTag:1308451082917,Note:')
        iie.open_url(url=r'http://credit.scjg.tj.gov.cn/gsxt/')
        # Mouse click
        self.__logger.debug('Flow:finishCheckCode,StepNodeTag:1308451082912,Note:')
        iie.do_click_pos(win_title=r'天津市市場主體',url=r'http://credit.scjg.tj.gov.cn/gsxt/#',selector=r'http-equiv="x-ua-compatible":nth-of-type(1) > DIV:nth-of-type(3) > DIV:nth-of-type(2) > UL:nth-of-type(1) > LI:nth-of-type(2) > A:nth-of-type(1)',button=r'left',curson=r'center',times=1,run_mode=r'unctrl',waitfor=300,scroll_view='no')
        time.sleep(0.5)
        # Mouse click
        self.__logger.debug('Flow:finishCheckCode,StepNodeTag:1308451082913,Note:')
        iie.do_click_pos(win_title=r'天津市市場主體',url=r'http://credit.scjg.tj.gov.cn/gsxt/#',selector=r'http-equiv="x-ua-compatible":nth-of-type(1) > DIV:nth-of-type(3) > DIV:nth-of-type(2) > UL:nth-of-type(1) > LI:nth-of-type(2) > A:nth-of-type(1)',button=r'left',curson=r'center',offsetY=45,times=1,run_mode=r'unctrl',waitfor=10,scroll_view='no')
        time.sleep(2)
        # Set text
        self.__logger.debug('Flow:finishCheckCode,StepNodeTag:130845108293,Note:')
        iie.set_text(url=r'http://credit.scjg.tj.gov.cn/gsxt/',selector=r'#searchName',text=comName,waitfor=10)
        # Mouse click
        self.__logger.debug('Flow:finishCheckCode,StepNodeTag:130845108292,Note:')
        iie.do_click_pos(win_title=r'天津市市場主體',url=r'http://credit.scjg.tj.gov.cn/gsxt/',selector=r'#entSearchLink',button=r'left',curson=r'center',times=1,run_mode=r'unctrl',waitfor=10,scroll_view='no')
        time.sleep(2.5)
        # While loop
        self.__logger.debug('Flow:finishCheckCode,StepNodeTag:13135310584131,Note:')
        while True:
            # Subflow: checkCode (returns True when verification failed)
            self.__logger.debug('Flow:finishCheckCode,StepNodeTag:13140310308161,Note:')
            tvar13140310308161=self.checkCode()
            # IF branch
            self.__logger.debug('Flow:finishCheckCode,StepNodeTag:13135336032135,Note:')
            if tvar13140310308161:
                pass
            else:
                # Break
                self.__logger.debug('Flow:finishCheckCode,StepNodeTag:13135345176138,Note:')
                break
    # Get shareholder company info
    def getChildCom(self):
        tableData2=None
        tableData1=None
        table2Columns=None
        table1Columns=None
        tableDatas=None
        # Subflow: goToDetail
        self.__logger.debug('Flow:getChildCom,StepNodeTag:13162921190325,Note:')
        tableDatas=self.goToDetail()
        # IF branch ('否' means "No")
        self.__logger.debug('Flow:getChildCom,StepNodeTag:13145633622210,Note:')
        if tableDatas[0].columns[1]=='否':
            pass
        else:
            # Code block
            self.__logger.debug('Flow:getChildCom,StepNodeTag:13163242189338,Note:')
            tableData1=tableDatas[0]
            # Subflow: dealTableData
            self.__logger.debug('Flow:getChildCom,StepNodeTag:13163127828333,Note:')
            self.dealTableData(tableData=tableData1)
        # IF branch ('否' means "No")
        self.__logger.debug('Flow:getChildCom,StepNodeTag:13145822725214,Note:')
        if tableDatas[1].columns[1]=='否':
            pass
        else:
            # Code block
            self.__logger.debug('Flow:getChildCom,StepNodeTag:13163302199339,Note:')
            tableData2=tableDatas[1]
            # Subflow: dealTableData
            self.__logger.debug('Flow:getChildCom,StepNodeTag:13163131389335,Note:')
            self.dealTableData(tableData=tableData2)
        # Message box
        self.__logger.debug('Flow:getChildCom,StepNodeTag:13151905124229,Note:')
        ibox.msg_box(msg='Current company processed, moving on to the next...',timeout=1.5)
        time.sleep(1.5)
    # Go to detail page
    def goToDetail(self):
        table2Data=None
        table1Data=None
        # Mouse click
        self.__logger.debug('Flow:goToDetail,StepNodeTag:13162716929301,Note:')
        iie.do_click_pos(win_title=r'天津市市場主體',url=r'http://credit.scjg.tj.gov.cn/gsxt/',selector=r'#center_content > DIV:nth-of-type(1) > DIV:nth-of-type(2) > DIV:nth-of-type(2) > UL:nth-of-type(1) > LI:nth-of-type(1) > H1:nth-of-type(1) > A:nth-of-type(1)',button=r'left',curson=r'center',times=1,run_mode=r'unctrl',waitfor=20,scroll_view='no')
        time.sleep(1.5)
        # Mouse click
        self.__logger.debug('Flow:goToDetail,StepNodeTag:13162716929300,Note:')
        iie.do_click_pos(win_title=r'天津市市場主體',url=r'http://credit.scjg.tj.gov.cn/gsxt/',selector=r'#tabs > DIV:nth-of-type(1) > DIV:nth-of-type(3) > SPAN:nth-of-type(1)',button=r'left',curson=r'center',times=1,run_mode=r'unctrl',waitfor=30,scroll_view='no')
        # Mouse click
        self.__logger.debug('Flow:goToDetail,StepNodeTag:13162716929299,Note:')
        iie.do_click_pos(win_title=r'天津市市場主體',url=r'http://credit.scjg.tj.gov.cn/gsxt/',selector=r'#tableInfoDiv > DIV:nth-of-type(2) > TABLE:nth-of-type(1) > TBODY:nth-of-type(1) > TR:nth-of-type(3) > TD:nth-of-type(4) > A:nth-of-type(1)',button=r'left',curson=r'center',times=1,run_mode=r'unctrl',waitfor=30,scroll_view='no')
        # Custom function
        self.__logger.debug('Flow:goToDetail,StepNodeTag:13162716929298,Note:equity transfer')
        table1Data = GlobalFun.getTableData('年報詳情','#show_alter')
        # Custom function
        self.__logger.debug('Flow:goToDetail,StepNodeTag:13162716929297,Note:equity purchases')
        table2Data = GlobalFun.getTableData('年報詳情','#show_invest')
        # Return
        self.__logger.debug('Flow:goToDetail,StepNodeTag:13162803390322,Note:')
        return table1Data,table2Data
      
    def Main(self):
        # Subflow: finishCheckCode
        self.__logger.debug('Flow:Main,StepNodeTag:13165330947360,Note:')
        self.finishCheckCode(comName='911200006630613577')
        # Subflow: getChildCom
        self.__logger.debug('Flow:Main,StepNodeTag:13151828292226,Note:')
        self.getChildCom()
        # Message box
        self.__logger.debug('Flow:Main,StepNodeTag:13152147770235,Note:')
        ibox.msg_box(msg='All data processed!')
 
if __name__ == '__main__':
    robot_no = ''
    proc_no = ''
    job_no = ''
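    try:
        argv = sys.argv[1:]
        # Long option names must end in '=' (no spaces) to accept a value
        opts, args = getopt.getopt(argv,"hr:p:j:",["robot=","proc=","job="])
    except getopt.GetoptError:
        print ('robot.py -r <robot> -p <proc> -j <job>')
        # Exit here: opts is undefined when parsing fails
        sys.exit(2)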
    for opt, arg in opts:
        if opt == '-h':
            print ('robot.py -r <robot> -p <proc> -j <job>')
        elif opt in ("-r", "--robot"):
            robot_no = arg
        elif opt in ("-p", "--proc"):
            proc_no = arg
        elif opt in ("-j", "--job"):
            job_no = arg
    pro = getTjInfo(robot_no=robot_no,proc_no=proc_no,job_no=job_no)
    pro.Main()

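
In the generated code, `finishCheckCode` calls `checkCode` in a `while True` loop with no upper bound, and `checkCode` returns True when verification failed. A bounded variant could look roughly like the sketch below. `solve_with_retries`, `max_attempts`, and the stand-in lambda are hypothetical names used only for illustration; nothing here is part of the designer's library.

```python
def solve_with_retries(attempt, max_attempts=5):
    """Call attempt() until it reports success or the budget runs out.

    attempt() follows the convention of checkCode in the article:
    it returns True when verification FAILED and a retry is needed.
    Returns the number of attempts used, or raises RuntimeError.
    """
    for n in range(1, max_attempts + 1):
        failed = attempt()
        if not failed:
            return n
    raise RuntimeError("CAPTCHA not solved after %d attempts" % max_attempts)


# Stand-in for checkCode: fails twice, then succeeds.
outcomes = iter([True, True, False])
used = solve_with_retries(lambda: next(outcomes))
print(used)  # 3
```

Capping the attempts keeps an unattended robot from looping forever if the page layout changes and every drag starts failing.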

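  Below is the code of the global functions that are used. The PIL library reads the images and processes the pixels (see get_distance); the pyautogui library operates on the browser page, here mainly to control the mouse drag (see myDo_drag_to); and the pandas library extracts the table data from the page (see getTableData):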

# Compile date: 2019-08-12 10:47:48
# coding=utf-8
from selenium.webdriver.common.action_chains import ActionChains
from selenium import webdriver
import time
from PIL import Image
import ubpa.ics as ics
import pyautogui
from ubpa import iwin
import math
import ubpa.iie as iie
import re
import pandas as pd

def getTableData(titleStr,selectorStr):
    table_string = iie.get_html(title=titleStr,selector=selectorStr,waitfor=30)
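    # Both patterns are empty as published, so these substitutions leave the HTML unchanged
    tb_start = re.compile('')
    tb_end = re.compile('')
    last_str = tb_end.sub('', tb_start.sub('', table_string))
    # Uses pandas' read_html; note header=0, since some tables have their header on a different row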
    data = pd.read_html(last_str, flavor="bs4", header=0)[0]
    print(data)
    print(data.columns)
    return data

def get_point_axis(axis_list,distpos,point):
    pos_val_list = []

    for i in range(1, 10000):
        if i >= point:
            break
        n = len(axis_list) * (i / (point + 1))
        pos_val = axis_list[int(n)]
        pos_val_list.append(pos_val)

    pos_val_list.append(distpos)

    return pos_val_list

def get_axis_list(srcpos=(0, 0), distpos=(0, 0)):
    pos_list = []
    x1 = srcpos[0]
    y1 = srcpos[1]
    x2 = distpos[0]
    y2 = distpos[1]

    if x1 == x2:
        if y1 > y2:
            for i in range(math.ceil(y2), int(y1) + 1):
                pos_list.append((x1, i))
            # Reverse once, after the loop, so the points run from y1 towards y2
            pos_list.reverse()
        elif y1 < y2:
            for i in range(math.ceil(y1), int(y2) + 1):
                pos_list.append((x1, i))
        else:
            pos_list = []
    else:
        if y1 == y2:
            if x1 < x2:
                x1 = math.ceil(x1)
                x2 = int(x2)
                length = x2 - x1
                for i in range(0, length + 1):
                    pos_list.append((x1 + i, y2))
            elif x1 > x2:
                x1 = int(x1)
                x2 = math.ceil(x2)
                length = x1 - x2
                for i in range(0, length + 1):
                    # Moving left: step down from x1 towards x2
                    pos_list.append((x1 - i, y2))
        else:
            if x1 < x2:
                for i in range(math.ceil(x1), int(x2) + 1):
                    if y1 < y2:
                        h = (i - x1) * (y2 - y1) / (x2 - x1)
                        pos_list.append((i, y1 + h))
                    else:
                        h = (i - x1) * (y1 - y2) / (x2 - x1)
                        pos_list.append((i, y1 - h))
            else:
                for i in range(math.ceil(x2), int(x1) + 1):
                    if y1 < y2:
                        h = (i - x2) * (y2 - y1) / (x1 - x2)
                        pos_list.append((i, y2 - h))
                    else:
                        h = (i - x2) * (y1 - y2) / (x1 - x2)
                        pos_list.append((i, y2 + h))

                pos_list.reverse()

    return pos_list

def myDo_drag_to(win_title=None, srcpos=(0,0), distpos=(0,0), point=0, stimes=1, model=pyautogui.easeInOutQuad, waitfor=10):
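    '''Drag for the CAPTCHA with the left button held down.

    srcpos: start position; srcpos[0] and srcpos[1] are its x and y coordinates
    distpos: end position; distpos[0] and distpos[1] are its x and y coordinates
    point: number of pauses along the way, default 0
    stimes: speed of each move (passed to pyautogui as the duration), default 1
    model: easing function; easeInQuad starts slow and ends fast, easeOutQuad
        starts fast and ends slow, easeInOutQuad starts and ends slow and is
        fast in the middle, easeInBounce adds a bounce, easeInElastic adds an
        elastic oscillation
    '''
    try:
        if win_title != None and win_title.strip() != '':
            # If the window is not active, activate it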
            if not iwin.do_win_is_active(win_title):
                iwin.do_win_activate(win_title=win_title, waitfor=2)

        pyautogui.moveTo(srcpos[0], srcpos[1], 0.5)

        pyautogui.mouseDown(button='left', _pause=True)

        axis_list = get_axis_list(srcpos, distpos)
        if len(axis_list) > 0:

            pos_val_list = get_point_axis(axis_list, distpos, point)
           # print(pos_val_list)
            for index in pos_val_list:

                pyautogui.dragTo(float(index[0]+20), float(index[1]), stimes, model)
                time.sleep(0.5)
                pyautogui.dragTo(float(index[0]-5), float(index[1]), stimes, model)
                time.sleep(0.5)
                pyautogui.dragTo(float(index[0]), float(index[1]), stimes, model)

            time.sleep(0.5)
            pyautogui.mouseUp(button='left', _pause=True)
    except Exception as e:
        raise e

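# Compute the offset for 2.0
def get_distance(imageOne,imageTwo):
    '''Return the distance the slider needs to move.

    :param imageOne: screenshot file without the gap
    :param imageTwo: screenshot file with the gap
    :return: the distance to move, in pixels
    '''
    threshold=150
    left=60
    image1 = Image.open(imageOne)
    image2 = Image.open(imageTwo)
    # Load the pixel accessors once instead of once per pixel
    pixels1 = image1.load()
    pixels2 = image2.load()
    for i in range(left,image1.size[0]):
        for j in range(image1.size[1]):
            rgb1=pixels1[i,j]
            rgb2=pixels2[i,j]
            res1=abs(rgb1[0]-rgb2[0])
            res2=abs(rgb1[1]-rgb2[1])
            res3=abs(rgb1[2]-rgb2[2])
            if not (res1 < threshold and res2 < threshold and res3 < threshold):
                print(i-7)
                return i-7 # testing showed an error of about 7 pixels
    # No differing pixel found: fall back to the last column scanned
    print(i-7)
    return i-7 # testing showed an error of about 7 pixels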
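
The idea behind myDo_drag_to (overshoot the target, pull back a little, then settle, with a non-linear speed profile) can be looked at in isolation as a list of x positions. The sketch below moves no mouse. `ease_in_out_quad` is the standard quadratic ease-in-out formula; `build_track` and its `steps`, `overshoot`, and `undershoot` parameters are hypothetical and only mirror the +20 / -5 pattern in the code above.

```python
def ease_in_out_quad(t):
    """Standard quadratic ease-in-out on [0, 1]: slow start, fast middle, slow end."""
    if t < 0.5:
        return 2 * t * t
    return 1 - 2 * (1 - t) ** 2


def build_track(start_x, target_x, steps=20, overshoot=20, undershoot=5):
    """Return x positions: eased move past the target, pull back short of it, settle."""
    peak = target_x + overshoot
    track = [start_x + (peak - start_x) * ease_in_out_quad(i / steps)
             for i in range(1, steps + 1)]
    track.append(target_x - undershoot)  # pull back slightly before the target
    track.append(target_x)               # final settle on the target
    return track


track = build_track(847, 900)
print(track[-3:])  # [920.0, 895, 900]
```

Each position would then be handed to a mouse-move call between pressing and releasing the button; keeping the track generation separate makes the shape of the movement easy to inspect and tune.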

  That is the code for the whole flow; I have posted all of it here. The offset calculation for cracking the 3.0 CAPTCHA is as follows:

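# Cracking Geetest 3.0
def get_gap(image):
    """Get the offset of the gap.

    :param image: PIL image containing the gap
    :return: x offset of the gap
    """
    # left_list stores every (x, count) candidate that meets the condition
    left_list = []
    # Only the x coordinate of the notch is needed, so instead of scanning every row we sample a few evenly spaced rows
    for y in [10 * k for k in range(1, image.size[1] // 11)]:
        # Start scanning x at image.size[0]/5.16, because the notch never lies within the first ~50 pixels
        for x in range(int(image.size[0] / 5.16), image.size[0] - int(image.size[0] / 8.6)):
            if is_pixel_equal(image, x, y, left_list):
                break
    # In each (x, z) pair, x is the left edge of the notch and z is the count: how many consecutive pixels to the right of x have R, G and B all below 150. Scanning several rows yields several candidates; the one whose z is closest to 40 fits best
    left_list = sorted(left_list, key=lambda item: abs(item[1] - 40))
    # Take x of the best candidate; subtract 7 or 14 from the result, 7 is usually enough
    return left_list[0][0] - 7

def is_pixel_equal(image, x, y, left_list):
    """Check whether a dark run of plausible notch width starts at (x, y).

    :param image: PIL image
    :param x: x position
    :param y: y position
    :param left_list: list that receives (x, count) when the check passes
    :return: True if a candidate was recorded, otherwise False
    """
    # Load the pixel accessor once
    pixels = image.load()
    pixel1 = pixels[x, y]
    threshold = 150
    # count records how many consecutive pixels to the right have R, G and B all below 150
    count = 0
    # If R, G and B of this pixel are all below 150, walk right and count the dark pixels
    if pixel1[0] < threshold and pixel1[1] < threshold and pixel1[2] < threshold:
        for i in range(x + 1, image.size[0]):
            pixel = pixels[i, y]
            if pixel[0] < threshold and pixel[1] < threshold and pixel[2] < threshold:
                count += 1
            else:
                break
    if int(image.size[0]/8.6) < count < int(image.size[0]/4.3):
        left_list.append((x, count))
        return True
    else:
        return False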
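
The heart of the 3.0 heuristic is a one-dimensional search: on a single scan row, find the run of consecutive dark pixels whose length is closest to the expected notch width. The sketch below shows that step on a synthetic row. It is an illustration under assumptions, not the article's implementation: the row is a plain list of RGB tuples, `is_dark`, `dark_runs`, and `pick_gap` are hypothetical names, and a run's length here includes its first pixel.

```python
def is_dark(pixel, threshold=150):
    """True when R, G and B are all below the threshold."""
    return all(channel < threshold for channel in pixel[:3])


def dark_runs(row, threshold=150):
    """Yield (start_x, length) for each maximal run of dark pixels in a row."""
    start = None
    for x, pixel in enumerate(row):
        if is_dark(pixel, threshold):
            if start is None:
                start = x
        elif start is not None:
            yield start, x - start
            start = None
    if start is not None:
        yield start, len(row) - start


def pick_gap(row, expected_width=40):
    """Return the start of the run whose length is closest to expected_width."""
    runs = list(dark_runs(row))
    if not runs:
        return None
    return min(runs, key=lambda run: abs(run[1] - expected_width))[0]


# Synthetic row: light background, a 3-pixel dark speck, then a 38-pixel gap.
light, dark = (220, 210, 200), (90, 80, 70)
row = [light] * 30 + [dark] * 3 + [light] * 50 + [dark] * 38 + [light] * 20
print(list(dark_runs(row)))  # [(30, 3), (83, 38)]
print(pick_gap(row))         # 83
```

Ranking runs by how close their length is to the expected width is what lets small dark specks in the background image be ignored.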

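  The code is commented throughout, and it is easy to follow if you read it patiently.

  There is also a handy way to process tables on a page. It already appears in the code above: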

def getTableData(titleStr,selectorStr):
    table_string = iie.get_html(title=titleStr,selector=selectorStr,waitfor=30)
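    # Both patterns are empty as published, so these substitutions leave the HTML unchanged
    tb_start = re.compile('')
    tb_end = re.compile('')
    last_str = tb_end.sub('', tb_start.sub('', table_string))
    # Uses pandas' read_html; note header=0, since some tables have their header on a different row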
    data = pd.read_html(last_str, flavor="bs4", header=0)[0]
    print(data)
    print(data.columns)
    return data

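  titleStr is the browser title; the window is matched as long as its title contains the argument. selectorStr is a CSS selector string. CSS selectors are natively supported by the designer and matter a great deal in scraping generally; look them up if they are unfamiliar. iie is a component of the designer's own Python library that can read information directly from a page that is already open. Pass this method the location of a table on the page and it converts the table into a DataFrame. I have to say, pandas really is handy!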
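
To make concrete what "use row 0 as the header and turn the rest into records" means, here is a dependency-free stand-in built on the standard library's `html.parser`. It is not how pandas implements `read_html`, only a rough sketch of the same idea; `TableRows`, `table_to_records`, and the sample HTML are invented for illustration.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def table_to_records(html, header=0):
    """Use row `header` as column names and return the later rows as dicts."""
    parser = TableRows()
    parser.feed(html)
    columns = parser.rows[header]
    return [dict(zip(columns, row)) for row in parser.rows[header + 1:]]


sample = """
<table>
  <tr><th>Shareholder</th><th>Credit code</th></tr>
  <tr><td>Example Co.</td><td>000000000000000000</td></tr>
</table>
"""
records = table_to_records(sample)
print(records)
```

If a page puts a title row above the real column names, passing `header=1` here plays the same role as changing `header` in the `read_html` call.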

Final results

  The CAPTCHA run; on failure it retries automatically:

  The overall run; in the end the company table data is scraped and written to Excel:

  Thanks for reading!