My WeChat official account: 螺旋編程極客 >> I look forward to your follow
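My company recently came up with a new requirement. The rough process: open the Tianjin market entity credit information publicity system (天津市市場主體信用信息公示系統), look up each company's shareholder information by the company name or tax number listed in an Excel sheet, take the tax numbers of those shareholders, query the shareholders of each shareholder in turn, and finally write the results back to Excel. That is: read Excel --> query the company's shareholders --> get the shareholders' tax numbers --> query each shareholder's own shareholders by tax number --> write the results to Excel. Rather exasperating, isn't it? In one sentence: collect information about the shareholders' shareholders into Excel. When I first heard the requirement it seemed feasible in theory. The only hard part is the slider captcha; once that is cracked, the rest is just extracting data from web pages.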
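The two-level lookup described above can be sketched independently of the browser automation. This is only an illustration under assumptions: `collect_shareholders` and the `lookup` callable are hypothetical names, and `lookup` stands in for whatever actually queries the site for one company.

```python
def collect_shareholders(company_ids, lookup, depth=2):
    """Walk company -> shareholders -> their shareholders, level by level.

    lookup(company_id) is a hypothetical callable returning the list of
    shareholder identifiers (for example tax numbers) for one company.
    Returns rows of (level, parent, shareholder), ready to write to Excel.
    """
    rows = []
    seen = set(company_ids)
    frontier = list(company_ids)
    for level in range(1, depth + 1):
        next_frontier = []
        for parent in frontier:
            for child in lookup(parent):
                rows.append((level, parent, child))
                # Query each shareholder only once, even if it appears twice.
                if child not in seen:
                    seen.add(child)
                    next_frontier.append(child)
        frontier = next_frontier
    return rows
```

With `depth=2` this stops at "the shareholders' shareholders", which is all the requirement asks for.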
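Enough talk, time for a crawler. Because of the slider captcha I had to go with a browser-driven crawler. It is slower, but it can scrape anything, and picking apart request data from packet captures is enough to make you sick. Here I use the 「藝賽旗RPA設計器」 (RPA designer) to help with the work. I have to say it is really handy, and its Python library is powerful: once the flow is designed it generates the Python code automatically, so you only need to care about the core algorithm and business logic, which gets twice the result for half the effort. First, a look at the captcha.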
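It is the fairly common 「極驗」 (GeeTest) captcha, which many sites use. The government site is clearly a bit behind, though: "極驗3.0" has already been released, while this one is still on 2.0. The difference is that 2.0 first shows the complete image and only shows the image with the gap after you click the slider button, whereas 3.0 shows the image with the gap from the start. That one can be cracked too.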
Here we take 2.0 as the example; I will also paste the core code for 3.0. First, the steps for cracking 2.0:
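Click search --> capture the captcha image --> click the slider button --> capture the image with the gap --> compare pixels to compute the offset --> drag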
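The "compare pixels to compute the offset" step is the heart of it. As a minimal sketch of the idea, assuming both screenshots are already available as rows of `(r, g, b)` tuples of the same size (the function name `first_diff_column` is mine, and the real code later in this post works on PIL images instead of lists):

```python
def first_diff_column(full, gapped, start_x=0, threshold=150):
    """Return the leftmost x where the two images differ strongly, else None.

    full / gapped: lists of rows, each row a list of (r, g, b) tuples.
    A pixel counts as different when any channel differs by >= threshold.
    """
    height = len(full)
    width = len(full[0])
    # Scan column by column so the first hit is the left edge of the gap.
    for x in range(start_x, width):
        for y in range(height):
            p1, p2 = full[y][x], gapped[y][x]
            if any(abs(a - b) >= threshold for a, b in zip(p1, p2)):
                return x
    return None
```

`start_x` lets you skip the left part of the image where the slider piece itself sits, so it is not mistaken for the gap.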
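Because we are using the RPA designer, we do not have to write code for things like mouse clicks and screenshots ourselves. Pick the matching component, click the element on the page, and it generates the Python code for us. The generated code is highly encapsulated, but the source can be inspected at any time, and underneath it is still the same old machinery. The only things we have to write by hand are the offset calculation and the mouse movement. The tool does have its own mouse-drag component, but its drag is too straight and uniform and gets detected, producing the 「被怪物吃掉」 ("eaten by the monster") message, so I modified its source slightly and wrapped a method of my own. Below is the captcha-recognition flow, as the Python code generated for it: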
# coding=utf-8
# Build date: 2019-08-14 10:09:34
import time
import pdb
from ubpa.ilog import ILog
from ubpa.base_img import *
import getopt
from sys import argv
import sys
from ubpa.itools import rpa_import
GlobalFun = rpa_import.import_global_fun(__file__)
import ubpa.ibox as ibox
import ubpa.iexcel as iexcel
import ubpa.ifile as ifile
import ubpa.iie as iie
import ubpa.iimg as iimg
import ubpa.ikeyboard as ikeyboard

class getTjInfo:

    def __init__(self,**kwargs):
        self.__logger = ILog(__file__)
        self.path = set_img_res_path(__file__)
        self.robot_no = ''
        self.proc_no = ''
        self.job_no = ''
        if('robot_no' in kwargs.keys()):
            self.robot_no = kwargs['robot_no']
        if('proc_no' in kwargs.keys()):
            self.proc_no = kwargs['proc_no']
        if('job_no' in kwargs.keys()):
            self.job_no = kwargs['job_no']
    # Captcha recognition: returns True when the attempt failed and must be retried
    def checkCode(self):
        existFlg=None
        distance=None
        xy=None
        imageTwo=None
        imageOne=None
        # Screenshot (complete image)
        self.__logger.debug('Flow:checkCode,StepNodeTag:1313475530184,Note:')
        imageOne = iimg.capture_image(win_title=r'天津市市場主體',win_text=r'',in_img_path=r'C:/Users/Administrator/Desktop/',left_indent=823,top_indent=521,width=266,height=121,waitfor=30)
        # Mouse click (slider button)
        self.__logger.debug('Flow:checkCode,StepNodeTag:1313475530183,Note:')
        iie.do_click_pos(win_title=r'天津市市場主體',url=r'http://credit.scjg.tj.gov.cn/gsxt/',selector=r'.gt_holder gt_popup gt_show > DIV:nth-of-type(2) > DIV:nth-of-type(2) > DIV:nth-of-type(2) > DIV:nth-of-type(2)',button=r'left',curson=r'center',times=1,run_mode=r'unctrl',waitfor=10,scroll_view='no')
        time.sleep(4)
        # Screenshot (image with the gap)
        self.__logger.debug('Flow:checkCode,StepNodeTag:1313475530186,Note:')
        imageTwo = iimg.capture_image(win_title=r'天津市市場主體',win_text=r'',in_img_path=r'C:/Users/Administrator/Desktop/',left_indent=823,top_indent=521,width=266,height=121,waitfor=30)
        # Custom function: compute the offset
        self.__logger.debug('Flow:checkCode,StepNodeTag:1313475530182,Note:')
        distance = GlobalFun.get_distance(imageOne,imageTwo)
        # Get the element position
        self.__logger.debug('Flow:checkCode,StepNodeTag:1313475530181,Note:')
        xy = iie.get_element_rect(win_title=r'天津市市場主體',url=r'http://credit.scjg.tj.gov.cn/gsxt*',selector=r'.gt_holder gt_popup gt_show > DIV:nth-of-type(2) > DIV:nth-of-type(2) > DIV:nth-of-type(2) > DIV:nth-of-type(2)',curson=r'center',waitfor=10)
        # Code block: target position = slider position shifted by the offset
        self.__logger.debug('Flow:checkCode,StepNodeTag:1313475530180,Note:')
        print(xy)
        lastxy=(xy[0]+distance,xy[1],xy[2],xy[3])
        print(lastxy)
        # Hand-tuned corrections for two positions that kept failing
        if(xy==(847.0, 682.0, 44, 44) and lastxy==(900.0, 682.0, 44, 44)):
            print('修正')
            lastxy=(895.0, 682.0, 44, 44)
        if(xy==(847.0, 682.0, 44, 44) and lastxy==(976.0, 682.0, 44, 44)):
            print('修正')
            lastxy=(868.0, 682.0, 44, 44)
        # Custom function: drag the slider
        self.__logger.debug('Flow:checkCode,StepNodeTag:1313475530185,Note:')
        GlobalFun.myDo_drag_to(win_title=r'天津市市場主體', srcpos=xy,distpos=lastxy)
        # Delete file
        self.__logger.debug('Flow:checkCode,StepNodeTag:1313475530187,Note:')
        ifile.del_file(file=imageOne)
        # Delete file
        self.__logger.debug('Flow:checkCode,StepNodeTag:1313475530188,Note:')
        ifile.del_file(file=imageTwo)
        # Image detection: is the "verification failed" picture on screen?
        self.__logger.debug('Flow:checkCode,StepNodeTag:13140204311151,Note:')
        time.sleep(3.5)
        existFlg = iimg.img_exists(win_title=r'天津市市場主體',img_res_path=self.path,image=r'snapshot_20190813135330024.png',fuzzy=True,confidence=0.85,waitfor=3)
        # IF branch
        self.__logger.debug('Flow:checkCode,StepNodeTag:13140549531176,Note:')
        if existFlg:
            # Message box
            self.__logger.debug('Flow:checkCode,StepNodeTag:13143951406201,Note:')
            ibox.msg_box(msg='驗證失敗,重試!',timeout=1.5)
            time.sleep(1)
            # Mouse click (refresh the captcha)
            self.__logger.debug('Flow:checkCode,StepNodeTag:13140738964184,Note:')
            iie.do_click_pos(win_title=r'天津市市場主體',url=r'http://credit.scjg.tj.gov.cn/gsxt*',selector=r'.gt_holder gt_popup gt_show > DIV:nth-of-type(2) > DIV:nth-of-type(2) > DIV:nth-of-type(1) > DIV:nth-of-type(3) > A:nth-of-type(1)',button=r'left',curson=r'center',times=1,run_mode=r'unctrl',waitfor=100,scroll_view='no')
            time.sleep(1.5)
            # Return
            self.__logger.debug('Flow:checkCode,StepNodeTag:13140556594179,Note:')
            return True
        else:
            # Return
            self.__logger.debug('Flow:checkCode,StepNodeTag:13140620186183,Note:')
            return False
        # Code block (unreachable: both branches above return)
        self.__logger.debug('Flow:checkCode,StepNodeTag:13141700326199,Note:')
        print(existFlg)
    # Process the table data
    def dealTableData(self,tableData=None):
        currentCom=None
        currentTableData=None
        currentComName=None
        # Code block
        self.__logger.debug('Flow:dealTableData,StepNodeTag:13161341316275,Note:')
        columns=tableData.columns
        realDataList=tableData.values.tolist()
        # Hotkey input (Ctrl+F4)
        self.__logger.debug('Flow:dealTableData,StepNodeTag:13161638010281,Note:')
        ikeyboard.key_send_cs(win_title=r'天津市市場主體',text='^{F4}',waitfor=10)
        # Hotkey input (Ctrl+F4)
        self.__logger.debug('Flow:dealTableData,StepNodeTag:13164935964357,Note:')
        ikeyboard.key_send_cs(text='^{F4}',waitfor=10)
        # Message box
        self.__logger.debug('Flow:dealTableData,StepNodeTag:13161731539284,Note:')
        ibox.msg_box(msg='開始處理二級公司數據',timeout=2)
        time.sleep(0.002)
        # For loop
        self.__logger.debug('Flow:dealTableData,StepNodeTag:13161910442289,Note:')
        for i in range(len(realDataList)):
            # Code block: pick the company to query from the current row
            self.__logger.debug('Flow:dealTableData,StepNodeTag:13162051185291,Note:')
            currentList=realDataList[i]
            if(columns[0]=='有限責任公司本年度是否有股權轉讓 '):
                currentCom=currentList[0]
                currentComName=currentList[0]
            if(columns[0]=='企業是否有股權信息或購買其它公司股權'):
                currentCom=currentList[0]
                currentComName=currentList[1]
            # Only companies whose name contains "天津" are queried
            if("天津" not in currentCom):
                continue
            time.sleep(1)
            # Sub-flow: finishCheckCode
            self.__logger.debug('Flow:dealTableData,StepNodeTag:13161503452279,Note:')
            self.finishCheckCode(comName=currentComName)
            # Sub-flow: goToDetail
            self.__logger.debug('Flow:dealTableData,StepNodeTag:13163507635351,Note:')
            currentTableData=self.goToDetail()
            # Code block
            # Note: DataFrame.drop returns a new frame; the results are discarded here,
            # so the rows written to Excel below still contain these columns.
            self.__logger.debug('Flow:dealTableData,StepNodeTag:13170553111368,Note:')
            currentTableData[0].drop(['變動後股權比例','股權變動日期'], axis=1)
            currentTableData[1].drop(['投資設立企業後購買股權企業名稱',r'統一社會信用代碼/註冊號'], axis=1)
            lastTableData0=currentTableData[0].values.tolist()
            lastTableData1=currentTableData[1].values.tolist()
            # Insert rows
            self.__logger.debug('Flow:dealTableData,StepNodeTag:13171433395371,Note:')
            iexcel.ins_row(path='C:/Users/Administrator/Desktop/testData.xlsx',data=lastTableData1)
            # Hotkey input (Ctrl+F4)
            self.__logger.debug('Flow:dealTableData,StepNodeTag:13165826952364,Note:')
            ikeyboard.key_send_cs(win_title=r'天津市市場主體',text='^{F4}',waitfor=10)
            # Hotkey input (Ctrl+F4)
            self.__logger.debug('Flow:dealTableData,StepNodeTag:13165850702366,Note:')
            ikeyboard.key_send_cs(text='^{F4}',waitfor=10)
    # Complete the verification
    def finishCheckCode(self,comName='911200006630613577'):
        # Open the site
        self.__logger.debug('Flow:finishCheckCode,StepNodeTag:1308451082917,Note:')
        iie.open_url(url=r'http://credit.scjg.tj.gov.cn/gsxt/')
        # Mouse click
        self.__logger.debug('Flow:finishCheckCode,StepNodeTag:1308451082912,Note:')
        iie.do_click_pos(win_title=r'天津市市場主體',url=r'http://credit.scjg.tj.gov.cn/gsxt/#',selector=r'http-equiv="x-ua-compatible":nth-of-type(1) > DIV:nth-of-type(3) > DIV:nth-of-type(2) > UL:nth-of-type(1) > LI:nth-of-type(2) > A:nth-of-type(1)',button=r'left',curson=r'center',times=1,run_mode=r'unctrl',waitfor=300,scroll_view='no')
        time.sleep(0.5)
        # Mouse click
        self.__logger.debug('Flow:finishCheckCode,StepNodeTag:1308451082913,Note:')
        iie.do_click_pos(win_title=r'天津市市場主體',url=r'http://credit.scjg.tj.gov.cn/gsxt/#',selector=r'http-equiv="x-ua-compatible":nth-of-type(1) > DIV:nth-of-type(3) > DIV:nth-of-type(2) > UL:nth-of-type(1) > LI:nth-of-type(2) > A:nth-of-type(1)',button=r'left',curson=r'center',offsetY=45,times=1,run_mode=r'unctrl',waitfor=10,scroll_view='no')
        time.sleep(2)
        # Set text (company name or tax number)
        self.__logger.debug('Flow:finishCheckCode,StepNodeTag:130845108293,Note:')
        iie.set_text(url=r'http://credit.scjg.tj.gov.cn/gsxt/',selector=r'#searchName',text=comName,waitfor=10)
        # Mouse click (search)
        self.__logger.debug('Flow:finishCheckCode,StepNodeTag:130845108292,Note:')
        iie.do_click_pos(win_title=r'天津市市場主體',url=r'http://credit.scjg.tj.gov.cn/gsxt/',selector=r'#entSearchLink',button=r'left',curson=r'center',times=1,run_mode=r'unctrl',waitfor=10,scroll_view='no')
        time.sleep(2.5)
        # While loop: keep trying until checkCode reports no failure
        self.__logger.debug('Flow:finishCheckCode,StepNodeTag:13135310584131,Note:')
        while True:
            # Sub-flow: checkCode
            self.__logger.debug('Flow:finishCheckCode,StepNodeTag:13140310308161,Note:')
            tvar13140310308161=self.checkCode()
            # IF branch
            self.__logger.debug('Flow:finishCheckCode,StepNodeTag:13135336032135,Note:')
            if tvar13140310308161:
                pass
            else:
                # Break
                self.__logger.debug('Flow:finishCheckCode,StepNodeTag:13135345176138,Note:')
                break
    # Get the shareholder companies' information
    def getChildCom(self):
        tableData2=None
        tableData1=None
        table2Columns=None
        table1Columns=None
        tableDatas=None
        # Sub-flow: goToDetail
        self.__logger.debug('Flow:getChildCom,StepNodeTag:13162921190325,Note:')
        tableDatas=self.goToDetail()
        # IF branch
        self.__logger.debug('Flow:getChildCom,StepNodeTag:13145633622210,Note:')
        if tableDatas[0].columns[1]=='否':
            pass
        else:
            # Code block
            self.__logger.debug('Flow:getChildCom,StepNodeTag:13163242189338,Note:')
            tableData1=tableDatas[0]
            # Sub-flow: dealTableData
            self.__logger.debug('Flow:getChildCom,StepNodeTag:13163127828333,Note:')
            self.dealTableData(tableData=tableData1)
        # IF branch
        self.__logger.debug('Flow:getChildCom,StepNodeTag:13145822725214,Note:')
        if tableDatas[1].columns[1]=='否':
            pass
        else:
            # Code block
            self.__logger.debug('Flow:getChildCom,StepNodeTag:13163302199339,Note:')
            tableData2=tableDatas[1]
            # Sub-flow: dealTableData
            self.__logger.debug('Flow:getChildCom,StepNodeTag:13163131389335,Note:')
            self.dealTableData(tableData=tableData2)
        # Message box
        self.__logger.debug('Flow:getChildCom,StepNodeTag:13151905124229,Note:')
        ibox.msg_box(msg='當前企業數據處理完畢,下一個。。',timeout=1.5)
        time.sleep(1.5)
    # Go to the detail page
    def goToDetail(self):
        table2Data=None
        table1Data=None
        # Mouse click
        self.__logger.debug('Flow:goToDetail,StepNodeTag:13162716929301,Note:')
        iie.do_click_pos(win_title=r'天津市市場主體',url=r'http://credit.scjg.tj.gov.cn/gsxt/',selector=r'#center_content > DIV:nth-of-type(1) > DIV:nth-of-type(2) > DIV:nth-of-type(2) > UL:nth-of-type(1) > LI:nth-of-type(1) > H1:nth-of-type(1) > A:nth-of-type(1)',button=r'left',curson=r'center',times=1,run_mode=r'unctrl',waitfor=20,scroll_view='no')
        time.sleep(1.5)
        # Mouse click
        self.__logger.debug('Flow:goToDetail,StepNodeTag:13162716929300,Note:')
        iie.do_click_pos(win_title=r'天津市市場主體',url=r'http://credit.scjg.tj.gov.cn/gsxt/',selector=r'#tabs > DIV:nth-of-type(1) > DIV:nth-of-type(3) > SPAN:nth-of-type(1)',button=r'left',curson=r'center',times=1,run_mode=r'unctrl',waitfor=30,scroll_view='no')
        # Mouse click
        self.__logger.debug('Flow:goToDetail,StepNodeTag:13162716929299,Note:')
        iie.do_click_pos(win_title=r'天津市市場主體',url=r'http://credit.scjg.tj.gov.cn/gsxt/',selector=r'#tableInfoDiv > DIV:nth-of-type(2) > TABLE:nth-of-type(1) > TBODY:nth-of-type(1) > TR:nth-of-type(3) > TD:nth-of-type(4) > A:nth-of-type(1)',button=r'left',curson=r'center',times=1,run_mode=r'unctrl',waitfor=30,scroll_view='no')
        # Custom function: equity transfer table
        self.__logger.debug('Flow:goToDetail,StepNodeTag:13162716929298,Note:股權轉讓')
        table1Data = GlobalFun.getTableData('年報詳情','#show_alter')
        # Custom function: equity purchase table
        self.__logger.debug('Flow:goToDetail,StepNodeTag:13162716929297,Note:是否有狗買')
        table2Data = GlobalFun.getTableData('年報詳情','#show_invest')
        # Return
        self.__logger.debug('Flow:goToDetail,StepNodeTag:13162803390322,Note:')
        return table1Data,table2Data
    def Main(self):
        # Sub-flow: finishCheckCode
        self.__logger.debug('Flow:Main,StepNodeTag:13165330947360,Note:')
        self.finishCheckCode(comName='911200006630613577')
        # Sub-flow: getChildCom
        self.__logger.debug('Flow:Main,StepNodeTag:13151828292226,Note:')
        self.getChildCom()
        # Message box
        self.__logger.debug('Flow:Main,StepNodeTag:13152147770235,Note:')
        ibox.msg_box(msg='所有數據處理完畢!')
if __name__ == '__main__':
    robot_no = ''
    proc_no = ''
    job_no = ''
    try:
        argv = sys.argv[1:]
        # Long options take a value, so they end with "=" and contain no spaces
        opts, args = getopt.getopt(argv,"hr:p:j:",["robot=","proc=","job="])
    except getopt.GetoptError:
        print ('robot.py -r <robot> -p <proc> -j <job>')
        # opts is undefined after a parse error, so stop here
        sys.exit(2)
    for opt, arg in opts:
        if opt == '-h':
            print ('robot.py -r <robot> -p <proc> -j <job>')
        elif opt in ("-r", "--robot"):
            robot_no = arg
        elif opt in ("-p", "--proc"):
            proc_no = arg
        elif opt in ("-j", "--job"):
            job_no = arg
    pro = getTjInfo(robot_no=robot_no,proc_no=proc_no,job_no=job_no)
    pro.Main()
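Here is the code of the global functions used. We bring in the PIL library to read the images and work on their pixels (see get_distance), the pyautogui library to operate on the browser page, here mainly to control the mouse drag (see myDo_drag_to), and the pandas library to extract table data from the page (see getTableData):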
# Build date: 2019-08-12 10:47:48
# coding=utf-8
from selenium.webdriver.common.action_chains import ActionChains
from selenium import webdriver
import time
from PIL import Image
import ubpa.ics as ics
import pyautogui
from ubpa import iwin
import math
import ubpa.iie as iie
import re
import pandas as pd

def getTableData(titleStr,selectorStr):
    table_string = iie.get_html(title=titleStr,selector=selectorStr,waitfor=30)
    tb_start = re.compile('')
    tb_end = re.compile('')
    last_str = tb_end.sub('', tb_start.sub('', table_string))
    # Uses pandas' read_html; note header=0, some tables have their header elsewhere
    data = pd.read_html(last_str, flavor="bs4", header=0)[0]
    print(data)
    print(data.columns)
    return data

def get_point_axis(axis_list,distpos,point):
    # Pick up to point-1 evenly spaced stops along the path, then the end point
    pos_val_list = []
    for i in range(1, 10000):
        if i >= point:
            break
        n = len(axis_list) * (i / (point + 1))
        pos_val = axis_list[int(n)]
        pos_val_list.append(pos_val)
    pos_val_list.append(distpos)
    return pos_val_list

def get_axis_list(srcpos=(0, 0), distpos=(0, 0)):
    # List the integer-step points on the straight line from srcpos to distpos
    pos_list = []
    x1 = srcpos[0]
    y1 = srcpos[1]
    x2 = distpos[0]
    y2 = distpos[1]
    if x1 == x2:
        # Vertical move
        if y1 > y2:
            for i in range(math.ceil(y2), int(y1) + 1):
                pos_list.append((x1, i))
            pos_list.reverse()
        elif y1 < y2:
            for i in range(math.ceil(y1), int(y2) + 1):
                pos_list.append((x1, i))
        else:
            pos_list = []
    else:
        if y1 == y2:
            # Horizontal move
            if x1 < x2:
                x1 = math.ceil(x1)
                x2 = int(x2)
                length = x2 - x1
                for i in range(0, length + 1):
                    pos_list.append((x1 + i, y2))
            elif x1 > x2:
                # Moving left: step from x1 down to x2
                x1 = int(x1)
                x2 = math.ceil(x2)
                length = x1 - x2
                for i in range(0, length + 1):
                    pos_list.append((x1 - i, y2))
        else:
            # Diagonal move: interpolate y linearly for every integer x
            if x1 < x2:
                for i in range(math.ceil(x1), int(x2) + 1):
                    if y1 < y2:
                        h = (i - x1) * (y2 - y1) / (x2 - x1)
                        pos_list.append((i, y1 + h))
                    else:
                        h = (i - x1) * (y1 - y2) / (x2 - x1)
                        pos_list.append((i, y1 - h))
            else:
                for i in range(math.ceil(x2), int(x1) + 1):
                    if y1 < y2:
                        h = (i - x2) * (y2 - y1) / (x1 - x2)
                        pos_list.append((i, y2 - h))
                    else:
                        h = (i - x2) * (y1 - y2) / (x1 - x2)
                        pos_list.append((i, y2 + h))
                pos_list.reverse()
    return pos_list

def myDo_drag_to(win_title=None, srcpos=(0,0), distpos=(0,0), point=0, stimes=1, model=pyautogui.easeInOutQuad, waitfor=10):
    '''
    Drag for the captcha.
    srcpos: start position, x and y are the first two items
    distpos: end position, x and y are the first two items
    point: number of stops along the way, default 0
    stimes: how long each move takes, default 1
    model: easing function; easeInQuad starts slow and ends fast, easeOutQuad starts fast and ends slow, easeInOutQuad is slow at both ends and fast in the middle, easeInBounce bounces at the end, easeInElastic keeps bouncing
    '''
    try:
        if win_title != None and win_title.strip() != '':
            # Activate the window if it is not active
            if not iwin.do_win_is_active(win_title):
                iwin.do_win_activate(win_title=win_title, waitfor=2)
        pyautogui.moveTo(srcpos[0], srcpos[1], 0.5)
        pyautogui.mouseDown(button='left', _pause=True)
        axis_list = get_axis_list(srcpos, distpos)
        if len(axis_list) > 0:
            pos_val_list = get_point_axis(axis_list, distpos, point)
            # print(pos_val_list)
            for index in pos_val_list:
                # Overshoot by 20 px, come back 5 px short, then settle on the target
                pyautogui.dragTo(float(index[0]+20), float(index[1]), stimes, model)
                time.sleep(0.5)
                pyautogui.dragTo(float(index[0]-5), float(index[1]), stimes, model)
                time.sleep(0.5)
                pyautogui.dragTo(float(index[0]), float(index[1]), stimes, model)
                time.sleep(0.5)
        pyautogui.mouseUp(button='left', _pause=True)
    except Exception as e:
        raise e
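
# Offset calculation for 2.0
def get_distance(imageOne,imageTwo):
    '''
    Get the distance the slider has to be moved.
    :param imageOne: screenshot file without the gap
    :param imageTwo: screenshot file with the gap
    :return: distance to move
    '''
    threshold=150
    left=60
    image1 = Image.open(imageOne)
    image2 = Image.open(imageTwo)
    # Load the pixel access objects once instead of once per pixel
    pixels1 = image1.load()
    pixels2 = image2.load()
    for i in range(left,image1.size[0]):
        for j in range(image1.size[1]):
            rgb1=pixels1[i,j]
            rgb2=pixels2[i,j]
            res1=abs(rgb1[0]-rgb2[0])
            res2=abs(rgb1[1]-rgb2[1])
            res3=abs(rgb1[2]-rgb2[2])
            if not (res1 < threshold and res2 < threshold and res3 < threshold):
                print(i-7)
                return i-7  # testing showed an error of about 7
    # No strongly differing pixel found: fall back to the last column scanned
    print(i-7)
    return i-7  # testing showed an error of about 7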
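The overshoot-and-settle trick in myDo_drag_to can be looked at without a mouse attached. The following is just a sketch of the idea, not the library's code: `ease_in_out_quad` is written out by hand with the usual quadratic ease-in-out formula, and `human_like_track` is a hypothetical helper that returns the x positions such a drag would pass through.

```python
def ease_in_out_quad(t):
    """Quadratic ease-in-out for t in [0, 1]: slow start, fast middle, slow end."""
    if t < 0.5:
        return 2 * t * t
    return -1 + (4 - 2 * t) * t


def human_like_track(start_x, end_x, steps=20, overshoot=20, backoff=5):
    """Return x positions: eased approach, overshoot, undershoot, then the target."""
    if steps < 1:
        raise ValueError("steps must be >= 1")
    span = end_x - start_x
    track = [start_x + span * ease_in_out_quad(i / steps) for i in range(1, steps + 1)]
    track.append(end_x + overshoot)  # go slightly past the gap
    track.append(end_x - backoff)    # come back a little too far
    track.append(end_x)              # settle on the target
    return track
```

The 20 and 5 pixel values mirror the ones hard-coded in myDo_drag_to; whether they keep working is something only testing against the live captcha can tell.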
The above is the code for the entire flow; I have pasted all of it here. The offset calculation for cracking the 3.0 captcha is as follows:
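# GeeTest 3.0 cracking method
def get_gap(image):
    """
    Get the offset of the gap.
    :param image: image with the gap
    :return: x offset of the gap
    """
    # left_list holds every x coordinate that meets the condition
    left_list = []
    # We only need the x coordinate of the gap, so there is no need to scan every y; a few evenly spaced rows are enough
    for i in [10 * i for i in range(1, image.size[1] // 11)]:
        # Start scanning x at image.size[0]/5.16, because the gap never lies within the first 50 px
        for j in range(int(image.size[0] / 5.16), image.size[0] - int(image.size[0] / 8.6)):
            if is_pixel_equal(image, j, i, left_list):
                break
    # In each (x, z), x is the left edge of the gap and z is the count: how many consecutive pixels from x on have R, G and B all below 150. Scanning several rows gives several values; the one whose z is closest to 40 fits best
    left_list = sorted(left_list, key=lambda x: abs(x[1]-40))
    # Take the x of the first element; subtract 7 or 14 from the result, 7 usually works
    return left_list[0][0] - 7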

def is_pixel_equal(image, x, y, left_list):
    """
    Check whether a dark run of plausible gap width starts at (x, y).
    :param image: image
    :param x: x position
    :param y: y position
    :param left_list: list that (x, count) is appended to on a match
    :return: True if a matching run was found
    """
    # Take the pixel at this position
    pixels = image.load()
    pixel1 = pixels[x, y]
    threshold = 150
    # count is the number of pixels to the right whose R, G and B are all below 150
    count = 0
    # If this pixel's R, G and B are all below 150, walk right and count the pixels that are too
    if pixel1[0] < threshold and pixel1[1] < threshold and pixel1[2] < threshold:
        for i in range(x + 1, image.size[0]):
            pixel = pixels[i, y]
            if pixel[0] < threshold and pixel[1] < threshold and pixel[2] < threshold:
                count += 1
            else:
                break
    if int(image.size[0]/8.6) < count < int(image.size[0]/4.3):
        left_list.append((x, count))
        return True
    else:
        return False
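To see why "the run whose length is closest to the expected width wins" is a sensible rule, here is a simplified variant on plain lists instead of a PIL image. It is not the same algorithm as get_gap (it collects every maximal dark run rather than stopping at the first plausible one per row), and the names `is_dark`, `dark_runs` and `guess_gap_x` are made up for this sketch.

```python
def is_dark(pixel, threshold=150):
    """A pixel is dark when R, G and B are all below the threshold."""
    return all(channel < threshold for channel in pixel[:3])


def dark_runs(row, threshold=150):
    """Return (start_x, length) for every maximal run of dark pixels in one row."""
    runs = []
    start = None
    for x, pixel in enumerate(row):
        if is_dark(pixel, threshold):
            if start is None:
                start = x
        elif start is not None:
            runs.append((start, x - start))
            start = None
    if start is not None:
        runs.append((start, len(row) - start))
    return runs


def guess_gap_x(rows, expected_width, threshold=150):
    """Pick the run whose length is closest to expected_width; None if no run exists."""
    candidates = [run for row in rows for run in dark_runs(row, threshold)]
    if not candidates:
        return None
    return min(candidates, key=lambda run: abs(run[1] - expected_width))[0]
```

Returning None when nothing matches is a deliberate choice here; the caller can then refresh the captcha instead of failing on an empty list.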
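The code is clearly commented; read it calmly and it is easy to follow. There is also a handy method for processing tables on a page. It already appears in the code above; here it is again: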
def getTableData(titleStr,selectorStr):
    table_string = iie.get_html(title=titleStr,selector=selectorStr,waitfor=30)
    tb_start = re.compile('')
    tb_end = re.compile('')
    last_str = tb_end.sub('', tb_start.sub('', table_string))
    # Uses pandas' read_html; note header=0, some tables have their header elsewhere
    data = pd.read_html(last_str, flavor="bs4", header=0)[0]
    print(data)
    print(data.columns)
    return data
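titleStr is the browser title; the page is matched as long as its title contains the argument passed in. selectorStr is a CSS selector string. CSS selectors are supported natively by the designer and matter a lot for scraping in general, so look them up if they are new to you. iie is a component from their own Python library that can read content directly from a page that is already open. Pass this method the location of a table on the page and it turns the table into a DataFrame. I have to say, pandas really is handy!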
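If you want to see what that table-to-rows conversion amounts to without pandas or the RPA library installed, the sketch below does a bare-bones version with only the standard library's `html.parser`. It is a swapped-in illustration, not what `pd.read_html` does internally; `TableRows` and `table_to_records` are names I made up, and it assumes a simple table whose first row is the header.

```python
from html.parser import HTMLParser


class TableRows(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr>."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def _close_cell(self):
        if self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None

    def handle_endtag(self, tag):
        if tag in ("td", "th"):
            self._close_cell()
        elif tag == "tr" and self._row is not None:
            self._close_cell()  # tolerate a cell that was never closed
            self.rows.append(self._row)
            self._row = None


def table_to_records(html):
    """Return one dict per body row, keyed by the first row's cell texts."""
    parser = TableRows()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]
```

For real pages, merged cells and nested tables make this much messier, which is exactly why handing the HTML to pandas is the more comfortable route.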
When it runs, the captcha solver retries on its own whenever an attempt fails.
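The flow above retries in a `while True` loop, so it never gives up. If you would rather cap the number of attempts, a bounded version could look like this sketch. `solve_with_retry` and `attempt` are hypothetical names, and note the convention is flipped compared with checkCode: here `attempt()` returns True on success.

```python
def solve_with_retry(attempt, max_tries=5):
    """Call attempt() until it succeeds or max_tries is used up.

    attempt is any zero-argument callable returning True on success.
    Returns the number of tries it took, or None if every try failed.
    """
    for tries in range(1, max_tries + 1):
        if attempt():
            return tries
    return None
```

A None result is the signal to log the company and move on to the next one instead of hanging on a captcha that will not pass.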