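[Web Scraping] python+selenium+tesseract

Introduction

Some web-scraping notes from recent work: mainly automated screenshots with python+selenium, plus automatic CAPTCHA recognition with tesseract (in practice, tesseract's accuracy is not great).

Prerequisites

1. Install a Python environment; instructions are easy to find online.

2. Install selenium from the command line: pip install selenium

3. Install pytesseract the same way: pip install pytesseract

4. Install chromedriver.exe. Tutorial: https://blog.csdn.net/wwwq2386466490/article/details/81513888

5. Install tesseract.exe. Tutorial: https://www.cnblogs.com/VseYoung/p/code.html Configuring pytesseract: https://blog.csdn.net/u010134642/article/details/78747630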
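
Step 5 only works if pytesseract can find the tesseract executable. Below is a minimal sketch of one way to wire that up, assuming tesseract is either on PATH or at a location you pass in yourself. `find_tesseract` and `configure_pytesseract` are hypothetical helper names and do not come from the linked tutorials; `pytesseract.pytesseract.tesseract_cmd` is the setting pytesseract reads.

```python
import os
import shutil


def find_tesseract(candidates=(), name="tesseract"):
    """Return a path to the tesseract executable, or None if not found."""
    # Prefer whatever is already on PATH.
    on_path = shutil.which(name)
    if on_path:
        return on_path
    # Otherwise fall back to explicit candidate paths supplied by the caller.
    for path in candidates:
        if os.path.isfile(path):
            return path
    return None


def configure_pytesseract(candidates=()):
    """Point pytesseract at the executable; raise if it cannot be found."""
    # Imported lazily so find_tesseract() works without pytesseract installed.
    import pytesseract

    exe = find_tesseract(candidates)
    if exe is None:
        raise FileNotFoundError("tesseract executable not found")
    pytesseract.pytesseract.tesseract_cmd = exe
    return exe
```

Calling `configure_pytesseract([...])` once at the top of a script, with your own install path in the list, fails early with a clear error instead of failing later inside the OCR call.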

That was a lot... Now for the actual work.

python+selenium basic operations

Steps in the code below

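python+selenium launches the browser, opens Baidu Maps at https://map.baidu.com/ , and maximizes the window. It then types the keyword "廣州塔" (Canton Tower) into the search box, clicks the search button, and finally saves a screenshot to the given path. (At this point I was reminded of "貪玩藍月"...)
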
# -*- coding:utf-8 -*-
# Note: executable_path and find_element_by_id are the Selenium 3 API;
# Selenium 4 uses Service(...) and find_element(By.ID, ...) instead.
from selenium import webdriver
from time import sleep
import time

# Path to the chromedriver.exe installed in the previous step
chrome_driver = 'C:/Users/zero/AppData/Local/Google/Chrome/Application/chromedriver.exe'


# Format the current time as a timestamp string
def time_format():
    current_time = time.strftime('%Y%m%d%H%M%S', time.localtime(time.time()))
    return current_time


driver = webdriver.Chrome(executable_path=chrome_driver)
driver.get('https://map.baidu.com/')
driver.maximize_window()

elem = driver.find_element_by_id("sole-input")  # locate the search box by its id
elem.send_keys("廣州塔")
elem = driver.find_element_by_id("search-button")  # locate the search button by its id
elem.click()
sleep(3)

# Take a screenshot of the browser window
driver.get_screenshot_as_file("E:/crawl/" + time_format() + ".png")
sleep(2)
driver.quit()
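
The script above relies on fixed sleep() calls and on Selenium 3 style calls. As a hedged sketch of the same steps, assuming Selenium 4 is installed, the version below waits explicitly for the search box and button and builds the screenshot path with pathlib. The element ids come from the script above; the function names, the 10-second timeout, and the remaining fixed pause after the click (no stable selector for the results is known here) are illustrative choices.

```python
import time
from datetime import datetime
from pathlib import Path


def screenshot_path(out_dir, now=None):
    """Build '<out_dir>/<YYYYmmddHHMMSS>.png', mirroring time_format() above."""
    now = now or datetime.now()
    return Path(out_dir) / (now.strftime("%Y%m%d%H%M%S") + ".png")


def search_and_capture(driver_path, keyword, out_dir, timeout=10):
    """Search Baidu Maps for `keyword` and save a screenshot of the window."""
    # Imported here so screenshot_path() stays usable without selenium.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    driver = webdriver.Chrome(service=Service(driver_path))
    try:
        driver.get("https://map.baidu.com/")
        driver.maximize_window()
        wait = WebDriverWait(driver, timeout)
        # Wait until the search box exists instead of sleeping blindly.
        box = wait.until(EC.presence_of_element_located((By.ID, "sole-input")))
        box.send_keys(keyword)
        wait.until(EC.element_to_be_clickable((By.ID, "search-button"))).click()
        # Fixed pause kept from the original script while results render.
        time.sleep(3)
        target = screenshot_path(out_dir)
        target.parent.mkdir(parents=True, exist_ok=True)
        driver.save_screenshot(str(target))
        return target
    finally:
        # Always close the browser, even if a wait times out.
        driver.quit()
```

The try/finally block matters in practice: when a wait times out, the browser process is still shut down rather than left running.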

python+tesseract operations

CAPTCHA recognition with tesseract is fairly inaccurate, but since I have used it, I will introduce it anyway.
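
Because recognition is unreliable, it can help to sanity-check the OCR output before submitting it. The sketch below assumes the CAPTCHA is 4 alphanumeric characters, which is only an assumption since the post does not state the format. `clean_captcha_text` and `looks_valid` are hypothetical helpers, not part of pytesseract.

```python
import re


def clean_captcha_text(raw):
    """Drop whitespace and any character that is not a letter or digit."""
    return re.sub(r"[^0-9A-Za-z]", "", raw)


def looks_valid(text, expected_length=4):
    """Cheap plausibility check so obviously bad reads can be retried."""
    return len(text) == expected_length


# Example with a typical noisy OCR result.
raw = " a7 K-9\n"
text = clean_captcha_text(raw)
print(text, looks_valid(text))  # a7K9 True
```

If the check fails, refreshing the CAPTCHA and retrying is usually cheaper than submitting a wrong guess. pytesseract.image_to_string also accepts a config string, for example '--psm 7' to treat the image as a single line of text, which may help on one-line CAPTCHAs.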

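Overall flow:

1. Request Baidu's password-recovery page.
2. Locate the img node that holds the CAPTCHA and crop the CAPTCHA out of a page screenshot.
3. Preprocess the image (grayscale conversion, contrast enhancement, and so on), then have tesseract return the recognized CAPTCHA text.
4. Use webdriver to locate the relevant page elements, fill in the input fields, and click the button.
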
# coding:utf-8
# Note: executable_path and the find_element_by_* calls are the Selenium 3 API.
from selenium import webdriver
from time import sleep
from PIL import Image
from PIL import ImageEnhance
import pytesseract

chrome_driver = 'C:/Users/zero/AppData/Local/Google/Chrome/Application/chromedriver.exe'
driver = webdriver.Chrome(executable_path=chrome_driver)
url = "https://passport.baidu.com/?getpassindex"
driver.get(url)
driver.maximize_window()
driver.save_screenshot(r"E:\crawl\aa.png")  # screenshot the page that contains the CAPTCHA

imgelement = driver.find_element_by_xpath(".//*[@id='forgotsel']/div/div[3]/img")
# imgelement = driver.find_element_by_id("code")  # alternative way to locate the CAPTCHA
location = imgelement.location  # x, y coordinates of the CAPTCHA
size = imgelement.size  # width and height of the CAPTCHA
# Box to crop out of the screenshot: (left, upper, right, lower)
coderange = (int(location['x']), int(location['y']),
             int(location['x'] + size['width']),
             int(location['y'] + size['height']))

i = Image.open(r"E:\crawl\aa.png")  # open the screenshot
frame4 = i.crop(coderange)  # crop the CAPTCHA region with Image.crop
frame4.save(r"E:\crawl\frame4.png")

i2 = Image.open(r"E:\crawl\frame4.png")
# Convert to grayscale. PIL modes include 1, L, P, RGB, RGBA, CMYK, YCbCr, I, F; L is grayscale.
imgry = i2.convert('L')
contrast = ImageEnhance.Contrast(imgry)  # contrast enhancement
i3 = contrast.enhance(3.0)  # 3.0 is the contrast factor
i3.save(r"E:\crawl\image_code.png")
i4 = Image.open(r"E:\crawl\image_code.png")
text = pytesseract.image_to_string(i4).strip()  # run OCR on the enhanced image
print(text)

elem = driver.find_element_by_id("account")
elem.send_keys("13652878889")
elem = driver.find_element_by_id("veritycode")
elem.send_keys(text)
sleep(2)
elem = driver.find_element_by_id("submit")
elem.click()
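
The crop step above assumes that one CSS pixel equals one screenshot pixel. On displays with scaling this may not hold, and the crop can land in the wrong place. Here is a sketch of the coordinate calculation with an explicit scale factor. `crop_box` is a hypothetical helper, and the scale values are only examples.

```python
def crop_box(location, size, scale=1.0):
    """Turn a WebElement's location/size dicts into a PIL crop box.

    Returns (left, upper, right, lower) in screenshot pixels.
    """
    left = int(location["x"] * scale)
    upper = int(location["y"] * scale)
    right = int((location["x"] + size["width"]) * scale)
    lower = int((location["y"] + size["height"]) * scale)
    return (left, upper, right, lower)


# Values shaped like WebElement.location and WebElement.size.
location = {"x": 100, "y": 50}
size = {"width": 80, "height": 30}
print(crop_box(location, size))             # (100, 50, 180, 80)
print(crop_box(location, size, scale=2.0))  # (200, 100, 360, 160)
```

In the script, one possible source for the factor is driver.execute_script("return window.devicePixelRatio"), with the resulting box passed to Image.crop. Selenium's WebElement.screenshot(filename) is another option that avoids manual cropping altogether.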

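Summary

1. Life is short, I use python.

2. python+chrome on mobile can free up your hands in the same way.

3. When you have just finished coding a page with many input fields, you can fill them in once by script and never do it by hand again; perhaps this is what automated testing is...

4. If you like gaming, grinding mobs and the like, this may be worth a look.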

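Finally

If you are interested in Java or big data, please follow the WeChat official account 【愛編碼】; I will do my best to bring you something of value and will keep posting new articles. If this helped you even a little, a like or a share would be appreciated.

This article was shared from the WeChat official account 愛編碼 (ilovecode).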