Scraping Sunshine Procurement Platform data with Python via simulated user login

Original content; please credit the source when reprinting: http://www.javashuo.com/article/p-hlalvdso-g.html

At the beginning of each month, the Sunshine Procurement Platform publishes that month's prices. This article simulates a user logging into the platform, saves the required data to a CSV file and a database, and sends it to designated recipients. As a Python beginner I ran into quite a few pitfalls, so I'm recording them here.

Environment: Python 2.7
Development tool: PyCharm
Runtime environment: CentOS 7
Scheduling: a cron job runs this Python script in the early hours of the 1st of each month
Features: logs into the system automatically using the account, password, and an OCR-decoded captcha; parses the required data and saves it to a CSV file and a MySQL database; sends the CSV file to the designated recipients when scraping finishes. Automatically re-logs in and resumes after a dropped request.

Setting up the development environment:

There are plenty of tutorials online, so I won't repeat them here. After installation, the following libraries are needed (csv and smtplib ship with Python; the others are installed via pip):

bs4 (HTML page parsing)

csv (saving CSV files)

smtplib (sending email)

mysql.connector (connecting to the database)

Some of the required downloads are shared on my network drive, including leptonica-1.72.tar.gz, Tesseract3.04.00.tar.gz, and the language packs:

Link: https://pan.baidu.com/s/1J4SZDgmn6DpuQ1EHxE6zkw
Extraction code: crbl

Image recognition:

There are many tutorials online for this too; below are the steps I consolidated for installing the OCR libraries on CentOS 7.

  • Since we are building from source, the corresponding build tools must be installed first:

yum install gcc gcc-c++ make

yum install autoconf automake libtool

  • Install the image-format support libraries; without them, later Tesseract commands will fail:

yum install libjpeg-devel libpng-devel libtiff-devel zlib-devel

  • Install leptonica, a library Tesseract requires. Download it, copy it to the server, unpack, and build:

Download: http://www.leptonica.org/

# run in the leptonica source directory

./configure

make

make install

  • Download the matching Tesseract release

Download: https://link.jianshu.com/?t=https://github.com/tesseract-ocr/tesseract/wiki/Downloads

# run in the tesseract-3.04.00 directory

./autogen.sh

./configure

make

make install

ldconfig

  • Download the language packs

Download: https://github.com/tesseract-ocr/tessdata

Place the downloaded files in the tessdata directory.

  • Environment configuration

Copy tessdata: cp -R tessdata /usr/local/share

Modify the environment variable:

Open the profile: vi /etc/profile

Add this line: export TESSDATA_PREFIX=/usr/local/share/tessdata

Apply it: source /etc/profile
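Cron jobs do not read /etc/profile by default, so if the scheduled script cannot see the variable, it can also be set from the Python script itself before Tesseract is first invoked. A minimal sketch, assuming the tessdata path used above:

```python
import os

# Point Tesseract at the language data copied earlier; setdefault keeps
# any value already exported by the shell environment.
os.environ.setdefault("TESSDATA_PREFIX", "/usr/local/share/tessdata")
```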

  • Testing

tesseract -v prints Tesseract's version information; if there is no error, the installation succeeded.

Put a sample image image.png in the current directory and run: tesseract image.png 123

This generates a file 123.txt in the current directory containing the recognized text.

  • Install the pytesseract library

This library is used to call Tesseract from Python code.

Command: pip install pytesseract

Test code:

import pytesseract
from PIL import Image

im1 = Image.open('image.png')
print(pytesseract.image_to_string(im1))

The code:

The data I want to scrape looks like this (screenshot not reproduced here):

First get the total number of pages, then visit each page in a loop, saving each page's data to the CSV file and the database. If visiting some page throws an exception, record that broken page number, log in again, and resume scraping from the broken page.
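The resume-from-broken-page flow just described can be sketched as a pair of nested loops. Note that crawl_all_pages, fetch_page, and login below are hypothetical stand-ins for illustration, not functions from the actual code:

```python
def crawl_all_pages(total_pages, fetch_page, login, max_logins=50):
    """Fetch pages 1..total_pages, re-logging in and resuming from the
    page that raised an exception (the "broken" page)."""
    results = {}
    page = 1
    logins = 0
    while page <= total_pages and logins < max_logins:
        login()          # (re-)establish the session
        logins += 1
        try:
            while page <= total_pages:
                results[page] = fetch_page(page)
                page += 1  # only advance after a successful fetch
        except Exception:
            # leave `page` at the broken page; the outer loop re-logs in
            continue
    return results
```

The inner loop only advances the page counter after a successful fetch, so when an exception escapes it, the outer loop logs in again and retries the same page.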

 

I wrote a gl.py to hold the global variables:

#!/usr/bin/python
# -*- coding: utf-8 -*-
import time

timeStr = time.strftime('%Y%m%d', time.localtime(time.time()))
monthStr = time.strftime('%m', time.localtime(time.time()))
yearStr = time.strftime('%Y', time.localtime(time.time()))
LOG_FILE = "log/" + timeStr + '.log'
csvFileName = "csv/" + timeStr + ".csv"
fileName = timeStr + ".csv"
fmt = '%(asctime)s - %(filename)s:%(lineno)s  - %(message)s'
loginUrl = "http://yourpath/Login.aspx"
productUrl = 'http://yourpath/aaa.aspx'
username = 'aaaa'
password = "aaa"
preCodeurl = "yourpath"
host = "yourip"
user = "aaa"
passwd = "aaa"
db = "mysql"
charset = "utf8"
postData = {
    '__VIEWSTATE': '',
    '__EVENTTARGET': '',
    '__EVENTARGUMENT': '',
    'btnLogin': "登陸",
    'txtUserId': 'aaaa',
    'txtUserPwd': 'aaa',
    'txtCode': '',
    'hfip': 'yourip'
}
tdd = {
    '__VIEWSTATE': '',
    '__EVENTTARGET': 'ctl00$ContentPlaceHolder1$AspNetPager1',
    'ctl00$ContentPlaceHolder1$AspNetPager1_input': '1',
    'ctl00$ContentPlaceHolder1$AspNetPager1_pagesize': '50',
    'ctl00$ContentPlaceHolder1$txtYear': '',
    'ctl00$ContentPlaceHolder1$txtMonth': '',
    '__EVENTARGUMENT': '',
}
vs = {
    '__VIEWSTATE': ''
}

The main code sets up logging, the CSV writer, the database connection, and cookie handling:

handler = logging.handlers.RotatingFileHandler(gl.LOG_FILE, maxBytes=1024 * 1024, backupCount=5)
formatter = logging.Formatter(gl.fmt)
handler.setFormatter(formatter)
logger = logging.getLogger('tst')
logger.addHandler(handler)
logger.setLevel(logging.DEBUG)
csvFile = codecs.open(gl.csvFileName, 'w+', 'utf_8_sig')
writer = csv.writer(csvFile)
conn = mysql.connector.connect(host=gl.host, user=gl.user, passwd=gl.passwd, db=gl.db, charset=gl.charset)
cursor = conn.cursor()

cookiejar = cookielib.MozillaCookieJar()
cookieSupport = urllib2.HTTPCookieProcessor(cookiejar)
httpsHandLer = urllib2.HTTPSHandler(debuglevel=0)
opener = urllib2.build_opener(cookieSupport, httpsHandLer)
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11')]
urllib2.install_opener(opener)

The login method:

First the captcha is recognized and converted to digits, then the password, username, and captcha are submitted to the login endpoint. This can fail because the captcha is sometimes recognized incorrectly; on failure, a new captcha is fetched, recognized, and the login is retried until it succeeds.

def get_logined_Data(opener, logger, views):
    print "get_logined_Data"
    indexCount = 1
    retData = None
    while indexCount <= 15:
        print "begin login ", str(indexCount), " time"
        logger.info("begin login " + str(indexCount) + " time")
        vrifycodeUrl = gl.preCodeurl + str(random.random())
        text = get_image(vrifycodeUrl)  # helper that fetches the captcha URL and returns the recognized digits
        postData = gl.postData
        postData["txtCode"] = text
        postData["__VIEWSTATE"] = views

        data = urllib.urlencode(postData)
        try:
            headers22 = {
                'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
                'Accept-Encoding': 'gzip, deflate, br',
                'Accept-Language': 'zh-CN,zh;q=0.9',
                'Connection': 'keep-alive',
                'Content-Type': 'application/x-www-form-urlencoded',
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'
            }
            request = urllib2.Request(gl.loginUrl, data, headers22)
            opener.open(request)
        except Exception as e:
            print "catch Exception when login"
            print e

        request = urllib2.Request(gl.productUrl)
        response = opener.open(request)
        dataPage = response.read().decode('utf-8')

        bsObj = BeautifulSoup(dataPage, 'html.parser')
        tabcontent = bsObj.find(id="tabcontent")  # the page only contains the tabcontent element after a successful login, so use it to detect success
        if (tabcontent is not None):
            print "login successfully"
            logger.info("login successfully")
            retData = bsObj
            break
        else:
            print "enter failed,try again"
            logger.info("enter failed,try again")
            time.sleep(3)
            indexCount += 1
    return retData

Analyzing the pages shows that every data request must carry the '__VIEWSTATE' parameter, which is stored in the page itself, so '__VIEWSTATE' has to be extracted from each response and included in the parameters when requesting the next page.
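The extraction itself is a short BeautifulSoup lookup; a minimal sketch (extract_viewstate is an illustrative name, not a function from the actual code):

```python
from bs4 import BeautifulSoup

def extract_viewstate(html):
    """Pull the __VIEWSTATE hidden field out of an ASP.NET page so it can
    be posted back with the next request; returns None if it is missing."""
    tag = BeautifulSoup(html, "html.parser").find(id="__VIEWSTATE")
    return tag["value"] if tag is not None else None
```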

 

Captcha parsing:

The captcha is downloaded from its URL and saved locally. Because it is in color, it must first be converted to grayscale before calling the OCR to turn it into digits. The captcha is four digits, but the OCR sometimes returns letters, so commonly confused letters are mapped back to digits by hand; after that the recognition rate is acceptable.

# Get the digits for a captcha; only a result of exactly 4 digits is valid
def get_image(codeurl):
    print(time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time())) + " begin get code num")
    index = 1
    while index <= 15:
        file = urllib2.urlopen(codeurl).read()
        im = cStringIO.StringIO(file)
        img = Image.open(im)
        imgName = "vrifycode/" + gl.timeStr + "_" + str(index) + ".png"
        print 'begin get vrifycode'
        text = convert_image(img, imgName)
        print "vrifycode", index, ":", text
        # logger.info('vrifycode' + str(index) + ":" + text)

        if (len(text) != 4 or text.isdigit() == False):  # a captcha result that is not 4 digits is certainly wrong
            print 'vrifycode:', index, ' is wrong'
            index += 1
            time.sleep(2)
            continue
        return text

# Convert the captcha image to digits
def convert_image(image, impName):
    print "enter convert_image"
    image = image.convert('L')  # grayscale
    image2 = Image.new('L', image.size, 255)
    for x in range(image.size[0]):
        for y in range(image.size[1]):
            pix = image.getpixel((x, y))
            if pix < 90:  # pixels with grayscale below 90 are set to 0 (black)
                image2.putpixel((x, y), 0)
    print "begin save"
    image2.save(impName)  # save the thresholded image to inspect the result
    print "begin convert"
    text = pytesseract.image_to_string(image2)
    print "end convert"
    snum = ""
    for j in text:  # simple character fix-ups
        if (j == 'Z'):
            snum += "2"
        elif (j == 'T'):
            snum += "7"
        elif (j == 'b'):
            snum += "5"
        elif (j == 's'):
            snum += "8"
        elif (j == 'S'):
            snum += "8"
        elif (j == 'O'):
            snum += "0"
        elif (j == 'o'):
            snum += "0"
        else:
            snum += j
    return snum
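The if/elif ladder at the end of convert_image can be collapsed into a lookup table, which is easier to extend as new misreads are discovered. A sketch (OCR_FIXES and normalize_code are illustrative names):

```python
# Characters the OCR commonly returns for these digit captchas, mapped
# back to the intended digit; any other character passes through as-is.
OCR_FIXES = {'Z': '2', 'T': '7', 'b': '5', 's': '8',
             'S': '8', 'O': '0', 'o': '0'}

def normalize_code(text):
    return ''.join(OCR_FIXES.get(ch, ch) for ch in text)
```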

Data conversion:

Convert the HTML data into an array, used when saving to the CSV file and the database.

def paras_data(nameList, logger):
    data = []
    mainlist = nameList
    rows = mainlist.findAll("tr", {"class": {"row", "alter"}})
    try:
        if (len(rows) != 0):
            for name in rows:
                tds = name.findAll("td")
                if tds == None:
                    print "get tds is null"
                    logger.info("get tds is null")
                else:
                    item = []
                    for index in range(len(tds)):
                        s_span = (tds[index]).find("span")
                        if (s_span is not None):
                            tmp = s_span["title"]
                        else:
                            tmp = (tds[index]).get_text()
                        item.append(tmp.encode('utf-8'))
                    item.append(datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'))  # time this row was scraped
                    data.append(tuple(item))

    except Exception as e:
        print "catch exception when save csv", e
        logger.info("catch exception when save csv" + e.message)
    return data

Saving the CSV file:

def save_to_csv(data ,writer):
    for d in data:
        if d is not None:
            writer.writerow(d)

Saving to the database:

def save_to_mysql(data, conn, cursor):
    try:
        cursor.executemany(
            "INSERT INTO `aaa`(aaa,bbb) VALUES (%s,%s)",
            data)
        conn.commit()
    except Exception as e:
        print "catch exception when save to mysql", e

Fetching a given page's data:

def get_appointed_page(snum, opener, vs, logger):
    tdd = get_tdd()
    tdd["__VIEWSTATE"] = vs['__VIEWSTATE']
    tdd["__EVENTARGUMENT"] = snum
    tdd = urllib.urlencode(tdd)
    op = opener.open(gl.productUrl, tdd)
    if (op.getcode() != 200):
        print("the " + str(snum) + " page ,state not 200,try connect again")
        return None
    data = op.read().decode('utf-8', 'ignore')
    bsObj = BeautifulSoup(data, "lxml")
    nameList = bsObj.find("table", {"class": "mainlist"})
    if nameList is None or len(nameList) == 0:
        return None
    viewState = bsObj.find(id="__VIEWSTATE")
    if viewState is None:
        logger.info("the other page,no viewstate,try connect again")
        print("the other page,no viewstate,try connect again")
        return None
    vs['__VIEWSTATE'] = viewState["value"]
    return nameList

The main method:

while flag == True and logintime < 50:
    try:
        print "global login the ", str(logintime), " times"
        logger.info("global login the " + str(logintime) + " times")
        bsObj = get_logined_Data(opener, logger, views)
        if bsObj is None:
            print "try login 15 times,but failed,exit"
            logger.info("try login 15 times,but failed,exit")
            exit()
        else:
            print "global login the ", str(logintime), " times successfully!"
            logger.info("global login the " + str(logintime) + " times successfully!")
            viewState_Source = bsObj.find(id="__VIEWSTATE")
            if totalNum == -1:
                totalNum = get_totalNum(bsObj)
                print "totalNum:", str(totalNum)
                logger.info("totalnum:" + str(totalNum))
            vs = gl.vs
            if viewState_Source != None:
                vs['__VIEWSTATE'] = viewState_Source["value"]

            # fetch the pages one by one, starting from snum
            while snum <= totalNum:
                print "begin get the ", str(snum), " page"
                logger.info("begin get the " + str(snum) + " page")
                nameList = get_appointed_page(snum, opener, vs, logger)
                if nameList is None:
                    print "get the nameList failed,connect again"
                    logger.info("get the nameList failed,connect again")
                    raise Exception
                else:
                    print "get the ", str(snum), " successfully"
                    logger.info("get the " + str(snum) + " successfully")

                    mydata = paras_data(nameList, logger)
                    # save to the CSV file
                    save_to_csv(mydata, writer)
                    # save to the database
                    save_to_mysql(mydata, conn, cursor)

                    snum += 1
                    time.sleep(3)

        flag = False
    except Exception as e:
        logintime += 1
        print "catch exception", e
        logger.error("catch exception" + e.message)

Setting up the cron job:

cd /var/spool/cron/

crontab -e  # edit the scheduled tasks

Enter: 1 1 1 * * /yourpath/normal_script.sh>>/yourpath/cronlog.log  2>&1

(This entry runs normal_script.sh at 01:01 on the 1st of each month and appends its output to cronlog.log.)

Directory structure (screenshot not reproduced here):

Source download: helloworld.zip
