Original content; please credit the source when reposting: http://www.javashuo.com/article/p-hlalvdso-g.html
The Sunshine Procurement platform publishes the current month's prices at the beginning of every month. This post simulates a user logging in to the platform, saves the required data to a CSV file and a database, and sends it to the designated recipients. I am a Python beginner and ran into quite a few pitfalls, so I am recording them here.
Environment | Python 2.7
Development tool | PyCharm
Runtime environment | CentOS 7
Schedule | A cron job runs this Python script at 1:00 a.m. on the 1st of every month
Features | Logs in to the system automatically using the account, the password, and the decoded captcha; parses the required data and saves it to a CSV file and a MySQL database; when crawling finishes, sends the CSV file to the designated recipients. Automatic re-login is supported after a dropped connection.
Development environment setup:
There are plenty of tutorials online, so I will not repeat them. Once Python is installed, the following libraries are used (bs4 and mysql.connector need to be installed separately; csv and smtplib ship with Python):
bs4 (HTML page parsing)
csv (for writing CSV files)
smtplib (for sending e-mail)
mysql.connector (for connecting to the MySQL database)
Some of the required downloads are shared on my netdisk, including leptonica-1.72.tar.gz, Tesseract3.04.00.tar.gz, and the language packs:
Link: https://pan.baidu.com/s/1J4SZDgmn6DpuQ1EHxE6zkw
Extraction code: crbl
Image recognition:
There are many tutorials online for this as well; below are the steps I put together that install the image recognition libraries correctly on CentOS 7.
yum install gcc gcc-c++ make
yum install autoconf automake libtool
yum install libjpeg-devel libpng-devel libtiff-devel zlib-devel
Leptonica download: http://www.leptonica.org/
# run in the leptonica source directory
./configure
make
make install
Tesseract download: https://link.jianshu.com/?t=https://github.com/tesseract-ocr/tesseract/wiki/Downloads
# run in the tesseract-3.04.00 directory
./autogen.sh
./configure
make
make install
ldconfig
Language pack download: https://github.com/tesseract-ocr/tessdata
Put the downloaded files in a directory named tessdata.
Copy tessdata into place: cp -R tessdata /usr/local/share
Set the environment variable:
Open the config file: vi /etc/profile
Add the line: export TESSDATA_PREFIX=/usr/local/share/tessdata
Make it take effect: source /etc/profile
Run tesseract -v to show Tesseract's version information; if no error is reported, the installation succeeded.
Find an image named image.png, put it in the current directory, and run: tesseract image.png 123
A file 123.txt is generated in the current directory containing the recognized text.
pytesseract is the library used to call Tesseract from Python code.
Install it with: pip install pytesseract
Test code:
import pytesseract
from PIL import Image

im1 = Image.open('image.png')
print(pytesseract.image_to_string(im1))
Code:
The data I want to fetch looks like this:
First get the total number of pages, then loop over every page, saving each page's data to the CSV file and the database. If an exception is thrown while fetching some page, record that broken page number, log in again, and resume crawling from the broken page (the Main method at the end implements this loop).
I wrote a gl.py to hold the global variables:
#!/usr/bin/python
# -*- coding: utf-8 -*-
import time

timeStr = time.strftime('%Y%m%d', time.localtime(time.time()))
monthStr = time.strftime('%m', time.localtime(time.time()))
yearStr = time.strftime('%Y', time.localtime(time.time()))
LOG_FILE = "log/" + timeStr + '.log'
csvFileName = "csv/" + timeStr + ".csv"
fileName = timeStr + ".csv"
fmt = '%(asctime)s - %(filename)s:%(lineno)s - %(message)s'
loginUrl = "http://yourpath/Login.aspx"
productUrl = 'http://yourpath/aaa.aspx'
username = 'aaaa'
password = "aaa"
preCodeurl = "yourpath"
host = "yourip"
user = "aaa"
passwd = "aaa"
db = "mysql"
charset = "utf8"
postData = {
    '__VIEWSTATE': '',
    '__EVENTTARGET': '',
    '__EVENTARGUMENT': '',
    'btnLogin': "登陸",
    'txtUserId': 'aaaa',
    'txtUserPwd': 'aaa',
    'txtCode': '',
    'hfip': 'yourip'
}
tdd = {
    '__VIEWSTATE': '',
    '__EVENTTARGET': 'ctl00$ContentPlaceHolder1$AspNetPager1',
    'ctl00$ContentPlaceHolder1$AspNetPager1_input': '1',
    'ctl00$ContentPlaceHolder1$AspNetPager1_pagesize': '50',
    'ctl00$ContentPlaceHolder1$txtYear': '',
    'ctl00$ContentPlaceHolder1$txtMonth': '',
    '__EVENTARGUMENT': '',
}
vs = {
    '__VIEWSTATE': ''
}
The main script sets up logging, the CSV writer, the database connection, and the cookie jar (the imports for the rest of the snippets are shown at the top):
# imports needed by the snippets below
import codecs, csv, cookielib, random, time, datetime, cStringIO
import urllib, urllib2
import logging, logging.handlers
import mysql.connector
import pytesseract
from PIL import Image
from bs4 import BeautifulSoup
import gl

handler = logging.handlers.RotatingFileHandler(gl.LOG_FILE, maxBytes=1024 * 1024, backupCount=5)
formatter = logging.Formatter(gl.fmt)
handler.setFormatter(formatter)
logger = logging.getLogger('tst')
logger.addHandler(handler)
logger.setLevel(logging.DEBUG)
csvFile = codecs.open(gl.csvFileName, 'w+', 'utf_8_sig')
writer = csv.writer(csvFile)
conn = mysql.connector.connect(host=gl.host, user=gl.user, passwd=gl.passwd, db=gl.db, charset=gl.charset)
cursor = conn.cursor()

cookiejar = cookielib.MozillaCookieJar()
cookieSupport = urllib2.HTTPCookieProcessor(cookiejar)
httpsHandLer = urllib2.HTTPSHandler(debuglevel=0)
opener = urllib2.build_opener(cookieSupport, httpsHandLer)
opener.addheaders = [('User-Agent', 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.11 (KHTML, like Gecko) Chrome/23.0.1271.64 Safari/537.11')]
urllib2.install_opener(opener)
The login method:
First the captcha is recognized and converted to digits, then the password, user name, and captcha are posted to the login URL. This can fail because the captcha is not always recognized correctly, so if the login fails, a new captcha is fetched, recognized, and submitted again, until the login succeeds.
def get_logined_Data(opener, logger, views):
    print "get_logined_Data"
    indexCount = 1
    retData = None
    while indexCount <= 15:
        print "begin login ", str(indexCount), " time"
        logger.info("begin login " + str(indexCount) + " time")
        vrifycodeUrl = gl.preCodeurl + str(random.random())
        text = get_image(vrifycodeUrl)  # helper that downloads the captcha at this URL and returns the recognized digits
        postData = gl.postData
        postData["txtCode"] = text
        postData["__VIEWSTATE"] = views

        data = urllib.urlencode(postData)
        try:
            headers22 = {
                'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
                'Accept-Encoding': 'gzip, deflate, br',
                'Accept-Language': 'zh-CN,zh;q=0.9',
                'Connection': 'keep-alive',
                'Content-Type': 'application/x-www-form-urlencoded',
                'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/74.0.3729.131 Safari/537.36'
            }
            request = urllib2.Request(gl.loginUrl, data, headers22)
            opener.open(request)
        except Exception as e:
            print "catch Exception when login"
            print e

        request = urllib2.Request(gl.productUrl)
        response = opener.open(request)
        dataPage = response.read().decode('utf-8')

        bsObj = BeautifulSoup(dataPage, 'html.parser')
        tabcontent = bsObj.find(id="tabcontent")  # the page only contains the tabcontent element after a successful login, so use it to decide whether login succeeded
        if (tabcontent is not None):
            print "login successfully"
            logger.info("login successfully")
            retData = bsObj
            break
        else:
            print "enter failed,try again"
            logger.info("enter failed,try again")
            time.sleep(3)
            indexCount += 1
    return retData
Inspecting the page shows that every data request must carry the '__VIEWSTATE' parameter. Its value is stored in the page itself, so '__VIEWSTATE' has to be extracted from each response and sent along with the request for the next page.
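A minimal, self-contained sketch of that round trip is shown below; the pageHtml string here is a made-up example, while in the real script it is the HTML that was just downloaded:

from bs4 import BeautifulSoup

# hypothetical example page; in the real script this is the response just read
pageHtml = '<form><input type="hidden" id="__VIEWSTATE" value="abc123" /></form>'

bsObj = BeautifulSoup(pageHtml, 'html.parser')
viewState = bsObj.find(id="__VIEWSTATE")  # the hidden field ASP.NET renders on every page
if viewState is not None:
    nextPost = {'__VIEWSTATE': viewState["value"]}  # carry the value into the next POST
    print(nextPost)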
Captcha parsing:
The captcha is downloaded from its URL and saved locally. Because the captcha is in color, it first has to be converted to grayscale before it is passed to the image recognition step and turned into digits. The captcha is four digits, but the recognizer sometimes returns letters, so letters that look like digits are converted manually; after that the recognition rate is acceptable.
# Get the digits for the captcha; only a 4-digit result is considered valid
def get_image(codeurl):
    print(time.strftime('%Y-%m-%d %H:%M:%S', time.localtime(time.time())) + " begin get code num")
    index = 1
    while index <= 15:
        file = urllib2.urlopen(codeurl).read()
        im = cStringIO.StringIO(file)
        img = Image.open(im)
        imgName = "vrifycode/" + gl.timeStr + "_" + str(index) + ".png"
        print 'begin get vrifycode'
        text = convert_image(img, imgName)
        print "vrifycode", index, ":", text
        # logger.info('vrifycode' + str(index) + ":" + text)

        if (len(text) != 4 or text.isdigit() == False):  # if the result is not 4 digits it must be wrong
            print 'vrifycode:', index, ' is wrong'
            index += 1
            time.sleep(2)
            continue
        return text

# Convert the captcha image to digits
def convert_image(image, impName):
    print "enter convert_image"
    image = image.convert('L')  # grayscale
    image2 = Image.new('L', image.size, 255)
    for x in range(image.size[0]):
        for y in range(image.size[1]):
            pix = image.getpixel((x, y))
            if pix < 90:  # pixels with a gray value below 90 are set to 0 (black)
                image2.putpixel((x, y), 0)
    print "begin save"
    image2.save(impName)  # save the binarized image to check the result
    print "begin convert"
    text = pytesseract.image_to_string(image2)
    print "end convert"
    snum = ""
    for j in text:  # simple correction of letters that are commonly misread digits
        if (j == 'Z'):
            snum += "2"
        elif (j == 'T'):
            snum += "7"
        elif (j == 'b'):
            snum += "5"
        elif (j == 's'):
            snum += "8"
        elif (j == 'S'):
            snum += "8"
        elif (j == 'O'):
            snum += "0"
        elif (j == 'o'):
            snum += "0"
        else:
            snum += j
    return snum
Data conversion:
Convert the HTML data into an array that can be written to the CSV file and the database.
def paras_data(nameList, logger):
    data = []
    mainlist = nameList
    rows = mainlist.findAll("tr", {"class": {"row", "alter"}})
    try:
        if (len(rows) != 0):
            for name in rows:
                tds = name.findAll("td")
                if tds == None:
                    print "get tds is null"
                    logger.info("get tds is null")
                else:
                    item = []
                    for index in range(len(tds)):
                        s_span = (tds[index]).find("span")
                        if (s_span is not None):
                            tmp = s_span["title"]
                        else:
                            tmp = (tds[index]).get_text()
                        item.append(tmp.encode('utf-8'))
                    item.append(datetime.datetime.now().strftime('%Y-%m-%d %H:%M:%S'))  # timestamp when this row was fetched
                    data.append(tuple(item))

    except Exception as e:
        print "catch exception when save csv", e
        logger.info("catch exception when save csv" + e.message)
    return data
Saving the CSV file:
def save_to_csv(data, writer):
    for d in data:
        if d is not None:
            writer.writerow(d)
Saving to the database:
def save_to_mysql(data, conn, cursor):
    try:
        cursor.executemany(
            "INSERT INTO `aaa`(aaa,bbb) VALUES (%s,%s)",
            data)
        conn.commit()

    except Exception as e:
        print "catch exception when save to mysql", e
    else:
        pass
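For reference, executemany expects a sequence of tuples whose length matches the %s placeholders; the table and column names `aaa`/`bbb` are the post's placeholders. A hypothetical call, reusing the conn and cursor created earlier, would look like this:

# hypothetical two-column rows matching the placeholder INSERT statement above
data = [("price row 1", "2019-06-01 01:00:00"),
        ("price row 2", "2019-06-01 01:00:00")]
save_to_mysql(data, conn, cursor)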
Fetching the data of a specified page:
def get_appointed_page(snum, opener, vs, logger):
    tdd = get_tdd()  # get_tdd() (not shown in the post) returns the form data template from gl.tdd
    tdd["__VIEWSTATE"] = vs['__VIEWSTATE']
    tdd["__EVENTARGUMENT"] = snum
    tdd = urllib.urlencode(tdd)
    # print "tdd", tdd
    op = opener.open(gl.productUrl, tdd)
    if (op.getcode() != 200):
        print("the " + str(snum) + " page, state not 200, try connect again")
        return None
    data = op.read().decode('utf-8', 'ignore')
    # print "data", data
    bsObj = BeautifulSoup(data, "lxml")
    nameList = bsObj.find("table", {"class": "mainlist"})
    # print "nameList", nameList
    if nameList is None:
        return None
    viewState = bsObj.find(id="__VIEWSTATE")
    if viewState is None:
        logger.info("the other page,no viewstate,try connect again")
        print("the other page,no viewstate,try connect again")
        return None
    vs['__VIEWSTATE'] = viewState["value"]
    return nameList
The Main method:
# Assumed initial values (this part is not shown in the post); views holds the
# __VIEWSTATE value read from the login page beforehand.
flag = True        # set to False once every page has been crawled
logintime = 1      # count of global login attempts
snum = 1           # page number to crawl next (also the resume point)
totalNum = -1      # total number of pages, determined after the first login

while flag == True and logintime < 50:
    try:
        print "global login the ", str(logintime), " times"
        logger.info("global login the " + str(logintime) + " times")
        bsObj = get_logined_Data(opener, logger, views)
        if bsObj is None:
            print "try login 15 times,but failed,exit"
            logger.info("try login 15 times,but failed,exit")
            exit()
        else:
            print "global login the ", str(logintime), " times successfully!"
            logger.info("global login the " + str(logintime) + " times successfully!")
            viewState_Source = bsObj.find(id="__VIEWSTATE")
            if totalNum == -1:
                totalNum = get_totalNum(bsObj)  # parses the total page count from the first page
            print "totalNum:", str(totalNum)
            logger.info("totalnum:" + str(totalNum))
            vs = gl.vs
            if viewState_Source != None:
                vs['__VIEWSTATE'] = viewState_Source["value"]

            # fetch the pages one by one, starting from snum
            while snum <= totalNum:
                print "begin get the ", str(snum), " page"
                logger.info("begin get the " + str(snum) + " page")
                nameList = get_appointed_page(snum, opener, vs, logger)
                if nameList is None:
                    print "get the nameList failed,connect again"
                    logger.info("get the nameList failed,connect again")
                    raise Exception
                else:
                    print "get the ", str(snum), " page successfully"
                    logger.info("get the " + str(snum) + " page successfully")

                mydata = paras_data(nameList, logger)
                # save to the CSV file
                save_to_csv(mydata, writer)
                # save to the database
                save_to_mysql(mydata, conn, cursor)

                snum += 1
                time.sleep(3)

            flag = False
    except Exception as e:
        logintime += 1
        print "catch exception", e
        logger.error("catch exception" + e.message)
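The feature list at the top mentions mailing the finished CSV file with smtplib, but that part of the code is not shown in this post. Below is a minimal sketch of how it could look; the SMTP host, sender account, password, and recipient are placeholders, not values from the real project:

# Hypothetical mail settings; replace with your own SMTP server and accounts.
import smtplib
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText
from email.mime.application import MIMEApplication

def send_csv_by_mail(csvPath, fileName):
    msg = MIMEMultipart()
    msg['Subject'] = 'Monthly price data'
    msg['From'] = 'sender@example.com'
    msg['To'] = 'receiver@example.com'
    msg.attach(MIMEText('The crawled data for this month is attached.', 'plain', 'utf-8'))
    with open(csvPath, 'rb') as f:
        part = MIMEApplication(f.read())  # attach the CSV file
        part.add_header('Content-Disposition', 'attachment', filename=fileName)
        msg.attach(part)
    server = smtplib.SMTP('smtp.example.com', 25)
    server.login('sender@example.com', 'password')
    server.sendmail(msg['From'], [msg['To']], msg.as_string())
    server.quit()

# after the crawl finishes:
# send_csv_by_mail(gl.csvFileName, gl.fileName)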
Setting up the scheduled task:
cd /var/spool/cron/
crontab -e  # edit the cron table
Add the line: 1 1 1 * * /yourpath/normal_script.sh >> /yourpath/cronlog.log 2>&1
(This entry runs normal_script.sh at 01:01 on the 1st of every month and appends its output to cronlog.log.)
Directory structure:
Source download: helloworld.zip