Goodies time: lowbrow, sure, but exactly the stuff everyone loves.
To hook up a web interface later, I need a structure that can serve status queries and push updates at any time. For now the logic doesn't have to be complicated, and performance isn't a big concern; as long as that structure is in place, everything else can be revised later.
要實現結構就要選取異步io操做的方式,這個問題讓我老頭疼了,本人用python編程的時間也不長,知道的路數很少,multiprocessing在windows下是用不了的,gevent安裝又太麻煩,須要多少第三方的東西,還須要編譯器,仍是硬着頭皮選擇用thread。內心有個架構,能夠保證前臺的流暢,但實現比較複雜,確定暫時不會考慮用在這個例子上,不過前面說了,只要結構作出來,之後若是有性能上的需求,放到linux上,到時候再修改也不遲。 linux
Newer releases of Python 3 ship the concurrent.futures package, which wraps up the Future-based async plumbing quite conveniently, and its documentation even includes a ready-made multithreaded download example.
They really do anticipate what people want; they know what you all like ;-)
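For reference, here is a condensed sketch along the lines of that documentation example (the URLs are just placeholders):

# a minimal ThreadPoolExecutor sketch; URLs are placeholders
import concurrent.futures
import urllib.request

URLS = ['http://games.qq.com/', 'http://www.python.org/']

def load_url(url, timeout):
    # fetch one URL and return its body; runs inside a worker thread
    conn = urllib.request.urlopen(url, timeout=timeout)
    return conn.read()

with concurrent.futures.ThreadPoolExecutor(max_workers=4) as executor:
    # submit() returns a Future immediately; as_completed() yields them as they finish
    future_to_url = {executor.submit(load_url, url, 10): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
            print('{} returned {} bytes'.format(url, len(data)))
        except Exception as e:
            print('{} failed: {}'.format(url, e))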
Each page gets a task id: page_x
Each picture gets a task id: pic_x
Each page also carries the list of its pictures, plus the page's original address.
Each picture carries these fields (sketched in code right after this list):
state: 0 = not started, 1 = queued for download, 2 = finished
id of the owning page
bytes downloaded
total bytes
if both byte counts are 0, the download has not started;
if the total is set, the download is in progress, and the downloaded bytes show how far along it is
original picture address
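In code, that bookkeeping is just two dicts of plain records. Here is a sketch with made-up values; the snapshot helper at the end is my own hypothetical addition, showing how the table could answer the web interface's status queries as JSON:

import json

pageDict = {}
picDict = {}

# a page record: the page's original address plus the picture URLs found on it
pageDict['page_1'] = {
    'url': 'http://games.qq.com/l/photo/example.htm',  # made-up address
    'links': ['http://img1.example.com/a.jpg'],
}

# a picture record; state: 0 = not started, 1 = queued, 2 = finished
picDict['pic_1'] = {
    'url': 'http://img1.example.com/a.jpg',  # original picture address
    'pageID': 'page_1',                      # id of the owning page
    'progress': 0,                           # bytes downloaded so far
    'total': 0,                              # total bytes; both 0 means not started
    'state': 0,
}

# hypothetical helper: the whole table serializes straight to JSON,
# which is what a status query from the web side would want
def snapshot():
    return json.dumps(picDict)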
After some experimenting I found a way to download the pictures; the code is below. It is written in blocking style: get the whole flow working first, then convert it to non-blocking in the next step. Even so, this version is already usable.
# -*- coding: utf8 -*-
import concurrent.futures
import urllib.request
import re
import json
import ast
import os

# map from a URL to the task id assigned to it, so nothing is fetched twice
visitedURL = {}

maxpageID = [0]
pageDict = {}

def createPageID(url):
    pID = maxpageID.pop() + 1
    maxpageID.append(pID)
    pID = 'page_' + str(pID)
    visitedURL[url] = pID
    return pID

maxpicID = [0]
picDict = {}

def createPicID(url):
    pID = maxpicID.pop() + 1
    maxpicID.append(pID)
    pID = 'pic_' + str(pID)
    visitedURL[url] = pID
    return pID

stoppedQueue = []
waitingQueue = []
downloadingQueue = []
savedDict = dict()

# for page downloading (reserved for the non-blocking version; unused here)
pageTpe = concurrent.futures.ThreadPoolExecutor(max_workers=8)
# for picture downloading (reserved for the non-blocking version; unused here)
picTpe = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def runMachine():
    # meant to keep up to 4 downloads in flight; with the blocking
    # processload below they actually run one at a time
    while waitingQueue:
        if len(downloadingQueue) < 4:
            picID = waitingQueue.pop(0)
            processload(picID)

def processload(picID):
    downloadingQueue.append(picID)
    # open a connection and download one picture in 4 KB chunks
    picInfo = picDict[picID]
    url = picInfo['url']
    filename = url.split('/')[-1]
    conn = urllib.request.urlopen(url, timeout=10)
    picInfo['total'] = int(conn.info().get('Content-Length').strip())
    os.makedirs('pics', exist_ok=True)  # make sure the output directory exists
    outputfile = open('pics/' + filename, 'wb')
    picInfo['progress'] = 0
    updateStatus(picInfo)
    while True:
        chunk = conn.read(4096)
        picInfo['progress'] += len(chunk)
        updateStatus(picInfo)
        if not chunk:  # empty chunk means the body is exhausted
            picInfo['state'] = 2
            downloadingQueue.remove(picID)
            savedDict[picID] = True
            updateStatus(picInfo)
            outputfile.close()
            conn.close()
            break
        outputfile.write(chunk)

# report progress to the console for now; later this is where the
# push to the web interface would go
def updateStatus(picInfo):
    url = picInfo['url']
    if picInfo['state'] == 2:
        print(url, 'finished!')
    elif picInfo['total'] and picInfo['progress']:
        print('{} progress: {:.2%}'.format(url, picInfo['progress'] / picInfo['total']))

def log(*args):
    f = open('t.txt', 'ba')
    f.write((','.join(map(str, args)) + '\n').encode('utf-8'))
    f.close()

def load_pic(url, pageID):
    if url in visitedURL:
        return
    picID = createPicID(url)
    # state: 0 = not started, 1 = queued for download, 2 = finished
    picDict[picID] = {'url': url, 'pageID': pageID, 'total': 0, 'progress': 0, 'state': 1}
    waitingQueue.append(picID)

def load_page(url):
    if url in visitedURL:
        return
    pID = createPageID(url)
    pageDict[pID] = {'url': url, 'links': None}
    conn = urllib.request.urlopen(url)
    text = conn.read().decode('GBK')  # the site serves GBK-encoded pages
    conn.close()
    try:
        # cut out the news-list block that holds the gallery links
        startIndex = text.index('<div class="mod newslist clear">')
        endIndex = text.index('<div class="mod curPosition clear">', startIndex)
        text = text[startIndex:endIndex]
        patt = re.compile(r'href="([^"]+?)\.htm"><img', re.DOTALL | re.IGNORECASE)
        # each gallery page has a companion .hdBigPic.js file with the real picture URLs
        jsurls = [x + '.hdBigPic.js' for x in patt.findall(text)]
        pageurllist = []
        for jsurl in jsurls:
            if jsurl in visitedURL:
                continue
            jsID = createPageID(jsurl)
            pageDict[jsID] = {'url': jsurl, 'links': None}
            jslinks = []
            try:
                conn = urllib.request.urlopen(jsurl)
            except BaseException as e:
                print('failed')
                continue
            try:
                text = conn.read().decode('GBK')
                # drop the trailing tracking comment, then parse the object literal
                text = text[:text.index('/* |xGv00|')]
                obj = ast.literal_eval(text)
                picnum = int(obj['Children'][0]['Children'][0]['Children'][0]['Content'])
                picsobj = obj['Children'][0]['Children'][1]['Children']
                for x in picsobj:
                    picurl = x['Children'][2]['Children'][0]['Content']
                    jslinks.append(picurl)
                if jslinks:
                    pageDict[jsID]['links'] = jslinks
                    print(jsurl, '{} pics'.format(len(jslinks)))
                    try:
                        title = obj['Children'][0]['Children'][8]['Children'][0]['Content']
                    except:
                        title = 'unknown'
                    pageurllist.append(jsurl)
                    for picurl in jslinks:
                        load_pic(picurl, jsID)
            except BaseException as e:
                print(jsurl, 'failed')
                raise e
        pageDict[pID]['links'] = pageurllist
    except ValueError as e:
        print('error', e)  # the expected markers were not found on the page

urls = ['http://games.qq.com/l/photo/gmcos/yxcos.htm']
load_page(urls[0])   # collect picture tasks first...
runMachine()         # ...then drain the download queue
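As for the promised non-blocking step: the two thread pools (pageTpe, picTpe) are created above but never actually used in this version. The smallest change would be to have runMachine hand each task to picTpe instead of calling processload inline. A minimal sketch of that idea, not a finished version; note that the shared queues and dicts would then need locking to be safe under concurrent workers:

def runMachine():
    # submit() returns immediately with a Future; the pool's max_workers=4
    # already caps how many pictures download at once
    futures = [picTpe.submit(processload, picID) for picID in waitingQueue]
    waitingQueue.clear()
    # as_completed() yields futures as they finish; result() re-raises
    # any exception the worker hit
    for future in concurrent.futures.as_completed(futures):
        future.result()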