Building a Girl-Pic Scraping Platform — You All Love This Stuff, Right? (Part 3): Implementing the Scraping Flow

Eye candy: lowbrow, and everyone loves to see it.

To hook up a web front end later, we need a structure that can serve queries and push status updates at any time. For now the logic doesn't need to be complicated, and performance isn't a big concern; as long as this structure is in place, it can always be reworked later.

To build that structure we have to pick an approach to asynchronous I/O, and this gave me a real headache. I haven't been programming in Python for long and don't know many tricks: multiprocessing wasn't working for me on Windows, and gevent is a pain to install — it pulls in a bunch of third-party packages and needs a compiler — so I gritted my teeth and went with threads. I have an architecture in mind that would keep the front end responsive, but it's complex to implement and definitely won't be used in this example for now. As I said, once the structure is done, if performance ever becomes a requirement the whole thing can move to Linux and be reworked then.

Newer Python 3 releases ship a futures package, concurrent.futures, which wraps up the async plumbing quite conveniently — and the docs even include a ready-made multi-threaded download example:

http://docs.python.org/3.3/library/concurrent.futures.html?highlight=future#concurrent.futures.ThreadPoolExecutor
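Condensed from that docs page, the basic pattern looks like this (the URLs here are just placeholders):

import concurrent.futures
import urllib.request

URLS = ['http://www.example.com/',
        'http://www.example.org/']

def load_url(url, timeout):
    # fetch one page; runs on a worker thread
    with urllib.request.urlopen(url, timeout=timeout) as conn:
        return conn.read()

with concurrent.futures.ThreadPoolExecutor(max_workers=5) as executor:
    future_to_url = {executor.submit(load_url, url, 60): url for url in URLS}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            data = future.result()
        except Exception as exc:
            print('%r generated an exception: %s' % (url, exc))
        else:
            print('%r page is %d bytes' % (url, len(data)))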

They really do think of everything — they know what you all like ;-)

Every page gets a task ID: page_x.

Every picture gets a task ID: pic_x.

In addition, every page keeps a list of its pictures, plus the page's original URL.

In addition, every picture carries the following data fields (see the sketch after this list):

State: 0 = not started, 1 = queued for download, 2 = download complete

ID of the page it belongs to

Bytes downloaded

Total length in bytes

If both byte counts are 0, the picture hasn't been downloaded yet; if the total length is set, the download is in progress, and the downloaded byte count is its progress.

The picture's original URL
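Before the real code, a minimal sketch of what these records look like as plain dicts — the field names match the full code below, but the example values are made up:

# one page record: its original URL plus the list of links found on it
pageDict['page_1'] = {
    'url': 'http://example.com/gallery.htm',  # page's original URL (made up)
    'links': None,                            # filled in once the page is parsed
}

# one picture record
picDict['pic_1'] = {
    'url': 'http://example.com/1.jpg',  # picture's original URL (made up)
    'pageID': 'page_1',                 # ID of the page it belongs to
    'state': 1,                         # 0 = not started, 1 = queued, 2 = done
    'progress': 0,                      # bytes downloaded so far
    'total': 0,                         # total bytes; both counts 0 = not downloaded
}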

After some experimentation I found a download approach that works. The code below is a blocking implementation — get the whole flow working first, then switch it to non-blocking in the next step. In fact, this version is already usable.

# -*- coding: utf-8 -*-
import concurrent.futures
import urllib.request
import re
import os
import ast

os.makedirs('pics', exist_ok=True)  # downloads are written into pics/


# maps every URL we have seen to its task ID, so nothing is fetched twice
visitedURL = {}

# counter (a one-element list, so it can be mutated in place) and registry for pages
maxpageID = [0]
pageDict = {}

def createPageID(url):
    pID = maxpageID.pop() + 1
    maxpageID.append(pID)
    pID = 'page_' + str(pID)
    visitedURL[url] = pID
    return pID


# counter and registry for pictures
maxpicID = [0]
picDict = {}

def createPicID(url):
    pID = maxpicID.pop() + 1
    maxpicID.append(pID)
    pID = 'pic_' + str(pID)
    visitedURL[url] = pID
    return pID



stoppedQueue = []      # reserved for paused/stopped tasks
waitingQueue = []      # picture IDs queued for download
downloadingQueue = []  # picture IDs currently downloading
savedDict = dict()     # picture IDs that finished and were saved

# thread pools for the future non-blocking version;
# the blocking code below does not submit to them yet
# for page downloading
pageTpe = concurrent.futures.ThreadPoolExecutor(max_workers=8)
# for picture downloading
picTpe = concurrent.futures.ThreadPoolExecutor(max_workers=4)


def runMachine():
    # blocking version: drain the waiting queue one picture at a time;
    # the max-4 check only matters once this goes non-blocking
    while waitingQueue:
        if len(downloadingQueue) < 4:
            picID = waitingQueue.pop(0)
            try:
                processload(picID)
            except Exception as e:
                # one failed download must not abort the whole run
                print(picDict[picID]['url'], 'failed:', e)
                if picID in downloadingQueue:
                    downloadingQueue.remove(picID)

def processload(picID):
    downloadingQueue.append(picID)
    # open a connection and stream one picture to disk
    picInfo = picDict[picID]
    url = picInfo['url']
    filename = url.split('/')[-1]
    conn = urllib.request.urlopen(url, timeout=10)
    picInfo['total'] = int(conn.info().get('Content-Length', '0').strip())
    outputfile = open('pics/' + filename, 'wb')
    picInfo['progress'] = 0
    updateStatus(picInfo)
    while True:
        chunk = conn.read(4096)
        picInfo['progress'] += len(chunk)
        updateStatus(picInfo)
        if not chunk:
            # an empty read means the body is exhausted: mark done and clean up
            picInfo['state'] = 2
            downloadingQueue.remove(picID)
            savedDict[picID] = True
            updateStatus(picInfo)
            outputfile.close()
            conn.close()
            break
        outputfile.write(chunk)

def updateStatus(picInfo):
    url = picInfo['url']
    if picInfo['state'] == 2:
        print(url, 'finished!')
    elif picInfo['total'] and picInfo['progress']:
        print('{} progress: {:.2%}'.format(url, picInfo['progress'] / picInfo['total']))

def log(*args):
    f = open('t.txt', 'ab')
    f.write((','.join(map(str, args)) + '\n').encode('utf-8'))
    f.close()

def load_pic(url, pageID):
    if url in visitedURL:
        return
    picID = createPicID(url)
    # state: 0 = not started, 1 = queued for download, 2 = download complete
    picDict[picID] = {'url': url, 'pageID': pageID, 'total': 0, 'progress': 0, 'state': 1}
    waitingQueue.append(picID)

def load_page(url):
    if url in visitedURL:
        return
    pID = createPageID(url)
    pageDict[pID] = {'url': url, 'links': None}
    conn = urllib.request.urlopen(url)
    text = conn.read().decode('GBK')  # the site serves GBK-encoded pages
    conn.close()
    try:
        # cut out the news-list block that holds the gallery links
        startIndex = text.index('<div class="mod newslist clear">')
        endIndex = text.index('<div class="mod curPosition clear">', startIndex)
        text = text[startIndex:endIndex]
        patt = re.compile(r'href="([^"]+?)\.htm"><img', re.DOTALL | re.IGNORECASE)
        # each gallery page has a companion .hdBigPic.js data file
        jsurls = [x + '.hdBigPic.js' for x in patt.findall(text)]
        pageurllist = []
        for jsurl in jsurls:
            if jsurl in visitedURL:
                continue
            jsID = createPageID(jsurl)
            pageDict[jsID] = {'url': jsurl, 'links': None}
            jslinks = []
            try:
                conn = urllib.request.urlopen(jsurl)
            except Exception:
                print(jsurl, 'failed')
                continue
            try:
                text = conn.read().decode('GBK')
                conn.close()
                # strip the trailing junk, leaving a Python-parsable literal
                text = text[:text.index('/*  |xGv00|')]
                obj = ast.literal_eval(text)
                # picture count declared by the file (not otherwise used)
                picnum = int(obj['Children'][0]['Children'][0]['Children'][0]['Content'])
                picsobj = obj['Children'][0]['Children'][1]['Children']
                for x in picsobj:
                    picurl = x['Children'][2]['Children'][0]['Content']
                    jslinks.append(picurl)
                if jslinks:
                    pageDict[jsID]['links'] = jslinks
                    print(jsurl, '{} pics'.format(len(jslinks)))
                try:
                    title = obj['Children'][0]['Children'][8]['Children'][0]['Content']
                except Exception:
                    title = 'unknown'
                pageurllist.append(jsurl)
                for picurl in jslinks:
                    load_pic(picurl, jsID)
            except Exception as e:
                print(jsurl, 'failed')
                raise e
        pageDict[pID]['links'] = pageurllist

    except ValueError as e:
        # index() raises ValueError when the expected markers are missing
        print('error', e)
    runMachine()
	
urls = ['http://games.qq.com/l/photo/gmcos/yxcos.htm']

load_page(urls[0])
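Looking ahead to the non-blocking step: one option is to submit each picture to the picTpe pool created above instead of calling processload directly. A minimal sketch — the name runMachineAsync is mine, and the shared queues and dicts would need a lock that this sketch omits:

def runMachineAsync():
    # hand every queued picture to the thread pool, then collect results
    futures = {}
    while waitingQueue:
        picID = waitingQueue.pop(0)
        futures[picTpe.submit(processload, picID)] = picID
    for future in concurrent.futures.as_completed(futures):
        picID = futures[future]
        try:
            future.result()
        except Exception as e:
            print(picDict[picID]['url'], 'failed:', e)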