Multithreaded web page screenshots on Linux with Python

Date: 2019-11-08

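In the previous post (Multithreaded web page screenshots on Linux with shell), I used multiple shell processes to take screenshots of a large number of websites, which cut the total time considerably. Over time, though, a drawback of that approach became clear: every process is assigned the same number of sites, yet some processes get through their share quickly and others slowly. A few processes needed more than 10 additional hours to finish their assigned sites after all the other processes had already completed.
How can the current "equal shares" scheme be changed into one where the faster workers do more?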
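
To make the difference concrete, here is a small sketch that compares the two schemes on made-up numbers. The per-site durations and both function names are hypothetical and only for illustration. With equal shares, the total time is set by whichever worker happens to receive the slow sites; with a shared queue, each worker takes the next site as soon as it is free.

```python
import heapq

def static_makespan(durations, workers):
    """Equal shares: each worker gets a fixed contiguous slice up front."""
    if not durations:
        return 0
    size = -(-len(durations) // workers)  # ceiling division
    chunks = [durations[i:i + size] for i in range(0, len(durations), size)]
    return max(sum(chunk) for chunk in chunks)

def dynamic_makespan(durations, workers):
    """Shared queue: whichever worker becomes free first takes the next task."""
    free_at = [0] * workers  # time at which each worker becomes idle
    heapq.heapify(free_at)
    for d in durations:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + d)
    return max(free_at)

# Hypothetical screenshot times in seconds: the slow sites sit next to each other.
durations = [90, 90, 90, 90, 5, 5, 5, 5, 5, 5, 5, 5]
print(static_makespan(durations, 3))   # 360
print(dynamic_makespan(durations, 3))  # 180
```

In this made-up case the equal split leaves two workers idle after 20 seconds while the third works for 360 seconds; the shared queue finishes in half that time.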

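I happened to be learning Python recently, and Python makes multithreading quite convenient. After reading up on it, I implemented the "faster workers do more" approach with threading + queue: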

#coding:utf-8
# Python 2 script (in Python 3 the Queue module is named queue)
import threading
import time
import Queue
import os

class Webshot(threading.Thread):
        def __init__(self,queue):
                threading.Thread.__init__(self)
                self.queue=queue

        def run(self):
                while True:
                        # Take the next host without blocking and exit once the queue is empty.
                        # Checking empty() and then calling get() could hang forever if
                        # another thread took the last item in between.
                        try:
                                host=self.queue.get_nowait().strip()
                        except Queue.Empty:
                                break
                        # Capture http://<host> and save it as <host>.jpg
                        shotcmd="DISPLAY=:0 cutycapt --url=http://"+host+" --max-wait=90000 --out="+host+".jpg"
                        os.system(shotcmd)
                        self.queue.task_done()
                        time.sleep(1)

def main():
        queue=Queue.Queue()

        # Fill the queue with one host per line, skipping blank lines
        with open('domain.txt','r') as f:
                for line in f:
                        if line.strip():
                                queue.put(line)

        # Start a pool of 10 threads that all pull from the same queue
        for i in range(0,10):
                shot=Webshot(queue)
                shot.start()

if __name__=="__main__":
        main()
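
One fragile spot in Webshot.run is that the command is built by string concatenation and run through a shell, so a host containing spaces or shell metacharacters would break it. A minimal sketch of an alternative, assuming Python 3 and its standard subprocess module: pass the arguments as a list and set DISPLAY through the environment. The cutycapt options are the ones used in the script; the function names build_command and take_screenshot are hypothetical.

```python
import os
import subprocess

def build_command(host):
    """Return the cutycapt argument list for one host (same options as the script)."""
    return [
        "cutycapt",
        "--url=http://" + host,
        "--max-wait=90000",
        "--out=" + host + ".jpg",
    ]

def take_screenshot(host):
    """Run cutycapt for one host and return its exit code (assumes cutycapt is installed)."""
    env = dict(os.environ, DISPLAY=":0")  # same X display as in the original command
    return subprocess.call(build_command(host), env=env)

# Only build and show the command here; take_screenshot() needs cutycapt and an X display.
print(build_command("example.com"))
```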

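The program works as follows:
1. Create an instance of Queue.Queue() and load the site list from domain.txt into that queue.
2. A for loop starts 10 threads that run concurrently.
3. The queue instance is passed to the thread class Webshot, which is created by subclassing threading.Thread.
4. Each thread repeatedly takes one item from the queue and processes it in its run method.
5. After finishing an item, the thread calls queue.task_done() to signal to the queue that the task is complete.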
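
The steps above can be sketched in Python 3, where the module is named queue rather than Queue. This is a minimal sketch under stated assumptions, not the original script: the names Worker and run_pool are hypothetical, and the work function is a stand-in for the cutycapt call so that the example runs anywhere. It also waits on the queue with join(), which is what the task_done() signal in the last step is for.

```python
import queue
import threading

class Worker(threading.Thread):
    """Pulls items from a shared queue until none are left."""

    def __init__(self, tasks, work, done):
        super().__init__()
        self.tasks = tasks  # shared input queue
        self.work = work    # callable applied to each item (stand-in for the screenshot)
        self.done = done    # shared output queue of (thread name, result) pairs

    def run(self):
        while True:
            try:
                item = self.tasks.get_nowait()
            except queue.Empty:
                break  # nothing left, so this worker exits
            try:
                self.done.put((self.name, self.work(item)))
            finally:
                self.tasks.task_done()  # report the item as processed

def run_pool(items, work, num_threads=10):
    tasks = queue.Queue()
    for item in items:  # fill the queue before any worker starts
        tasks.put(item)
    done = queue.Queue()
    workers = [Worker(tasks, work, done) for _ in range(num_threads)]
    for w in workers:
        w.start()
    tasks.join()  # returns once task_done() has been called for every item
    for w in workers:
        w.join()
    return [done.get() for _ in range(done.qsize())]

hosts = ["example.com", "example.org", "example.net"]
results = run_pool(hosts, lambda host: host + ".jpg", num_threads=2)
print(sorted(name for _, name in results))
```

Which thread handles which host varies from run to run; that is the point of the design, since a thread that finishes early simply takes the next item.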

References:
Python: implementing multithreading with the threading module (repost)
http://bkeep.blog.163.com/blog/static/1234142902012112210717682/
http://fc-lamp.blog.163.com/blog/static/17456668720127221363513/
http://www.pythonclub.org/python-network-application/observer-spider
http://www.phpno.com/python-threading-2.html