Python Crawlers: A Simple Multithreading Example

Date: 2019-12-23


Python supports multithreading, mainly through two modules: thread and threading. thread is the lower-level module; threading is a wrapper around thread that is more convenient to use.
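
As a minimal sketch of the higher-level threading interface, a thread can be created directly from a target function without subclassing. This sketch assumes Python 3 (where the low-level thread module is named _thread); the worker function and the names results and results_lock are illustrative and do not come from the original post.

```python
import threading

results = []
results_lock = threading.Lock()


def worker(n):
    # Hypothetical task: square a number and record the result.
    # The lock keeps the shared list update explicit and safe.
    with results_lock:
        results.append(n * n)


# Create one thread per input, start them all, then wait for all of them.
threads = [threading.Thread(target=worker, args=(n,)) for n in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(results))
```

Passing target and args covers the common case; subclassing Thread, as the listing in this post does, is mostly useful when the thread needs extra state of its own.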

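Python's multithreading is constrained by the GIL, so threads do not truly run in parallel. For I/O-bound work such as crawling, however, it still improves throughput noticeably. For details, see https://www.zhihu.com/question/23474039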
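
The I/O-bound case can be illustrated without touching the network. The sketch below assumes Python 3 and uses time.sleep as a stand-in for a slow request, since a sleeping thread releases the GIL much as a thread blocked on a socket does. DELAY, N_TASKS and fake_fetch are assumed names and values chosen for illustration.

```python
import threading
import time

DELAY = 0.2   # simulated latency per "page" in seconds (assumed value)
N_TASKS = 5   # number of simulated requests (assumed value)


def fake_fetch():
    # time.sleep releases the GIL while waiting, so other threads can proceed.
    time.sleep(DELAY)


def run_sequential():
    start = time.time()
    for _ in range(N_TASKS):
        fake_fetch()
    return time.time() - start


def run_threaded():
    start = time.time()
    threads = [threading.Thread(target=fake_fetch) for _ in range(N_TASKS)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.time() - start


sequential_time = run_sequential()
threaded_time = run_threaded()
print('sequential: %.2f s, threaded: %.2f s' % (sequential_time, threaded_time))
```

The sequential run waits for each delay in turn, while the threaded run overlaps the waits. CPU-bound work would not see this benefit under the GIL.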

The example below measures how much multithreading helps. The code only fetches the pages; it does not parse them.

# -*-coding:utf-8 -*-
import urllib2, time
import threading


class MyThread(threading.Thread):
    def __init__(self, func, args):
        threading.Thread.__init__(self)
        self.args = args
        self.func = func

    def run(self):
        self.func(*self.args)


def open_url(url):
    request = urllib2.Request(url)
    html = urllib2.urlopen(request).read()
    print len(html)
    return html

if __name__ == '__main__':
    # Build the list of URLs
    urlList = []
    for p in range(1, 10):
        urlList.append('http://s.wanfangdata.com.cn/Paper.aspx?q=%E5%8C%BB%E5%AD%A6&p=' + str(p))
    
    # Sequential approach
    n_start = time.time()
    for each in urlList:
        open_url(each)
    n_end = time.time()
    print 'the normal way take %s s' % (n_end-n_start)
    
    # Multithreaded approach
    t_start = time.time()
    threadList = [MyThread(open_url, (url,)) for url in urlList]
    for t in threadList:
        t.setDaemon(True)
        t.start()
    for i in threadList:
        i.join()
    t_end = time.time()
    print 'the thread way take %s s' % (t_end-t_start)

Fetching the 9 slow-loading pages built by range(1, 10) with each approach took 50 s sequentially and 10 s with threads.
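
The listing above targets Python 2 (urllib2, the print statement). As a hedged sketch of how the same idea might look in Python 3, the standard library's concurrent.futures.ThreadPoolExecutor can replace the hand-written thread class. The helper fetch_all, its fetch and max_workers parameters, and fake_fetch are names introduced here for illustration; they are not from the original post.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import Request, urlopen


def open_url(url):
    # Python 3 counterpart of the listing's open_url: fetch a page, return its bytes.
    with urlopen(Request(url)) as response:
        return response.read()


def fetch_all(urls, fetch=open_url, max_workers=5):
    # Run fetch(url) for every URL on a bounded pool of worker threads.
    # executor.map yields results in the same order as the input URLs.
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(fetch, urls))


def fake_fetch(url):
    # Stand-in for open_url so this sketch runs without network access.
    return b'x' * len(url)


pages = fetch_all(['a', 'bb', 'ccc'], fetch=fake_fetch)
print([len(page) for page in pages])
```

With network access, calling fetch_all(urlList) on the URL list from the listing would fetch the real pages. Unlike one thread per URL, the pool caps the number of simultaneous requests at max_workers.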

Walkthrough of the multithreaded code:

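# Define a thread class that inherits from Thread
class MyThread(threading.Thread):
    def __init__(self, func, args):
        threading.Thread.__init__(self)  # call the parent class constructor
        self.args = args
        self.func = func

    def run(self):  # the thread's activity; executed in the new thread after start()
        self.func(*self.args)

# Create one thread object per URL and collect them in a list
threadList = [MyThread(open_url, (url,)) for url in urlList]
for t in threadList:
    t.setDaemon(True)  # daemon thread: the main thread does not wait for it on exit
    t.start()  # start the thread
for i in threadList:
    i.join()  # block the main thread until this thread finishes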
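
Note that open_url returns the page content, but MyThread.run discards that return value. One possible way to keep it, sketched here for Python 3, is to store it on the thread object. ResultThread and page_length are hypothetical names used only for this illustration.

```python
import threading


class ResultThread(threading.Thread):
    """Variant of MyThread that stores the return value of func."""

    def __init__(self, func, args):
        super().__init__()
        self.func = func
        self.args = args
        self.result = None

    def run(self):
        # Same as MyThread.run, except the return value is kept.
        self.result = self.func(*self.args)


def page_length(text):
    # Hypothetical stand-in for open_url; no network access.
    return len(text)


threads = [ResultThread(page_length, (s,)) for s in ['a', 'bb', 'ccc']]
for t in threads:
    t.start()
for t in threads:
    t.join()  # result is only safe to read after join() returns

print([t.result for t in threads])
```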

Python multithreading tutorials for reference:

Liao Xuefeng (廖雪峯): http://www.liaoxuefeng.com/wiki/001374738125095c955c1e6d8bb493182103fac9270762a000/001386832360548a6491f20c62d427287739fcfa5d5be1f000

Runoob (菜鳥教程): http://www.runoob.com/python/python-multithreading.html

Chongshi (蟲師): http://www.cnblogs.com/fnng/p/3670789.html

Jb51 (腳本之家): http://www.jb51.net/article/63784.htm