Python Data Scraping and an Analysis of Multithreading Efficiency

Date: 2019-11-11


I used to write crawlers in PHP. Snoopy combined with simple_html_dom worked well enough; at the very least it got the job done.

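PHP has never had a convenient multithreading mechanism. There are tricks for getting a parallel effect (leaning on an Apache or nginx server, forking a child process, or dynamically generating several PHP scripts and running them as separate processes), but in terms of both code structure and ease of use none of them feels natural. I have also heard of a PHP extension called pthreads, which provides real multithreading for PHP. Its introduction on GitHub reads: Absolutely, this is not a hack, we don't use forking or any other such nonsense, what you create are honest to goodness posix threads that are completely compatible with PHP and safe ... this is true multi-threading :)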

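But I digress; I won't go further into PHP here. Having decided to try scraping with Python, I also needed to learn about Python multithreading. I had long heard experts say how pleasant Python is to use, but without actually trying it I could not tell what its concrete advantages are or which problems it suits.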

Enough preamble; on to the main topic.

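Scraping target: Taobao
Data collected: shop name, URL, and shop rank of Taobao shops for a given keyword
Third-party packages used:
- requests (I only learned about it from an article a couple of days ago, Python modules you should know (link); before that I only knew urllib2)
- BeautifulSoup (there seems to be a newer version, bs4, but I used the old one)
- Redis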

Scraping

Single-threaded version

Code:

search_config.py

#!/usr/bin/env python
# coding=utf-8
class config:
    keyword = '青島'
    search_type = 'shop'
    url = 'http://s.taobao.com/search?q=' + keyword + '&commend=all&search_type='+ search_type +'&sourceId=tb.index&initiative_id=tbindexz_20131207&app=shopsearch&s='
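
The search URL above is built by string concatenation, which leaves the non-ASCII keyword unescaped. As a minimal sketch, assuming Python 3 (the listings in this post are Python 2) and the standard `urllib.parse.urlencode`, the same query string could be assembled from a list of pairs so that every value is percent-encoded. The helper name `build_search_url` is hypothetical, and the post does not say which character encoding the site expects for the keyword, so the encoding is left as a parameter.

```python
from urllib.parse import urlencode

BASE_URL = 'http://s.taobao.com/search'


def build_search_url(keyword, start, search_type='shop', encoding='utf-8'):
    """Build the search URL for one result page (hypothetical helper).

    `start` is the result offset that the original code appends after `&s=`.
    """
    params = [
        ('q', keyword),
        ('commend', 'all'),
        ('search_type', search_type),
        ('sourceId', 'tb.index'),
        ('initiative_id', 'tbindexz_20131207'),
        ('app', 'shopsearch'),
        ('s', start),
    ]
    # urlencode percent-encodes each value using the given encoding
    return BASE_URL + '?' + urlencode(params, encoding=encoding)


url = build_search_url('青島', 6)
print(url)
```

Keeping the offset as the last parameter preserves the shape of the original URL, where the page offset is appended at the end.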

single_scrapy.py

#!/usr/bin/env python
# coding=utf-8
import requests
from search_config import config

class Scrapy():

    def __init__(self, threadname, start_num):
        self.threadname = threadname
        self.start_num = start_num
        print threadname + 'start.....'


    def run(self):
        url = config.url + self.start_num
        response = requests.get(url)
        print self.threadname + 'end......'

def main():
    for i in range(0,13,6):
        scrapy = Scrapy('scrapy', str(i))
        scrapy.run()


if __name__ == '__main__':
    main()

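Run analysis:

This is the simplest, most conventional way to scrape: loop over the pages in order, open a network connection for each, and fetch the page. As the screenshot showed, this approach is very inefficient: while one request is waiting on network I/O, the others cannot start until it finishes. In other words, each request blocks the ones after it.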

Multithreaded version

Code:

#!/usr/bin/env python
# coding=utf-8
import requests
from search_config import config
import threading

class Scrapy(threading.Thread):

    def __init__(self, threadname, start_num):
        threading.Thread.__init__(self, name = threadname)
        self.threadname = threadname
        self.start_num = start_num
        print threadname + 'start.....'

    # override the run method of the Thread class
    def run(self):
        url = config.url + self.start_num
        response = requests.get(url)
        print self.threadname + 'end......'

def main():
    for i in range(0,13,6):
        scrapy = Scrapy('scrapy', str(i))
        scrapy.start()


if __name__ == '__main__':
    main()

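Run analysis:

The screenshot showed that, for the same number of pages, turning on multithreading cut the elapsed time considerably, at the cost of higher CPU utilization.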
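
The difference between the two versions can be reproduced without touching the network. The following is a minimal sketch, assuming Python 3; `fake_fetch` is a hypothetical stand-in in which `time.sleep` plays the role of the blocking `requests.get` call, and the 0.2 s delay is an arbitrary choice.

```python
import threading
import time


def fake_fetch(page, results, delay=0.2):
    """Stand-in for requests.get(): sleeping blocks like network I/O does."""
    time.sleep(delay)
    results[page] = 'page-%s' % page


def run_sequential(pages):
    results = {}
    started = time.perf_counter()
    for page in pages:
        fake_fetch(page, results)  # each call waits for the previous one
    return results, time.perf_counter() - started


def run_threaded(pages):
    results = {}
    started = time.perf_counter()
    threads = [threading.Thread(target=fake_fetch, args=(page, results))
               for page in pages]
    for t in threads:
        t.start()
    for t in threads:
        t.join()  # wait for every download before stopping the clock
    return results, time.perf_counter() - started


pages = ['0', '6', '12']
seq_results, seq_time = run_sequential(pages)
thr_results, thr_time = run_threaded(pages)
print('sequential: %.2fs, threaded: %.2fs' % (seq_time, thr_time))
```

The sequential run takes roughly the sum of the delays, while the threaded run takes roughly one delay, because the threads wait at the same time.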

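Parsing the page content

Once we have the HTML, we need to parse it and extract the information we want:

Shop name
Shop URL
Shop rank

With the BeautifulSoup library you can select elements directly by class, id, or other HTML attributes, which is far more intuitive and much easier than writing regular expressions by hand. The price, of course, is noticeably slower execution.

Code: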

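A Queue is used here to implement the producer-consumer pattern
- Producer-consumer pattern:
  - Producers put data into a queue one item at a time, and consumers take items out in turn.
  - In this example, the scrapy threads keep supplying data and the parse threads take it off the queue and parse it
The Queue module
- Python's Queue object has thread synchronization built in, so it can serve as a FIFO queue shared by multiple producers and multiple consumers.
- Queue is very useful when information has to be exchanged safely between threads.
- By default a Queue has no length limit, but the maxsize argument of its constructor can be used to set one.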
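
The points above can be illustrated with a minimal sketch, assuming Python 3, where the module is named `queue` (the listing below uses the Python 2 name `Queue`). The `SENTINEL` value used to tell the consumer to stop is a convention of this sketch, not something the original code does.

```python
import queue
import threading

SENTINEL = None  # marks the end of the stream; an assumption of this sketch


def producer(out_q, items):
    for item in items:
        out_q.put(item)      # blocks while the queue is full (maxsize reached)
    out_q.put(SENTINEL)


def consumer(in_q, results):
    while True:
        item = in_q.get()    # blocks until something is available
        if item is SENTINEL:
            break
        results.append(item.upper())


q = queue.Queue(maxsize=2)   # bounded; queue.Queue() would be unbounded
results = []
workers = [
    threading.Thread(target=producer, args=(q, ['a', 'b', 'c', 'd'])),
    threading.Thread(target=consumer, args=(q, results)),
]
for w in workers:
    w.start()
for w in workers:
    w.join()
print(results)
```

With a single consumer the items come out in the order they were put in, which is the FIFO behavior described above. No explicit lock is needed because the queue does its own synchronization.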

#!/usr/bin/env python
# coding=utf-8
import requests
from BeautifulSoup import BeautifulSoup
from search_config import config

from Queue import Queue
import threading

class Scrapy(threading.Thread):

    def __init__(self, threadname, queue, out_queue):
        threading.Thread.__init__(self, name = threadname)
        self.sharedata = queue
        self.out_queue= out_queue
        self.threadname = threadname
        print threadname + 'start.....'


    def run(self):
        url = config.url + self.sharedata.get()
        response = requests.get(url)
        self.out_queue.put(response)
        print self.threadname + 'end......'

class Parse(threading.Thread):
    def __init__(self, threadname, queue, out_queue):
        threading.Thread.__init__(self, name = threadname)
        self.sharedata = queue
        self.out_queue= out_queue
        self.threadname = threadname
        print threadname + 'start.....'

    def run(self):
        response = self.sharedata.get()
        body = response.content.decode('gbk').encode('utf-8')
        soup = BeautifulSoup(body)
        ul_html = soup.find('ul',{'id':'list-container'})
        lists = ul_html.findAll('li',{'class':'list-item'})
        stores = []
        for list in lists:
            store= {}
            try:
                infos = list.findAll('a',{'trace':'shop'})
                for info in infos:
                    attrs = dict(info.attrs)
                    if attrs.has_key('class'):
                        if 'rank' in attrs['class']:
                            rank_string = attrs['class']
                            rank_num = rank_string[-2:]
                            if (rank_num[0] == '-'):
                                store['rank'] = rank_num[-1]
                            else:
                                store['rank'] = rank_num
                    if attrs.has_key('title'):
                        store['title'] = attrs['title']
                        store['href'] = attrs['href']
            except AttributeError:
                pass
            if store:
                stores.append(store)

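        for store in stores:
            # an entry may lack a title or a rank; skip it instead of crashing the thread
            try:
                print store['title'] + ' ' + store['rank']
            except KeyError:
                pass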
        print self.threadname + 'end......'

def main():
    queue = Queue()
    targets = Queue()
    stores = Queue()
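    # queue: the original requests (page offsets)
    # targets: fetched content waiting to be parsed
    # stores: parsed results; to keep things simple and readable the results are
    #         printed directly in the thread, so this queue is not actually used
    for i in range(0,13,6):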
        queue.put(str(i))
        scrapy = Scrapy('scrapy', queue, targets)
        scrapy.start()
        parse = Parse('parse', targets, stores)
        parse.start()

if __name__ == '__main__':
    main()
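
The rank extraction inside Parse.run is terse: it takes the last two characters of the class attribute and drops a leading '-' when the rank has a single digit. As a hedged sketch in Python 3, the same logic is isolated below next to a regular-expression alternative. Both function names are hypothetical, and the sample class strings are assumptions made for illustration; the post does not show the actual markup.

```python
import re


def rank_by_slicing(class_value):
    """Same slicing logic as in Parse.run: keep the last one or two characters."""
    rank_num = class_value[-2:]
    return rank_num[-1] if rank_num[0] == '-' else rank_num


def rank_from_class(class_value):
    """Alternative: return the trailing run of digits, or None if there is none."""
    match = re.search(r'(\d+)$', class_value)
    return match.group(1) if match else None


# hypothetical class attribute values
for sample in ('rank seller-rank-7', 'rank seller-rank-12'):
    print(sample, rank_by_slicing(sample), rank_from_class(sample))
```

The regular-expression variant does not depend on the rank having at most two digits and reports a missing rank as None instead of returning arbitrary characters.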

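Run results

Looking at this run, the scrapy stage finished very quickly and the parse threads also started early, yet the program then stalled in parse for a long time before producing output: each parsed result took 3 to 5 seconds to appear. Granted, I was running on an old, beat-up IBM laptop, but with multithreading it should not have been this slow.

Now let's look at the same data in a single-threaded run. For convenience I added redis to the multi_scrapy script above and used it to store the raw pages as they were crawled, so that single_parse.py can be run on its own.

Single-threaded version:

Code: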

#!/usr/bin/env python
# coding=utf-8
from BeautifulSoup import BeautifulSoup
import redis


class Parse():
    def __init__(self, threadname, content):
        self.threadname = threadname
        self.content = content
        print threadname + 'start.....'

    def run(self):
        response = self.content
        if response:
            body = response.decode('gbk').encode('utf-8')
            soup = BeautifulSoup(body)
            ul_html = soup.find('ul',{'id':'list-container'})
            lists = ul_html.findAll('li',{'class':'list-item'})
            stores = []
            for list in lists:
                store= {}
                try:
                    infos = list.findAll('a',{'trace':'shop'})
                    for info in infos:
                        attrs = dict(info.attrs)
                        if attrs.has_key('class'):
                            if 'rank' in attrs['class']:
                                rank_string = attrs['class']
                                rank_num = rank_string[-2:]
                                if (rank_num[0] == '-'):
                                    store['rank'] = rank_num[-1]
                                else:
                                    store['rank'] = rank_num
                        if attrs.has_key('title'):
                            store['title'] = attrs['title']
                            store['href'] = attrs['href']
                except AttributeError:
                    pass
                if store:
                    stores.append(store)

            for store in stores:
                try:
                    print store['title'] + ' ' + store['rank']
                except KeyError:
                    pass
            print self.threadname + 'end......'
        else:
            pass

def main():
    r = redis.StrictRedis(host='localhost', port=6379)
    while True:
        content = r.lpop('targets')
        if (content):
            parse = Parse('parse', content)
            parse.run()
        else:
            break

if __name__ == '__main__':
    main()

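Run results:

The results show that the single-threaded version takes about as long as the multithreaded one. The multithreaded run above did include the time to fetch the pages, but as the first example showed, with multithreading the three pages can be fetched in under 1s. In other words, using multiple threads for the parsing did not yield any real gain in efficiency.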

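Analyzing the cause

Look at the CPU usage of the two runs: 127% for the first and 98% for the second, both very high. This suggests that for compute-intensive work such as parsing, matching, and extracting strings, parallelism was not being put to good use.
Since the shared data raises no safety issue here, none of the examples above take explicit thread-safety measures: no locks are placed on shared data and only the simplest FIFO queue is used. So lock overhead cannot be the reason efficiency failed to improve.
Searching online, I found that Python threads apparently cannot make use of multiple cores, and that seems to be where the problem lies. Because of the GIL (the global interpreter lock in CPython, the standard interpreter), each thread in a Python process must acquire the GIL before it can run, which interferes with the operating system's own process and thread scheduling. As a result, on a multi-core CPU Python threads effectively execute serially rather than being spread across several CPUs at the same moment. Since the global lock lets only one thread execute at a time, multiple cores bring no advantage. So if your multithreaded process is CPU-bound, multithreading brings no gain in efficiency and may even reduce it because of frequent thread switching; if it is I/O-bound, threads can use the idle time spent blocked on I/O to run other threads, which improves efficiency.
The answer: parsing the data is a CPU-bound operation whereas the network requests are I/O-bound, which is why the results above came out as they did.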
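
The CPU-bound case can be checked with a small experiment. This is a minimal sketch, assuming Python 3; `count_primes` is a hypothetical stand-in for the parsing work, chosen only because it is pure-Python computation with no I/O. On a standard CPython build with the GIL, the two timings it prints are expected to be close to each other; the exact numbers depend on the machine and the interpreter build.

```python
import threading
import time


def count_primes(limit):
    """Pure-Python CPU-bound work: count primes below `limit` by trial division."""
    count = 0
    for n in range(2, limit):
        if all(n % d for d in range(2, int(n ** 0.5) + 1)):
            count += 1
    return count


def timed(func):
    started = time.perf_counter()
    value = func()
    return value, time.perf_counter() - started


def sequential(limits):
    return [count_primes(limit) for limit in limits]


def threaded(limits):
    results = [None] * len(limits)

    def work(i, limit):
        results[i] = count_primes(limit)

    threads = [threading.Thread(target=work, args=(i, limit))
               for i, limit in enumerate(limits)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results


limits = [20000, 20000, 20000]
seq_counts, seq_time = timed(lambda: sequential(limits))
thr_counts, thr_time = timed(lambda: threaded(limits))
print('sequential: %.2fs, threaded: %.2fs' % (seq_time, thr_time))
```

Both variants compute the same counts; only the elapsed time is of interest, and it is the threaded figure that fails to shrink when the work is CPU-bound.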

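Solution

Since the GIL applies per Python interpreter process, running several interpreter processes sidesteps the GIL problem, and nothing forces those interpreters onto the same core either. The best solution is therefore to combine multithreading with multiprocessing: run the I/O-bound parts in threads and run the CPU-bound work in a suitably sized set of processes. That way the CPU can be fully loaded :)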
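
One way to combine the two is sketched below, assuming Python 3 and its standard `concurrent.futures` module, which the original post does not use. `fetch` and `parse` are hypothetical stand-ins for the `requests.get` call and the BeautifulSoup parsing, and `run_pipeline` and its parameters are names invented for this sketch.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor


def fetch(offset):
    """Stand-in for the network request; a real version might call requests.get()."""
    return '<li>shop-%s</li>' % offset


def parse(html):
    """Stand-in for the CPU-heavy BeautifulSoup parsing step."""
    return html.replace('<li>', '').replace('</li>', '')


def run_pipeline(offsets, cpu_executor=ProcessPoolExecutor,
                 io_workers=3, cpu_workers=2):
    # I/O-bound stage: threads overlap the time spent waiting on the network
    with ThreadPoolExecutor(max_workers=io_workers) as io_pool:
        pages = list(io_pool.map(fetch, offsets))
    # CPU-bound stage: with the default executor, each worker is a separate
    # process with its own interpreter and therefore its own GIL
    with cpu_executor(max_workers=cpu_workers) as cpu_pool:
        return list(cpu_pool.map(parse, pages))
```

Calling `run_pipeline(['0', '6', '12'])` parses the pages in worker processes; on platforms that start worker processes by spawning a fresh interpreter, that call should sit under an `if __name__ == '__main__':` guard. Passing `cpu_executor=ThreadPoolExecutor` runs the same pipeline with threads only, which makes it easy to compare the two setups.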