Scraping Meme Images from Doutula with the Producer-Consumer Pattern and Python Multithreading

Date: 2020-06-18

What Is the Producer-Consumer Pattern

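Some modules are responsible for producing data, and other modules are responsible for processing it (a module here may be a function, a thread, a process, and so on). The module that produces data is called the producer, and the module that processes it is called the consumer. The buffer that sits between them is called the warehouse. The producer delivers goods to the warehouse and the consumer takes goods out of it; together they make up the producer-consumer pattern.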
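As a minimal sketch of the definition above, the following uses the standard-library `queue.Queue` as the warehouse, with one producer thread and one consumer thread. The names `run_demo`, `warehouse`, and `consumed` are illustrative only, and the `None` sentinel used to signal the end of production is one convention among several.

```python
import queue
import threading


def run_demo(item_count=5):
    """Run one producer and one consumer that share a single buffer."""
    warehouse = queue.Queue()  # the buffer between producer and consumer
    consumed = []              # results written by the consumer thread only

    def producer():
        for i in range(item_count):
            warehouse.put(i)   # deliver goods to the warehouse
        warehouse.put(None)    # sentinel: nothing more will be produced

    def consumer():
        while True:
            item = warehouse.get()  # blocks until an item is available
            if item is None:
                break
            consumed.append(item * 2)  # "process" the item

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return consumed


if __name__ == '__main__':
    print(run_demo())
```

The producer never calls the consumer; both only know about the queue.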

Advantages of the Producer-Consumer Pattern

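Decoupling
Suppose the producer and the consumer are two separate threads. If the producer calls a method of the consumer directly, the producer depends on the consumer (that is, the two are coupled), and a later change to the consumer's code may affect the producer's code. If both depend only on a shared buffer and not on each other, the coupling is reduced accordingly.
Concurrency
The producer and the consumer are two independent concurrent units that communicate through the buffer. The producer only has to drop data into the buffer before moving on to produce the next item, and the consumer only has to take data out of the buffer. As long as the buffer is neither full nor empty, neither side has to wait for the other to finish processing an item.
Tolerance for uneven load
When the producer generates data faster than the consumer can process it, the unprocessed data can wait in the buffer and be worked off gradually, so a slow consumer neither causes data loss nor holds up the producer.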
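The uneven-load point can be illustrated with a small sketch in which a fast producer hands a burst of items to a deliberately slow consumer. The function name `run_burst`, the `consumer_delay` parameter, and the timestamp bookkeeping are assumptions made for this illustration.

```python
import queue
import threading
import time


def run_burst(item_count=20, consumer_delay=0.001):
    """Fast producer, slow consumer: the buffer holds the backlog."""
    buffer = queue.Queue()  # unbounded, so put() never blocks the producer
    processed = []
    timestamps = {}

    def producer():
        for i in range(item_count):
            buffer.put(i)                # returns immediately
        timestamps['producer_done'] = time.monotonic()
        buffer.put(None)                 # sentinel: production is over

    def consumer():
        while True:
            item = buffer.get()
            if item is None:
                break
            time.sleep(consumer_delay)   # simulate slow processing
            processed.append(item)
        timestamps['consumer_done'] = time.monotonic()

    threads = [threading.Thread(target=producer),
               threading.Thread(target=consumer)]
    for th in threads:
        th.start()
    for th in threads:
        th.join()
    return processed, timestamps


if __name__ == '__main__':
    items, stamps = run_burst()
    print(len(items), 'items processed, none lost')
    print('consumer finished after producer:',
          stamps['consumer_done'] >= stamps['producer_done'])
```

With an unbounded queue the backlog can grow without limit. Passing `maxsize` to `queue.Queue` caps it, at the cost of `put()` blocking the producer whenever the buffer is full.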

Example

#!/usr/bin/python
# -*- coding: utf-8 -*-
# @Time    : 2017/12/4 16:29
# @Author  : YuLei Lan
# @Email   : lanyulei@renrenche.com
# @File    : urls.py
# @Software: PyCharm

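import os
import threading
import time
from urllib.request import urlretrieve

import requests
from bs4 import BeautifulSoup

BASE_PAGE_URL = 'http://www.doutula.com/photo/list/?page='
IMAGE_DIR = 'images'
PAGE_URL_LIST = []  # URLs of all list pages
FACE_URL_LIST = []  # URLs of all meme images
gLock = threading.Lock()  # guards both lists
gProducersDone = threading.Event()  # set once every producer has finished

def get_page_url():
    for i in range(1, 2):  # only page 1; widen the range to crawl more pages
        url = BASE_PAGE_URL + str(i)
        PAGE_URL_LIST.append(url)

def producer():
    """
        Producer
        Keeps producing the downloadable img_url addresses
    """

    while True:  # a for loop cannot be used: other threads shrink the list
        # Check and pop under one lock so that no other thread can empty
        # the list between the check and the pop.
        with gLock:
            if not PAGE_URL_LIST:
                break
            page_url = PAGE_URL_LIST.pop()

        response = requests.get(page_url, timeout=10)
        text = response.text
        soup = BeautifulSoup(text, 'lxml')
        img_list = soup.find_all('img', attrs={'class': 'img-responsive lazy image_dta'})

        with gLock:
            for img in img_list:
                img_url = img.get('data-original')
                if not img_url:
                    continue  # skip tags without a lazy-load URL
                if img_url.startswith('//'):  # protocol-relative URL
                    img_url = 'http:{}'.format(img_url)
                FACE_URL_LIST.append(img_url)

def consumer():
    """ Consumer: downloads every image URL placed in FACE_URL_LIST """

    while True:
        # Read the flag before looking at the list: if it was already set,
        # an empty list means no more URLs can arrive.
        done = gProducersDone.is_set()
        with gLock:
            img_url = FACE_URL_LIST.pop() if FACE_URL_LIST else None

        if img_url is None:
            if done:
                break
            time.sleep(0.1)  # wait briefly instead of spinning on an empty list
            continue

        filename = img_url.split('/').pop()
        path_file = os.path.join(IMAGE_DIR, filename)
        try:
            urlretrieve(img_url, filename=path_file)
        except OSError as exc:  # URLError and HTTPError are OSError subclasses
            print('failed to download {}: {}'.format(img_url, exc))

def main():
    os.makedirs(IMAGE_DIR, exist_ok=True)  # urlretrieve does not create the directory

    producers = [threading.Thread(target=producer) for _ in range(3)]
    consumers = [threading.Thread(target=consumer) for _ in range(5)]
    for th in producers + consumers:
        th.start()

    for th in producers:
        th.join()
    gProducersDone.set()  # tell the consumers that no more URLs will arrive
    for th in consumers:
        th.join()

if __name__ == '__main__':
    get_page_url()
    main()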
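
The listing above manages the shared lists and the lock by hand. As an alternative sketch, not the author's method, the same shape can be built on `queue.Queue`, which does its own locking and blocks consumers while the buffer is empty. To keep it runnable without network access, the page parser and the downloader are passed in as the hypothetical parameters `fetch_image_urls` and `download`, and the function name `crawl` is likewise an assumption.

```python
import queue
import threading


def crawl(page_urls, fetch_image_urls, download,
          producer_count=3, consumer_count=5):
    """Producers turn page URLs into image URLs; consumers download them."""
    page_queue = queue.Queue()
    image_queue = queue.Queue()
    for url in page_urls:
        page_queue.put(url)

    def producer():
        while True:
            try:
                page_url = page_queue.get_nowait()
            except queue.Empty:
                return  # no pages left
            for img_url in fetch_image_urls(page_url):
                image_queue.put(img_url)

    def consumer():
        while True:
            img_url = image_queue.get()  # blocks while the queue is empty
            if img_url is None:
                return  # sentinel: all producers have finished
            download(img_url)

    producers = [threading.Thread(target=producer)
                 for _ in range(producer_count)]
    consumers = [threading.Thread(target=consumer)
                 for _ in range(consumer_count)]
    for th in producers + consumers:
        th.start()
    for th in producers:
        th.join()
    for _ in consumers:
        image_queue.put(None)  # one sentinel per consumer
    for th in consumers:
        th.join()


if __name__ == '__main__':
    # Stand-ins for the real parser and downloader, for demonstration only.
    seen = []
    crawl(['page1', 'page2'],
          fetch_image_urls=lambda page: [page + '/a.jpg', page + '/b.jpg'],
          download=seen.append)
    print(sorted(seen))
```

In a real crawler, `fetch_image_urls` would wrap the `requests` and `BeautifulSoup` calls from the producer above, and `download` would wrap `urlretrieve`.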