Scraping Douban images with Python

Date: 2019-11-19

Tags: python, Douban images


Scraping Douban images: for an implementation of the producer-consumer model in Python, you can refer to this article: http://www.bkjia.com/Pythonjc/978391.html
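As a minimal sketch of the producer-consumer pattern with `threading.Condition`, assuming Python 3 (the script in this post is Python 2): the names `items`, `consumed`, `MAX_ITEMS`, `producer` and `consumer` are made up for illustration. Each side waits inside a `while` loop so the condition is re-checked after every wake-up, and the `with` block releases the lock automatically.

```python
import threading

MAX_ITEMS = 3                      # assumed buffer capacity for this demo
items = []                         # shared buffer
consumed = []                      # what the consumer has taken out, in order
condition = threading.Condition()


def producer(n):
    for i in range(n):
        with condition:                      # acquires and releases the lock
            while len(items) >= MAX_ITEMS:   # buffer full: wait for the consumer
                condition.wait()
            items.append(i)
            condition.notify_all()


def consumer(n):
    for _ in range(n):
        with condition:
            while not items:                 # buffer empty: wait for the producer
                condition.wait()
            consumed.append(items.pop(0))
            condition.notify_all()


threads = [threading.Thread(target=producer, args=(10,)),
           threading.Thread(target=consumer, args=(10,))]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(consumed)
```

With one producer and one consumer the items come out in the order they went in.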

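I think it explains things fairly clearly; after reading it and writing a few examples modeled on it, you should have a good feel for this topic. The code below does not scrape all of Douban's albums. I picked one frequently recommended album to play with, and only the first 20 pages are fetched. Each page has 30 images, so the URL can be updated page by page on that basis. A list holds the image URLs, a consumer thread downloads the images, and a producer thread collects the image URLs. Here is the code: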

# -*- coding: utf-8 -*-

import urllib2
import cookielib
from bs4 import BeautifulSoup
import re
import time
import threading

start = start_time = time.ctime()
s = []
max_length = 30
condition = threading.Condition()


class Producer(threading.Thread):
    def run(self):
        for i in xrange(20):
            condition.acquire()
            # Re-check after every wake-up; wait while the list is at or above capacity
            while len(s) >= max_length:
                print 's is full'
                condition.wait()
            # URL of the recommended album; start advances by 30 per page
            request_url = 'https://site.douban.com/widget/public_album/86320/?start=%s' % (i*30)

            headers = {
                    'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/35.0.1916.153 Safari/537.36',
                }

            opener = urllib2.build_opener(urllib2.HTTPCookieProcessor(cookielib.CookieJar()))
            request = urllib2.Request(request_url, headers=headers)
            html = opener.open(request)
            soup = BeautifulSoup(html, 'html.parser')
            img_urls = soup.find_all('a', {'class': 'album_photo'})

            for item in img_urls:
                p = re.compile(r'src="(.*?)"')
                img_url = p.search(str(item.find('img'))).group(1)  # URL of the image
                s.append(img_url)
                print 'produced something'
            condition.notify()
            # Release the lock so the consumer can acquire it
            condition.release()


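class Consumer(threading.Thread):
    def run(self):
        count = 0
        # Note: this loop has no exit condition, so the thread keeps waiting after the producer finishes
        while True:
            condition.acquire()
            # Re-check after every wake-up; wait while the list is empty
            while not s:
                print 's is empty wait'
                condition.wait()
            img_url = s.pop(0)
            print 'consumed something'
            condition.notify()
            condition.release()
            # Download outside the lock so the producer is not blocked by network I/O
            try:
                response_img = urllib2.urlopen(img_url).read()  # download the image
            except Exception:
                print 'error'
                continue
            with open('E:\\douban\\%s.jpg' % count, 'wb') as fp:
                fp.write(response_img)
            count += 1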

t1 = Producer()
c1 = Consumer()
t1.start()
c1.start()
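The paging used by the producer (20 pages, 30 images per page, with `start` advancing by 30) can be sketched on its own. This is a Python 3 sketch; `page_urls`, `BASE_URL` and `PAGE_SIZE` are hypothetical names introduced here, and only the URL pattern comes from the script above.

```python
# URL pattern taken from the script; %s is the offset of the first image on the page
BASE_URL = 'https://site.douban.com/widget/public_album/86320/?start=%s'
PAGE_SIZE = 30   # images per page, as stated in the post


def page_urls(pages=20, page_size=PAGE_SIZE):
    """Yield one album URL per page, advancing start by page_size each time."""
    for i in range(pages):
        yield BASE_URL % (i * page_size)


urls = list(page_urls())
print(urls[0])
print(urls[-1])
```

Changing `pages` is all it takes to fetch more or fewer pages of the album.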

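Well, that's about it. You can also start a few more threads to speed up the download; after all, when scraping a large number of images, the I/O operations are fairly time-consuming.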
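One way to run several download threads is sketched below, assuming Python 3 and swapping the hand-written list and `Condition` for the standard `queue.Queue`, which does the locking internally. `download_all`, `worker`, `fake_download`, `NUM_WORKERS` and `SENTINEL` are hypothetical names; `fake_download` stands in for a real network call so the sketch runs offline.

```python
import queue
import threading

NUM_WORKERS = 4   # assumed number of download threads
SENTINEL = None   # placed on the queue once per worker to signal "no more work"


def fake_download(url):
    """Hypothetical stand-in for a real download; returns placeholder bytes."""
    return ('image data for %s' % url).encode('utf-8')


def worker(tasks, results, results_lock, download):
    """Take URLs off the queue until the sentinel arrives."""
    while True:
        url = tasks.get()
        if url is SENTINEL:
            break
        try:
            data = download(url)
        except Exception as exc:
            print('error downloading %s: %s' % (url, exc))
            continue
        with results_lock:
            results.append((url, data))


def download_all(urls, download=fake_download, num_workers=NUM_WORKERS):
    tasks = queue.Queue(maxsize=30)   # bounded, like max_length in the script
    results = []
    results_lock = threading.Lock()
    threads = [threading.Thread(target=worker,
                                args=(tasks, results, results_lock, download))
               for _ in range(num_workers)]
    for t in threads:
        t.start()
    for url in urls:
        tasks.put(url)                # blocks while the queue is full
    for _ in threads:
        tasks.put(SENTINEL)           # one sentinel per worker
    for t in threads:
        t.join()
    return results


urls = ['http://example.com/%d.jpg' % i for i in range(10)]
results = download_all(urls)
print(len(results))
```

For real downloads, `download` could be something like `lambda url: urllib.request.urlopen(url).read()`. Note that in the script above each `Consumer` keeps its own `count`, so several consumers would write to the same file names; keying results by URL, as here, avoids that.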

[Screenshot: a sample of the downloaded images]
