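Scrapy provides an item pipeline for downloading the images attached to a particular item, for example when you scrape products and also want to download their images locally.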
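This pipeline, called the Images Pipeline and implemented in the ImagesPipeline class, provides a convenient way to download and store images locally, with some additional features: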
- Converts all downloaded images to a common format (JPG) and mode (RGB)
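Pillow is used to generate thumbnails and to normalize images to JPEG/RGB format, so you need to install this library in order to use the images pipeline. The Python Imaging Library (PIL) works in most cases, but it is known to cause problems in some setups, so we recommend using Pillow rather than PIL.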
On Windows, pip could not find an installable version of PIL, so Pillow was used instead, and it ran without problems.
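The JPEG/RGB normalization mentioned above can be sketched with Pillow directly. This is only a minimal illustration and not Scrapy's own implementation; the helper name normalize_to_jpeg and the in-memory test image are assumptions made for the example.

```python
from io import BytesIO

from PIL import Image  # provided by the Pillow package


def normalize_to_jpeg(raw_bytes):
    """Return JPEG-encoded bytes in RGB mode for any image Pillow can read."""
    image = Image.open(BytesIO(raw_bytes))
    if image.mode != "RGB":
        # JPEG has no alpha channel or palette, so convert first.
        image = image.convert("RGB")
    buffer = BytesIO()
    image.save(buffer, "JPEG")
    return buffer.getvalue()


# Build a tiny RGBA PNG in memory as stand-in input (illustrative data).
source = BytesIO()
Image.new("RGBA", (4, 4), (255, 0, 0, 128)).save(source, "PNG")

jpeg_bytes = normalize_to_jpeg(source.getvalue())
result = Image.open(BytesIO(jpeg_bytes))
print(result.format, result.mode)
```

If this import fails, Pillow is probably not installed; it is distributed on PyPI under the name Pillow but imported as PIL.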
Below is a small demo that scrapes Baidu Tieba.
baidu.py, the spider in the spiders folder:
import scrapy

from BaiduTieba.items import BaidutiebaItem


class BaiduTieBaSpider(scrapy.spiders.Spider):
    name = 'baidutieba'
    # Pages 1 to 37 of the thread.
    start_urls = ['http://tieba.baidu.com/p/2235516502?see_lz=1&pn=%d' % i for i in range(1, 38)]
    # Class-level map from image URL to target file name, read by the pipeline.
    image_names = {}

    def parse(self, response):
        item = BaidutiebaItem()
        item['image_urls'] = response.xpath("//img[@class='BDE_Image']/@src").extract()
        for index, value in enumerate(item['image_urls']):
            # Sequential number built from the page's position and the image's position.
            # Names can collide when pages contain different numbers of images.
            number = self.start_urls.index(response.url) * len(item['image_urls']) + index
            self.image_names[value] = 'full/%04d.jpg' % number
        yield item
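The numbering formula used in parse can be tried out on its own. The sketch below mirrors that formula in a standalone helper; the function name image_name and the page sizes are hypothetical values chosen for illustration.

```python
def image_name(page_index, images_on_page, image_index):
    """Mirror the spider's formula for a sequential file name."""
    number = page_index * images_on_page + image_index
    return 'full/%04d.jpg' % number


# Hypothetical page: the third start URL (index 2) containing 10 images.
names = [image_name(2, 10, i) for i in range(10)]
print(names[0], names[-1])

# Caveat: when pages hold different numbers of images, names can collide.
# Here page index 1 with 3 images and page index 3 with 1 image both
# produce number 3 for their first image.
collision = image_name(1, 3, 0) == image_name(3, 1, 0)
print(collision)
```

A running counter, or a name derived from the image URL, would avoid such collisions, at the cost of losing the page-based ordering.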
Note the import path used for the Item class.
items.py
import scrapy


class BaidutiebaItem(scrapy.Item):
    image_urls = scrapy.Field()   # URLs of the images to download
    images = scrapy.Field()
    image_paths = scrapy.Field()  # filled in by the pipeline's item_completed
ImagePipeline.py
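import scrapy
# Older Scrapy releases exposed this class as scrapy.contrib.pipeline.images.
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem

from BaiduTieba.spiders.baidu import BaiduTieBaSpider


class MyImagesPipeline(ImagesPipeline):

    # The keyword-only item argument is passed by newer Scrapy releases.
    def file_path(self, request, response=None, info=None, *, item=None):
        # Look up the file name the spider assigned to this image URL.
        image_name = BaiduTieBaSpider.image_names[request.url]
        return image_name

    def get_media_requests(self, item, info):
        # Schedule one download request per image URL.
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        # Keep only the paths of images that downloaded successfully.
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        item['image_paths'] = image_paths
        return item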
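The filtering step in item_completed is easier to follow with sample data. The sketch below uses a hand-written results list shaped like the (success, info) pairs that the pipeline receives; the URLs, checksums, and the plain Exception standing in for a download failure are all made up for illustration.

```python
# Hypothetical results list: each entry is (success, info).
# On success, info is a dict describing the stored file;
# on failure, it is an error object instead.
results = [
    (True, {'url': 'http://example.com/a.jpg', 'path': 'full/0000.jpg', 'checksum': 'aaa'}),
    (False, Exception('download failed')),
    (True, {'url': 'http://example.com/c.jpg', 'path': 'full/0002.jpg', 'checksum': 'ccc'}),
]

# Same comprehension as in item_completed: failed entries are skipped,
# so their info object is never indexed.
image_paths = [info['path'] for ok, info in results if ok]
print(image_paths)
```

When every entry has failed, image_paths is empty, which is the case where the pipeline above raises DropItem.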
settings.py
BOT_NAME = 'BaiduTieba'

SPIDER_MODULES = ['BaiduTieba.spiders']
NEWSPIDER_MODULE = 'BaiduTieba.spiders'

ROBOTSTXT_OBEY = False

ITEM_PIPELINES = {
    'BaiduTieba.ImagePipeline.MyImagesPipeline': 300,
}

IMAGES_STORE = '/baidutieba.01'
IMAGES_STORE is the directory where downloaded images are saved.
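As a rough sketch of how the two pieces fit together, the value returned by file_path is treated as a path relative to IMAGES_STORE. The snippet below only joins the two strings to show the resulting location; the sample file name is an assumption, and posixpath is used so the output looks the same on every platform.

```python
import posixpath

IMAGES_STORE = '/baidutieba.01'   # value from settings.py
relative_path = 'full/0001.jpg'   # example of what file_path returns

full_path = posixpath.join(IMAGES_STORE, relative_path)
print(full_path)
```

On Windows, a store path that starts with a slash is normally resolved against the root of the current drive, so the images would end up in a baidutieba.01 folder there.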