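Today's topic is image downloading. Scrapy provides the ImagesPipeline class as a convenient way to download and store images (it requires the Pillow library).
(1) We again crawl dribbble.com. In the project's dribbble.py file, extract each image's src attribute from the response; this gives us the image URL, as covered in an earlier lesson.
(2) Then add fields to items.py as needed, for example an image URL field, a title field, and a date field: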
    # items.py
    import scrapy


    class XkdDribbbleSpiderItem(scrapy.Item):
        title = scrapy.Field()
        image_url = scrapy.Field()  # list of image URLs read by ImagesPipeline
        date = scrapy.Field()
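In settings.py, enable the image pipeline and set the URL field and the storage directory:

    # settings.py
    import os

    # Project directory (the folder that contains settings.py)
    BASE_DIR = os.path.dirname(os.path.abspath(__file__))

    ITEM_PIPELINES = {
        # 'XKD_Dribbble_Spider.pipelines.XkdDribbbleSpiderPipeline': 300,
        # Custom subclass of ImagesPipeline defined in pipelines.py.
        # The built-in class is 'scrapy.pipelines.images.ImagesPipeline'
        # (plural "Images"); 'scrapy.pipelines.images.ImagePipeline' does not exist.
        'XKD_Dribbble_Spider.pipelines.ImagePipeline': 1,
    }

    # Item field that holds the list of image URLs to download
    IMAGES_URLS_FIELD = 'image_url'

    # Directory where downloaded images are stored
    IMAGES_STORE = os.path.join(BASE_DIR, 'images')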
The spider collects the image URL on the list page and fills the item on the detail page:

    # dribbble.py
    import scrapy
    from urllib import parse
    from scrapy.http import Request
    from ..items import XkdDribbbleSpiderItem
    from datetime import datetime


    class DribbbleSpider(scrapy.Spider):
        name = 'dribbble'
        allowed_domains = ['dribbble.com']
        start_urls = ['https://dribbble.com/stories']

        def parse(self, response):
            # Each teaser link carries the story URL and its cover image
            a_nodes = response.css('header div.teaser a')
            for a_node in a_nodes:
                a_url = a_node.css('::attr(href)').extract()[0]
                a_image_url = a_node.css('img::attr(src)').extract()[0]
                # Pass the image URL to the detail-page callback via meta
                yield Request(url=parse.urljoin(response.url, a_url),
                              callback=self.parse_analyse,
                              meta={'a_image_url': a_image_url})

        def parse_analyse(self, response):
            a_image_url = response.meta.get('a_image_url')
            title = response.css('.post header h1::text').extract()[0]
            date = response.css('span.date::text').extract_first()
            date = date.strip()
            date = datetime.strptime(date, '%b %d, %Y').date()
            # Build the item
            dri_item = XkdDribbbleSpiderItem()
            # The key must match the field declared in items.py, and
            # ImagesPipeline expects a list of URLs, not a single string
            dri_item['image_url'] = [a_image_url]
            dri_item['title'] = title
            dri_item['date'] = date
            yield dri_item
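The two standard-library calls the spider relies on, urljoin and strptime, can be tried in isolation. The sketch below uses made-up sample values (the href and the date string are assumptions, not data from the real site) and only shows how a relative link is resolved and how a date in the 'Sep 12, 2019' style is parsed. Note that '%b' depends on the locale, so this assumes English month abbreviations.

```python
from datetime import datetime
from urllib import parse

# Hypothetical sample values, not taken from the real site
page_url = 'https://dribbble.com/stories'
relative_href = '/stories/2019/09/12/example-story'
raw_date = '  Sep 12, 2019 \n'

# urljoin resolves a relative href against the URL of the page it came from
story_url = parse.urljoin(page_url, relative_href)

# strip() removes surrounding whitespace before strptime parses the text;
# '%b %d, %Y' means abbreviated month name, day, comma, four-digit year
story_date = datetime.strptime(raw_date.strip(), '%b %d, %Y').date()

print(story_url)
print(story_date.isoformat())
```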
Finally, pipelines.py subclasses ImagesPipeline to control where each image is saved:

    # pipelines.py
    # Imports needed by the custom ImagePipeline
    import hashlib
    from datetime import datetime

    from scrapy.pipelines.images import ImagesPipeline
    from scrapy.utils.python import to_bytes


    class XkdDribbbleSpiderPipeline(object):
        def process_item(self, item, spider):
            return item


    class ImagePipeline(ImagesPipeline):
        def file_path(self, request, response=None, info=None, *, item=None):
            # The deprecation shim copied from old Scrapy sources (file_key /
            # image_key) is omitted here: recent Scrapy releases no longer
            # define those methods, and file_path always receives a Request.
            image_guid = hashlib.sha1(to_bytes(request.url)).hexdigest()
            # Use the current year as the sub-directory
            return '{}/{}.jpg'.format(datetime.now().year, image_guid)
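To see what kind of path the year-based naming produces, here is a minimal standalone sketch that needs no Scrapy install. The helper name year_based_image_path and the example URL are hypothetical; the logic simply mirrors the idea of hashing the URL with SHA-1 and prefixing the current year.

```python
import hashlib
from datetime import datetime


def year_based_image_path(url, now=None):
    """Return '<year>/<sha1-of-url>.jpg' for an image URL."""
    now = now or datetime.now()
    # The SHA-1 hex digest of the URL gives a stable, filesystem-safe name
    image_guid = hashlib.sha1(url.encode('utf-8')).hexdigest()
    return '{}/{}.jpg'.format(now.year, image_guid)


# Hypothetical URL and fixed date, used only for illustration
path = year_based_image_path('https://example.com/cover.png',
                             now=datetime(2019, 9, 12))
print(path)
```

Because the name is derived from the URL, downloading the same URL twice maps to the same file inside a given year's folder.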
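Running the code prints the scraped items; the fields shown are the ones set on the item built in the spider file. The images are then downloaded into the folder we specified.
An Item Pipeline, as the name suggests, processes and filters scraped data. Its main uses include cleaning HTML data, validating scraped data (checking that the expected fields are present), detecting and dropping duplicates, and saving the results to a database.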
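As an illustration of validating and de-duplicating items, here is a minimal sketch. The class name DedupeTitlePipeline and the choice of title as the uniqueness key are assumptions made for this example; the fallback DropItem class exists only so the snippet also runs where Scrapy is not installed.

```python
try:
    from scrapy.exceptions import DropItem
except ImportError:  # fallback so the sketch also runs without Scrapy
    class DropItem(Exception):
        pass


class DedupeTitlePipeline:
    """Hypothetical pipeline: validate the title, then drop duplicates."""

    def __init__(self):
        self.seen_titles = set()

    def process_item(self, item, spider):
        title = (item.get('title') or '').strip()
        if not title:
            raise DropItem('missing title')
        if title in self.seen_titles:
            raise DropItem('duplicate title: {}'.format(title))
        self.seen_titles.add(title)
        item['title'] = title  # store the cleaned value
        return item


# Feed three plain dicts through the pipeline by hand
pipeline = DedupeTitlePipeline()
kept = []
for raw in [{'title': ' Hello '}, {'title': 'Hello'}, {'title': ''}]:
    try:
        kept.append(pipeline.process_item(dict(raw), spider=None))
    except DropItem:
        pass  # in a real crawl Scrapy logs the dropped item
print(kept)
```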
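Every newly created project comes with a pipelines file. The core pipeline methods are:
open_spider(spider)
: called when the spider is opened; commonly used for initialization, such as opening a database connection or a file;
close_spider(spider)
: called when the spider is closed; commonly used to close the database connection;
process_item(item, spider)
: item is the scraped item and spider is the spider that scraped it. This method is called for every item pipeline component and must either return an item object or raise a DropItem exception; a dropped item is not processed by any later pipeline component;
from_crawler(cls, crawler)
: a class method, commonly used to read configuration from settings.py.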
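The four methods above can be sketched together in one small pipeline. Everything named here is an assumption for illustration: JsonLinesPipeline is a hypothetical class, JSONL_PATH is a made-up setting name, and the SimpleNamespace object merely stands in for Scrapy's crawler so the call order can be exercised by hand.

```python
import json
import os
import tempfile
from types import SimpleNamespace


class JsonLinesPipeline:
    """Hypothetical pipeline that writes each item as one JSON line."""

    def __init__(self, path):
        self.path = path
        self.file = None

    @classmethod
    def from_crawler(cls, crawler):
        # JSONL_PATH is a made-up setting name for this sketch
        return cls(path=crawler.settings.get('JSONL_PATH', 'items.jl'))

    def open_spider(self, spider):
        # Acquire the resource once, when the spider starts
        self.file = open(self.path, 'w', encoding='utf-8')

    def close_spider(self, spider):
        # Release the resource when the spider finishes
        self.file.close()

    def process_item(self, item, spider):
        self.file.write(json.dumps(dict(item), default=str) + '\n')
        return item  # hand the item on to the next pipeline


# Stand-in for Scrapy's crawler object, just to exercise the call order
out_path = os.path.join(tempfile.mkdtemp(), 'items.jl')
fake_crawler = SimpleNamespace(settings={'JSONL_PATH': out_path})

pipeline = JsonLinesPipeline.from_crawler(fake_crawler)
pipeline.open_spider(spider=None)
pipeline.process_item({'title': 'Hello'}, spider=None)
pipeline.close_spider(spider=None)

with open(out_path, encoding='utf-8') as f:
    lines = f.read().splitlines()
print(lines)
```

In a real project Scrapy makes these calls itself once the class is listed in ITEM_PIPELINES.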