ImagesPipeline
When scraping with the Scrapy framework, besides downloading text we often need to download images as well; Scrapy provides ImagesPipeline for exactly this.
ImagesPipeline also supports the following extra features:
1 Generating thumbnails: configure IMAGES_THUMBS = {'size_name': (width_size, height_size),}
2 Filtering out small images: configure IMAGES_MIN_HEIGHT and IMAGES_MIN_WIDTH to drop images that are too small.
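Both features are enabled purely through settings, with no pipeline code. A sketch of the relevant settings.py lines (the size names and pixel values below are illustrative, not from this project):

```python
# settings.py -- illustrative values
IMAGES_THUMBS = {
    'small': (50, 50),    # thumbnails saved under thumbs/small/
    'big': (270, 270),    # thumbnails saved under thumbs/big/
}
IMAGES_MIN_HEIGHT = 110   # drop images shorter than 110 px
IMAGES_MIN_WIDTH = 110    # drop images narrower than 110 px
```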
For the other features, see the official manual: https://docs.scrapy.org/en/latest/topics/media-pipeline.html
1 In the spider, scrape the image links you want to download and put them into the item's image_urls field.
2 The spider passes the item on to the pipeline.
3 When ImagesPipeline processes the item, it checks for an image_urls field; if present, it hands the URLs to the Scrapy scheduler and downloader.
4 When the downloads finish, the results are written into another item field, images, which contains each image's local path, checksum, and URL.
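The `images` field from step 4 mirrors the `results` list that `ImagesPipeline.item_completed` receives: a list of `(success, info)` tuples. A minimal sketch of picking out the successful local paths (the URL, path, and checksum values here are made up for illustration):

```python
# Shape of the results that item_completed receives: (success, info_or_failure).
# The values below are illustrative, not real download output.
results = [
    (True, {'url': 'http://example.com/a.jpg',
            'path': 'full/0a79c461a4062ac383dc4fade7bc09f1384a3910.jpg',
            'checksum': 'b9628c4ab9b595f72f280b90c4fd093d'}),
    (False, Exception('download failed')),
]

# Keep only the local paths of successful downloads.
image_paths = [info['path'] for ok, info in results if ok]
print(image_paths)
```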
This first example only crawls the first page.
Step 1: items.py
```python
import scrapy


class Happy4Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
```
The spider file, lol.py:
```python
# -*- coding: utf-8 -*-
import scrapy

from happy4.items import Happy4Item


class LolSpider(scrapy.Spider):
    name = 'lol'
    allowed_domains = ['lol.tgbus.com']
    start_urls = ['http://lol.tgbus.com/tu/yxmt/']

    def parse(self, response):
        li_list = response.xpath('//div[@class="list cf mb30"]/ul//li')
        for one_li in li_list:
            item = Happy4Item()
            item['image_urls'] = one_li.xpath('./a/img/@src').extract()
            yield item
```
Finally, settings.py:
```python
BOT_NAME = 'happy4'

SPIDER_MODULES = ['happy4.spiders']
NEWSPIDER_MODULE = 'happy4.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False

ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 1,
}
IMAGES_STORE = 'images'
```
No pipeline code of our own is needed to download the images locally, which greatly reduces the amount of code.
Note: the images pipeline tries to convert every image to JPEG; if you read its source, you will see the file extension is hard-coded to .jpg. So if you want to keep images in their original format, use the files pipeline instead.
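Switching to the files pipeline only takes different settings and item fields; a sketch (the item class name `RawImageItem` is made up, while `file_urls`/`files` are the field names FilesPipeline actually looks for):

```python
# settings.py -- enable FilesPipeline instead of ImagesPipeline
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 1,
}
FILES_STORE = 'files'

# items.py -- FilesPipeline reads file_urls and writes results to files
import scrapy

class RawImageItem(scrapy.Item):
    file_urls = scrapy.Field()
    files = scrapy.Field()
```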
The second example targets the site crawled by the spider below.
This site has anti-scraping measures: when you try to download an image directly, the URL gets redirected elsewhere, or you may simply get a 302 error. That is because it checks the Referer field of the request headers: when you open an image URL, your request must carry a Referer, otherwise you are identified as a crawler and blocked. So how do we solve this?
By manually adding the Referer header when requesting each image.
The spider file, xingan.py:
```python
# -*- coding: utf-8 -*-
import re

import scrapy

from happy5.items import Happy5Item


class XinganSpider(scrapy.Spider):
    name = 'xingan'
    allowed_domains = ['www.mm131.com']
    start_urls = ['http://www.mm131.com/xinggan/']

    def parse(self, response):
        every_html = response.xpath('//div[@class="main"]/dl//dd')
        for one_html in every_html[0:-1]:
            item = Happy5Item()
            # Link to each album
            link = one_html.xpath('./a/@href').extract_first()
            # Title of each album
            title = one_html.xpath('./a/img/@alt').extract_first()
            item['title'] = title
            # Follow the link into the album
            request = scrapy.Request(url=link, callback=self.parse_one,
                                     meta={'item': item})
            yield request

    # The album under each entry
    def parse_one(self, response):
        item = response.meta['item']
        # Find the total number of pages
        total_page = response.xpath('//div[@class="content-page"]'
                                    '/span[@class="page-ch"]/text()').extract_first()
        num = int(re.findall('(\d+)', total_page)[0])
        # Find the current page number
        now_num = response.xpath('//div[@class="content-page"]'
                                 '/span[@class="page_now"]/text()').extract_first()
        now_num = int(now_num)
        # Image URLs on the current page
        every_pic = response.xpath('//div[@class="content-pic"]/a/img/@src').extract()
        item['image_urls'] = every_pic
        # Referer for the current images
        item['referer'] = response.url
        yield item
        # If the current page number is below the total, request the next page
        if now_num < num:
            if now_num == 1:
                url1 = response.url[0:-5] + '_%d' % (now_num + 1) + '.html'
            elif now_num > 1:
                url1 = re.sub('_(\d+)', '_' + str(now_num + 1), response.url)
            headers = {
                'referer': self.start_urls[0],
            }
            # Send the request for the next page
            yield scrapy.Request(url=url1, headers=headers,
                                 callback=self.parse_one, meta={'item': item})
```
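The next-page URL arithmetic in parse_one is easy to get wrong, so here it is as a standalone function you can check without Scrapy (the URLs are illustrative):

```python
import re


def next_page_url(current_url, now_num):
    """Build the URL of the next album page.

    Page 1 looks like .../2373.html while later pages look like
    .../2373_2.html, so from page 1 we append "_2" before ".html",
    and from later pages we bump the existing "_N" suffix.
    """
    if now_num == 1:
        return current_url[0:-5] + '_%d' % (now_num + 1) + '.html'
    return re.sub('_(\d+)', '_' + str(now_num + 1), current_url)


print(next_page_url('http://www.mm131.com/xinggan/2373.html', 1))
# http://www.mm131.com/xinggan/2373_2.html
print(next_page_url('http://www.mm131.com/xinggan/2373_2.html', 2))
# http://www.mm131.com/xinggan/2373_3.html
```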
items.py
```python
import scrapy


class Happy5Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    image_urls = scrapy.Field()
    images = scrapy.Field()
    title = scrapy.Field()
    referer = scrapy.Field()
```
pipelines.py
```python
from scrapy.exceptions import DropItem
from scrapy.http import Request
from scrapy.pipelines.images import ImagesPipeline


class Happy5Pipeline(object):
    def process_item(self, item, spider):
        return item


class QiushiImagePipeline(ImagesPipeline):
    # Add the referer request header when downloading each image
    def get_media_requests(self, item, info):
        for image_url in item['image_urls']:
            headers = {'referer': item['referer']}
            # Pass the item along: file_path later needs its title
            # to build the file name
            yield Request(image_url, meta={'item': item}, headers=headers)

    # Inspect the download results (visible in the console)
    def item_completed(self, results, item, info):
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        return item

    # Customize the file name and path
    def file_path(self, request, response=None, info=None):
        item = request.meta['item']
        image_guid = request.url.split('/')[-1]
        filename = './{}/{}'.format(item['title'], image_guid)
        return filename
```
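The renaming done in file_path boils down to a pure function: join the item's title with the last segment of the image URL. A quick check outside Scrapy (the album title and URL below are hypothetical):

```python
def build_file_path(title, image_url):
    # Same naming rule as the pipeline's file_path override:
    # ./<title>/<last URL segment>
    image_guid = image_url.split('/')[-1]
    return './{}/{}'.format(title, image_guid)


print(build_file_path('some-album', 'http://img1.mm131.me/pic/2373/1.jpg'))
# ./some-album/1.jpg
```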
settings.py
```python
BOT_NAME = 'happy5'

SPIDER_MODULES = ['happy5.spiders']
NEWSPIDER_MODULE = 'happy5.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
USER_AGENT = 'Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Trident/5.0'
# Obey robots.txt rules
ROBOTSTXT_OBEY = False

ITEM_PIPELINES = {
    # 'scrapy.pipelines.images.ImagesPipeline': 1,
    'happy5.pipelines.QiushiImagePipeline': 2,
}
IMAGES_STORE = 'images'
```
And we get the image files.
Still, best not to look at pictures like these too often.
Original post: https://www.cnblogs.com/xiaozx/p/10776694.html