Downloading Images into Per-Title Folders with the Scrapy Framework

Date: 2019-11-08


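At my roommate's request, I am going to use Python's Scrapy framework to download the pictures from the Umei image gallery (「優美圖庫」, umei.cc). Without further ado, let's get started.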

1. Sharpen your tools before you start

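The first step is to set up the Scrapy environment. Start by pointing pip at the Aliyun mirror in China: create pip/pip.ini in your user directory with the following content: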

[global]
index-url = https://mirrors.aliyun.com/pypi/simple
trusted-host = mirrors.aliyun.com
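If you would rather generate that file from a script, the following is a minimal sketch using only the standard library. The helper name write_pip_ini is made up for illustration, and the demo writes into a temporary directory instead of the real user directory so that nothing on your machine is overwritten.

```python
import configparser
import tempfile
from pathlib import Path


def write_pip_ini(user_dir):
    """Write pip/pip.ini under user_dir with the mirror settings shown above."""
    config = configparser.ConfigParser()
    config["global"] = {
        "index-url": "https://mirrors.aliyun.com/pypi/simple",
        "trusted-host": "mirrors.aliyun.com",
    }
    target = Path(user_dir) / "pip" / "pip.ini"
    target.parent.mkdir(parents=True, exist_ok=True)
    with target.open("w", encoding="utf-8") as handle:
        config.write(handle)
    return target


# Demo: write into a throwaway directory, then show what was written.
with tempfile.TemporaryDirectory() as demo_dir:
    demo_text = write_pip_ini(demo_dir).read_text(encoding="utf-8")
print(demo_text)
```

To apply it for real you would call write_pip_ini(Path.home()), assuming your user directory is where pip looks for pip/pip.ini on your system.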

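I am on Windows, so I have to download and install the Twisted package by hand; otherwise the installation fails with a message saying that VC++ 14.0 is required.

Download page: https://www.lfd.uci.edu/~gohlke/pythonlibs/#twisted

Find the package that matches your own Python version. Mine is 32-bit Python 3.7 (the school's lab machines all run 32-bit systems...), so I need the build tagged cp37 and win32.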

Then install Scrapy:

$ pip install scrapy

Since we are scraping images we also need the pillow package, and scrapy shell needs win32api, so install both now:

$ pip install pillow pypiwin32

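***Very important***

Python uses indentation to delimit code blocks. Do not mix tabs and spaces, or you will get a syntax error. I use the Notepad++ editor, where tabs can be replaced by spaces under Settings > Preferences > Language > Tab Settings.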
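To see the error for yourself, here is a small sketch that compiles two tiny programs held in strings. The names MIXED, SPACES_ONLY and indentation_error are made up for this illustration. The first program indents one line with a tab and the next with eight spaces, which Python 3 rejects.

```python
# Line 2 is indented with a tab, line 3 with eight spaces.
MIXED = "if True:\n\tx = 1\n        y = 2\n"
# Same program, indented with four spaces throughout.
SPACES_ONLY = "if True:\n    x = 1\n    y = 2\n"


def indentation_error(source):
    """Return the name of the indentation error raised when compiling source, or None."""
    try:
        compile(source, "<demo>", "exec")
    except IndentationError as exc:  # TabError is a subclass of IndentationError
        return type(exc).__name__
    return None


print(indentation_error(MIXED))
print(indentation_error(SPACES_ONLY))
```

The mixed version fails with a TabError, which is a kind of SyntaxError, while the spaces-only version compiles cleanly.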

With the preparation done, let's begin our crawling journey.

2. Scrapy basics

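First create a new project:

$ scrapy startproject girls

Here girls is the project name. The command creates a girls folder whose directory structure looks roughly like this:

girls/
    scrapy.cfg
    girls/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py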

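The main files are:

scrapy.cfg: the project's configuration file

girls/items.py: the project's item definitions (the data we want to scrape)

girls/pipelines.py: the project's pipeline file

girls/settings.py: the project's settings file

girls/spiders/: the directory that holds the spider code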

Next we need to define the target: we want to grab all the pictures listed at https://www.umei.cc/tags/meinv.htm. Open items.py in the girls directory and edit the GirlsItem class:

import scrapy


class GirlsItem(scrapy.Item):
    # define the fields for your item here like:
    name = scrapy.Field()       # image name (relative path used when saving)
    image_url = scrapy.Field()  # image URL

Then change into the girls directory and run the following command to create a spider named girl:

$ scrapy genspider girl umei.cc

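A new file, girl.py, appears in the spiders directory; most of our work goes into this spider file. First change the start URL to:

start_urls = ['http://www.umei.cc/tags/meinv.htm']

The code that analyses a page is written in the parse method.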

3. Downloading the images

Now let's analyse the target page. Open it in Chrome and, in a console, open the same URL with scrapy shell:

$ scrapy shell http://www.umei.cc/tags/meinv.htm

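In the browser, right-click an image and choose Inspect. Every page turns out to be a TypeList list, so grab the hyperlinks under TypeList:

>>> links = response.xpath('//div[@class="TypeList"]/ul/li/a/@href').extract()

You can print links to take a look.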
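To get a feel for what that selector returns without touching the site, here is an offline sketch. It makes two loud assumptions: the HTML fragment is invented (only the div/ul/li/a nesting mirrors the XPath above, and the second link is made up), and it uses the standard library's xml.etree.ElementTree in place of Scrapy's selectors. ElementTree supports only a subset of XPath and cannot select /@href directly, so the attribute is read with .get().

```python
import xml.etree.ElementTree as ET

# Invented, simplified markup; the real page is far larger and may differ.
SAMPLE = """
<html><body>
  <div class="TypeList">
    <ul>
      <li><a href="/meinvtupian/meinvxiezhen/194087.htm">first</a></li>
      <li><a href="/meinvtupian/meinvxiezhen/194088.htm">second</a></li>
    </ul>
  </div>
  <div class="Other">
    <ul><li><a href="/ignored.htm">not in TypeList</a></li></ul>
  </div>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())

# Same structure as the Scrapy XPath: anchors inside div.TypeList > ul > li.
anchors = root.findall(".//div[@class='TypeList']/ul/li/a")
links = [a.get("href") for a in anchors]
print(links)
```

Only the two links inside the TypeList div are collected; the anchor in the other div is skipped because the class predicate does not match.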

Open any one of those links:

$ scrapy shell https://www.umei.cc/meinvtupian/meinvxiezhen/194087.htm

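The page shows a series of pictures on one theme. Our goal is to store these pictures in folders named after the title, with the files numbered in order.

So first we locate the image:

>>> img = response.xpath('//div[@class="ImageBody"]/p/a/img')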

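The alt attribute of img is the folder name we want, and src is the picture to download. You can print the image URL to check it, for example with img.xpath('@src').extract_first().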

Find the index of the current picture so that the files can be numbered in order:

>>> index = response.xpath('//li[@class="thisclass"]/a/text()').extract_first()

Some pictures may be png or another format, so to make the program more robust we take the file extension from the end of the URL:

# create a new image item
item = GirlsItem()
item['image_url'] = img.xpath('@src').extract_first()
image_type = item['image_url'].split('.')[-1]  # file extension of the image
# name is "<title folder>/<index>.<extension>"
item['name'] = img.xpath('@alt').extract_first() + '/' + index + '.' + image_type
yield item
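Splitting on '.' works for plain URLs but would misbehave if a URL carried a query string or had no extension at all, and a title containing characters such as ':' or '?' cannot be used as a Windows folder name. The following is a sketch of a more defensive way to build the name, using only the standard library. The helpers image_extension and build_name, the 'jpg' fallback, the lowercasing and the '_' replacement are all assumptions of this sketch, not part of the original spider.

```python
import posixpath
import re
from urllib.parse import urlsplit


def image_extension(url, default="jpg"):
    """Return the extension found in the URL path, ignoring any query string."""
    path = urlsplit(url).path
    ext = posixpath.splitext(path)[1].lstrip(".")
    # Fall back to an assumed default when the URL has no extension.
    return ext.lower() or default


def build_name(alt, index, url):
    """Build '<folder>/<index>.<ext>', the same shape as item['name'] above."""
    # Replace characters that Windows forbids in file and folder names.
    folder = re.sub(r'[\\/:*?"<>|]', "_", alt).strip()
    return "{}/{}.{}".format(folder, index, image_extension(url))


print(build_name("a:b", "3", "http://example.com/img/x.PNG?v=1"))
```

In the spider this would replace the two lines that compute image_type and item['name'], for example item['name'] = build_name(alt, index, item['image_url']).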

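Next we need to move on to the next picture. Find the pagination bar at the bottom of the page and extract the link to the next page:

>>> next_page = response.xpath('//div[@class="NewPages"]/ul/li/a/@href')[-1].extract()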

Note that this is a relative address, while we need an absolute one. Fortunately Scrapy already solves this for us:

>>> response.urljoin(next_page)
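As far as I know, response.urljoin resolves links the same way as the standard library's urllib.parse.urljoin, using the page's own URL as the base. The sketch below shows that resolution offline. The relative link 194087_2.htm is a made-up example, not a value taken from the site.

```python
from urllib.parse import urljoin

page_url = "https://www.umei.cc/meinvtupian/meinvxiezhen/194087.htm"

# A link relative to the current directory (hypothetical file name).
absolute = urljoin(page_url, "194087_2.htm")
print(absolute)
# https://www.umei.cc/meinvtupian/meinvxiezhen/194087_2.htm

# A link that starts with '/' is resolved against the site root instead.
root_relative = urljoin(page_url, "/tags/meinv.htm")
print(root_relative)
# https://www.umei.cc/tags/meinv.htm
```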

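The listing page we started from also has a pagination bar at the bottom, namely the pagination of the TypeList, and of course we must not skip it.

Go back to the listing page's URL in the shell, then run:

>>> next_url = response.xpath('//div[@class="NewPages"]/ul/li')[-2].xpath('a/@href').extract_first()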

This extracts the link behind 「下一頁」 (next page).
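Indexing with [-2] or [-1] raises IndexError when a page has fewer pagination entries than expected. A small, hedged sketch of a guard is shown below. The helper pick_href and the sample lists are hypothetical; the positions (-2 on listing pages, -1 on picture pages) and the '#' placeholder on the last picture follow the selectors and spider code in this article.

```python
def pick_href(hrefs, position):
    """Return hrefs[position], or None when it is missing or a '#' placeholder."""
    try:
        href = hrefs[position]
    except IndexError:
        return None
    if not href or href == "#":
        return None
    return href


# Hypothetical pagination links of a listing page: [..., next, last].
listing_links = ["meinv_1.htm", "meinv_2.htm", "meinv_9.htm"]
# Hypothetical pagination links of the final picture page: next is '#'.
last_picture_links = ["194087_8.htm", "#"]

print(pick_href(listing_links, -2))
print(pick_href(last_picture_links, -1))
print(pick_href([], -2))
```

In the spider, the list would come from the extracted href values, and a None result simply means there is nothing further to request.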

All that is left is to follow the flow above and crawl every picture recursively. The complete girl.py looks like this:

# -*- coding: utf-8 -*-
import scrapy
from girls.items import GirlsItem


class GirlSpider(scrapy.Spider):
    name = 'girl'
    allowed_domains = ['umei.cc']
    start_urls = ['http://www.umei.cc/tags/meinv.htm']

    def parse(self, response):
        # Follow every theme link on the listing page.
        links = response.xpath('//div[@class="TypeList"]/ul/li/a/@href').extract()
        for url in links:
            yield scrapy.Request(response.urljoin(url), callback=self.myParse)

        # Follow the "next page" link of the listing itself.
        next_url = response.xpath('//div[@class="NewPages"]/ul/li')[-2].xpath('a/@href').extract_first()
        if next_url:
            yield scrapy.Request(response.urljoin(next_url), callback=self.parse)

    # Download the pictures inside each theme.
    def myParse(self, response):
        img = response.xpath('//div[@class="ImageBody"]/p/a/img')
        index = response.xpath('//li[@class="thisclass"]/a/text()').extract_first()

        item = GirlsItem()
        item['image_url'] = img.xpath('@src').extract_first()
        image_type = item['image_url'].split('.')[-1]  # file extension
        # Folder is the picture title (alt); file name is the page index.
        item['name'] = img.xpath('@alt').extract_first() + '/' + index + '.' + image_type
        yield item

        # On the last picture the "next" link is just '#'.
        next_page = response.xpath('//div[@class="NewPages"]/ul/li/a/@href')[-1].extract()
        if next_page != '#':
            yield scrapy.Request(response.urljoin(next_page), callback=self.myParse)

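The important part is writing the pipeline file (pipelines.py). We subclass Scrapy's ImagesPipeline: first override get_media_requests to pack the file name into the request's metadata (meta), then override the path function file_path to read that file name back. Modify pipelines.py as follows: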

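from scrapy.pipelines.images import ImagesPipeline
from scrapy import Request


class GirlsPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Carry the target file name along with the download request.
        yield Request(item['image_url'], meta={'name': item['name']})

    # Newer Scrapy versions also pass item as a keyword argument.
    def file_path(self, request, response=None, info=None, *, item=None):
        # Save the file under "<title folder>/<index>.<extension>".
        return request.meta['name']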
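A pipeline only runs once it is enabled in settings.py, and ImagesPipeline also needs to know where to store files. The fragment below is a minimal sketch of those two settings. The module path follows from the project and class names used in this article; the priority 300 and the folder name 'images' are placeholder values of my own choosing.

```python
# girls/settings.py (fragment)

# Enable the custom pipeline; 300 is an arbitrary priority (assumption).
ITEM_PIPELINES = {
    'girls.pipelines.GirlsPipeline': 300,
}

# Root directory for downloads; 'images' is a placeholder name (assumption).
IMAGES_STORE = 'images'
```

The path returned by file_path is interpreted relative to IMAGES_STORE, so with these values a picture would end up under images/<title folder>/<index>.<extension>.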