Web Scraping in Practice: Downloading BMW Images from Autohome with the Scrapy Framework

Date: 2019-11-11

(1) Introduction

Scrapy provides two Item Pipelines dedicated to downloading files and images:

FilesPipeline

ImagesPipeline

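(2) Benefits of Scrapy's built-in download pipelines

1. They avoid downloading the same file twice.

2. The download path is easy to specify.

3. Format conversion is easy; for example, downloaded images can be normalized to a common format such as JPG.

4. Thumbnails are easy to generate.

5. Image dimensions are easy to handle, for example to filter out images that are too small.

6. Downloads are asynchronous, which makes them efficient.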
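
Several of these benefits are switched on through settings rather than code. A minimal sketch of a settings.py fragment, assuming the ImagesPipeline is enabled (it needs the Pillow library); the sizes and the expiry period are illustrative values, not ones used in this project:

```python
# settings.py (fragment): illustrative values, not taken from the original project

# Generate thumbnails alongside the full-size images; the keys are arbitrary labels.
IMAGES_THUMBS = {
    "small": (50, 50),
    "big": (270, 270),
}

# Skip images whose width or height is below these limits (in pixels).
IMAGES_MIN_WIDTH = 110
IMAGES_MIN_HEIGHT = 110

# Do not re-download images that were fetched within the last 90 days.
IMAGES_EXPIRES = 90
```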

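(3) The more traditional way to download images with Scrapy

1. Create the project with scrapy startproject baoma, change into it with cd baoma, then create the spider with scrapy genspider spider car.autohome.com.cn

2. Open the project in PyCharm

Edit settings.py

Do not obey the robots.txt protocol

Set the request headers

Enable the pipeline defined in pipelines.py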
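
The three settings.py changes can be sketched as follows. The header values, including the User-Agent string, are placeholders assumed for illustration; the pipeline path assumes the project is named baoma and the class is BaomaPipeline, as in this article:

```python
# settings.py (fragment): a sketch, header values are assumptions

# Do not obey robots.txt.
ROBOTSTXT_OBEY = False

# Default headers sent with every request; the User-Agent is a placeholder.
DEFAULT_REQUEST_HEADERS = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Language": "en",
    "User-Agent": "Mozilla/5.0 (placeholder user agent)",
}

# Enable the pipeline from pipelines.py; pipelines with lower numbers run first.
ITEM_PIPELINES = {
    "baoma.pipelines.BaomaPipeline": 300,
}
```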

Edit spider.py

# -*- coding: utf-8 -*-
import scrapy
from ..items import BaomaItem


class SpiderSpider(scrapy.Spider):
    name = 'spider'
    allowed_domains = ['car.autohome.com.cn']
    start_urls = ['https://car.autohome.com.cn/pic/series/65.html']

    def parse(self, response):
        # response.xpath() returns a SelectorList; the first uibox is not needed
        uiboxs = response.xpath('//div[@class = "uibox"]')[1:]
        for uibox in uiboxs:
            catagory = uibox.xpath('.//div[@class = "uibox-title"]/a/text()').get()
            urls = uibox.xpath('.//ul/li/a/img/@src').getall()
            # The src values lack a scheme, so turn them into absolute URLs.
            # Option 1: prepend the scheme by hand
            # for url in urls:
            #     url = 'https:' + url
            # Option 2: let Scrapy resolve each URL against the page URL
            # for url in urls:
            #     url = response.urljoin(url)
            # Option 3: map() applies the lambda to every element and returns a
            # map object, which list() converts back into a list
            urls = list(map(lambda url: response.urljoin(url), urls))
            item = BaomaItem(catagory=catagory, urls=urls)
            yield item
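
To see what response.urljoin does to each src value, here is a standalone sketch using urllib.parse.urljoin from the standard library, which resolves a URL against a base in the same way for an ordinary page. The src value is an assumption about what the page returns (a protocol-relative URL), built from the thumbnail address quoted later in this article:

```python
from urllib.parse import urljoin

# URL of the page being parsed (the spider's start URL).
page_url = "https://car.autohome.com.cn/pic/series/65.html"

# Assumed shape of the extracted src values: protocol-relative URLs.
srcs = [
    "//car3.autoimg.cn/cardfs/product/g24/M08/2F/9E/"
    "t_autohomecar__wKgHIVpogfqAIlTbAAUzcUgKoGY701.jpg",
]

# A list comprehension is an equivalent form of list(map(lambda ...)).
urls = [urljoin(page_url, src) for src in srcs]
print(urls[0])
```

The scheme is taken from the page URL, so the result starts with https:// while the host and path come from the src value.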

Edit pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import os
from urllib import request


class BaomaPipeline(object):
    def __init__(self):
        # Store images in an "images" directory next to this file
        self.path = os.path.join(os.path.dirname(__file__), 'images')
        if not os.path.exists(self.path):  # create the directory if it is missing
            os.mkdir(self.path)

    def process_item(self, item, spider):
        # Store the images by category
        catagory = item['catagory']
        urls = item['urls']

        catagory_path = os.path.join(self.path, catagory)
        if not os.path.exists(catagory_path):  # create the category directory if it is missing
            os.mkdir(catagory_path)

        for url in urls:
            image_name = url.split('_')[-1]  # split on "_" and keep the last part
            request.urlretrieve(url, os.path.join(catagory_path, image_name))

        return item
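
The path handling in process_item can be tried without any network access. The sketch below is not part of the project: image_target is a hypothetical helper, "exterior" is a made-up category, and os.makedirs(..., exist_ok=True) is swapped in for the exists/mkdir pair because it also creates missing parent directories:

```python
import os
import tempfile


def image_target(base_dir, catagory, url):
    """Return the local path for url under base_dir/catagory, creating the directory."""
    catagory_path = os.path.join(base_dir, catagory)
    os.makedirs(catagory_path, exist_ok=True)  # no error if it already exists
    image_name = url.split("_")[-1]  # the part after the last underscore
    return os.path.join(catagory_path, image_name)


url = (
    "https://car3.autoimg.cn/cardfs/product/g24/M08/2F/9E/"
    "t_autohomecar__wKgHIVpogfqAIlTbAAUzcUgKoGY701.jpg"
)
with tempfile.TemporaryDirectory() as base_dir:
    target = image_target(base_dir, "exterior", url)
    print(os.path.basename(target))
```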

Create a test script (main.py)

# author: "xian"
# date: 2018/6/14
from scrapy import cmdline

cmdline.execute('scrapy crawl spider'.split())

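Result: the images download successfully, grouped into one directory per catagory, with each file named after the part of its URL that follows the last underscore.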

Scrapy provides two built-in item pipelines: the Files Pipeline for downloading files and the Images Pipeline for downloading images.

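The Files Pipeline for downloading files

Steps:

1. Define an item with two fields, file_urls and files. file_urls is a list holding the URLs of the files to download.

2. When the downloads finish, information about them is stored in the item's files field, for example the download path, the source URL, and the file checksum.

3. In the settings.py configuration file, set FILES_STORE to the directory where files should be saved.

4. Enable the pipeline by adding scrapy.pipelines.files.FilesPipeline: 1 to ITEM_PIPELINES.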
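
The settings side of these steps can be sketched as below. The "downloads" directory name and the example.com URL are assumptions for illustration, and the item is shown as a plain dict only to make the two field names visible:

```python
# settings.py (fragment): a sketch, the directory name is an assumption
import os

# Directory where the Files Pipeline saves downloaded files.
FILES_STORE = os.path.abspath("downloads")

# Enable the built-in pipeline.
ITEM_PIPELINES = {
    "scrapy.pipelines.files.FilesPipeline": 1,
}

# Shape of an item as the spider yields it (hypothetical URL):
# file_urls lists what to download, files is filled in by the pipeline.
item = {
    "file_urls": ["https://example.com/brochure.pdf"],
    "files": [],
}
```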

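The Images Pipeline for downloading images

Steps:

1. Define an item with two fields, image_urls and images. image_urls is a list holding the URLs of the images to download.

2. When the downloads finish, information about them is stored in the item's images field, for example the download path, the source URL, and the file checksum.

3. In the settings.py configuration file, set IMAGES_STORE to the directory where images should be saved.

4. Enable the pipeline by adding scrapy.pipelines.images.ImagesPipeline: 1 to ITEM_PIPELINES.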

(4) Downloading images with the Images Pipeline (again using Autohome images as the example)

Edit settings.py

Enable the custom pipeline
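
A sketch of what this settings.py change might look like, assuming the project is named baoma and the custom class is the BMWImagesPipeline defined in pipelines.py; the "images" directory name is an assumption:

```python
# settings.py (fragment): a sketch, the directory name is an assumption
import os

# Directory where the images pipeline saves downloaded images.
IMAGES_STORE = os.path.abspath("images")

# Replace the hand-written pipeline with the ImagesPipeline subclass.
ITEM_PIPELINES = {
    # "baoma.pipelines.BaomaPipeline": 300,
    "baoma.pipelines.BMWImagesPipeline": 1,
}
```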

Edit pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import os
from urllib import request

from scrapy.pipelines.images import ImagesPipeline


class BaomaPipeline(object):
    def __init__(self):
        # Store images in an "images" directory next to this file
        self.path = os.path.join(os.path.dirname(__file__), 'images')
        if not os.path.exists(self.path):  # create the directory if it is missing
            os.mkdir(self.path)

    def process_item(self, item, spider):
        # Store the images by category
        catagory = item['catagory']
        urls = item['urls']

        catagory_path = os.path.join(self.path, catagory)
        if not os.path.exists(catagory_path):  # create the category directory if it is missing
            os.mkdir(catagory_path)

        for url in urls:
            image_name = url.split('_')[-1]  # split on "_" and keep the last part
            request.urlretrieve(url, os.path.join(catagory_path, image_name))

        return item


class BMWImagesPipeline(ImagesPipeline):  # inherit from ImagesPipeline
    # Called before the download requests are sent; it builds those requests
    def get_media_requests(self, item, info):
        request_objects = super(BMWImagesPipeline, self).get_media_requests(item, info)
        for request_object in request_objects:
            request_object.item = item  # attach the item so file_path() can read it
        return request_objects

    # Called when an image is about to be stored; returns its storage path.
    # The keyword-only item argument is accepted because newer Scrapy versions pass it.
    def file_path(self, request, response=None, info=None, *, item=None):
        path = super(BMWImagesPipeline, self).file_path(request, response, info)
        catagory = request.item.get('catagory')
        image_name = path.replace('full/', '')
        # Return a path relative to IMAGES_STORE. The pipeline joins it with
        # IMAGES_STORE and creates the category directory itself, so the
        # original "import settings" (which fails inside a package) and the
        # manual os.mkdir() are not needed.
        return os.path.join(catagory, image_name)
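
The path.replace('full/', '') step is easier to follow once the parent's return value is known. The standalone sketch below assumes the default naming scheme of the Images Pipeline, full/<SHA-1 of the URL>.jpg; both helper functions and the "exterior" category are hypothetical names used only for illustration:

```python
import hashlib
import os


def default_image_path(url):
    """Mimic the assumed default naming scheme: 'full/<sha1 of the URL>.jpg'."""
    image_guid = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return f"full/{image_guid}.jpg"


def category_image_path(url, catagory):
    """Drop the 'full/' prefix and file the image under its category instead."""
    image_name = default_image_path(url).replace("full/", "")
    return os.path.join(catagory, image_name)


url = (
    "https://car3.autoimg.cn/cardfs/product/g24/M08/2F/9E/"
    "autohomecar__wKgHIVpogfqAIlTbAAUzcUgKoGY701.jpg"
)
print(default_image_path(url))
print(category_image_path(url, "exterior"))
```

Because the name is a hash of the URL, the same URL always maps to the same file, which is how repeated downloads can be detected.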

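Results:

Comparing the two runs, the download speed is noticeably higher.

Next we improve the images by fetching the high-resolution versions.

Comparing the thumbnail URL with the high-resolution URL shows that the only difference is an extra t_ in the thumbnail's file name.

Thumbnail URL: https://car3.autoimg.cn/cardfs/product/g24/M08/2F/9E/t_autohomecar__wKgHIVpogfqAIlTbAAUzcUgKoGY701.jpg

High-resolution URL: https://car3.autoimg.cn/cardfs/product/g24/M08/2F/9E/autohomecar__wKgHIVpogfqAIlTbAAUzcUgKoGY701.jpg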
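
A small sketch of this conversion using only the standard library. The to_high_res helper is a hypothetical name; it strips t_ only when it is the prefix of the file name, which is a slightly stricter assumption than replacing every t_ in the URL:

```python
import posixpath
from urllib.parse import urlsplit, urlunsplit


def to_high_res(url):
    """Strip a leading 't_' from the file name only, leaving the rest of the URL alone."""
    parts = urlsplit(url)
    directory, name = posixpath.split(parts.path)
    if name.startswith("t_"):
        name = name[len("t_"):]
    return urlunsplit(parts._replace(path=posixpath.join(directory, name)))


thumb = (
    "https://car3.autoimg.cn/cardfs/product/g24/M08/2F/9E/"
    "t_autohomecar__wKgHIVpogfqAIlTbAAUzcUgKoGY701.jpg"
)
full = to_high_res(thumb)
print(full)
```

Applying the function to a URL that is already high-resolution leaves it unchanged.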

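(5) Fetching all of the high-resolution images

The traditional approach would be to find the URL behind each "more" link, open the detail page, and then work out the pagination endpoint. That clearly adds a lot of work, so instead we use Scrapy's CrawlSpider: once the rules for which links to follow are specified, the spider crawls them automatically, which saves time and effort.

First, look at the pattern in the URLs:

https://car.autohome.com.cn/pic/series/65-1.html (first page behind "more")

https://car.autohome.com.cn/pic/series/65-1-p2.html (second page behind "more")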
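
Both URLs share the prefix https://car.autohome.com.cn/pic/series/65, which is what a link-matching regular expression can key on. A sketch with the re module, assuming the pattern is applied with search semantics; the third URL is a hypothetical address for a different series, included to show a non-match:

```python
import re

# Pattern built on the shared prefix, with the dots in the host name escaped.
allow = re.compile(r"https://car\.autohome\.com\.cn/pic/series/65.+")

candidates = [
    "https://car.autohome.com.cn/pic/series/65-1.html",
    "https://car.autohome.com.cn/pic/series/65-1-p2.html",
    "https://car.autohome.com.cn/pic/series/66.html",  # hypothetical other series
]
matched = [u for u in candidates if allow.search(u)]
print(matched)
```

Note that 65.+ is loose: it also matches the series page itself (65.html) and would match any series id that merely starts with 65. A stricter pattern such as 65-.+ may be preferable if that matters.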

Edit spider.py

# -*- coding: utf-8 -*-
import scrapy
from scrapy.spiders import CrawlSpider, Rule  # CrawlSpider uses parse() internally, so the callback needs another name
from scrapy.linkextractors import LinkExtractor  # link extraction
from ..items import BaomaItem


class SpiderSpider(CrawlSpider):
    name = 'spider'
    allowed_domains = ['car.autohome.com.cn']
    start_urls = ['https://car.autohome.com.cn/pic/series/65.html']

    # callback names the method that parses each matched page; follow=True keeps
    # following matching links (such as the next page) found on those pages
    rules = (
        Rule(LinkExtractor(allow=r'https://car.autohome.com.cn/pic/series/65.+'), callback='parse_page', follow=True),
    )

    def parse_page(self, response):  # page parsing callback
        catagory = response.xpath('//div[@class = "uibox"]/div/text()').get()
        srcs = response.xpath('//div[contains(@class,"uibox-con")]/ul/li//img/@src').getall()
        # Drop the "t_" marker to turn thumbnail URLs into high-resolution URLs;
        # map() applies the function to every element and list() collects the results
        srcs = list(map(lambda x: x.replace('t_', ''), srcs))
        # Equivalent loop for the next line:
        # urls = []
        # for src in srcs:
        #     urls.append(response.urljoin(src))
        srcs = list(map(lambda x: response.urljoin(x), srcs))
        yield BaomaItem(catagory=catagory, image_urls=srcs)