Python Crawler - Scrapy - Scraping Meizitu Lv3

Date: 2019-12-04

Tags: python, crawler, scrapy, lv3


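0. Preface

As planned, this post wraps up the series on the Meizitu crawler. This time I rewrote the pipeline so that images are renamed and stored in separate directories.

1. pipelines

The source is simple, so here it is directly...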

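from scrapy.http import Request
from scrapy.pipelines.images import ImagesPipeline


class SpiderMeizituPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Carry the group name and image index in meta so file_path can read them.
        meta = {'imagegroup': item['images'], 'imageindex': item['index']}
        return [Request(item['image_urls'][0], meta=meta)]

    # item=None keeps this override compatible with newer Scrapy releases,
    # which pass the item to file_path as a keyword argument.
    def file_path(self, request, response=None, info=None, *, item=None):
        imagegroup = request.meta['imagegroup']
        imageindex = request.meta['imageindex']
        # Stored as full/<group>/<index>.jpg under the image store directory.
        return 'full/%s/%s.jpg' % (imagegroup, imageindex)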
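
For a pipeline class like this to run, it also has to be enabled in the project settings. A minimal sketch of the relevant settings.py entries follows. ITEM_PIPELINES and IMAGES_STORE are standard Scrapy settings, but the module path spider_meizitu.pipelines and the ./images directory are assumptions made for illustration, not taken from the original project.

```python
# settings.py (sketch)

# Hypothetical module path; adjust it to the actual project package name.
ITEM_PIPELINES = {
    "spider_meizitu.pipelines.SpiderMeizituPipeline": 1,
}

# Assumed output directory; paths returned by file_path are relative to it.
IMAGES_STORE = "./images"
```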

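There are plenty of articles online about renaming images after scraping them with Scrapy. The code differs a little from one to the next, but the principle is the same. Scrapy's built-in ImagesPipeline names each stored file after the SHA1 hash of its URL. To store files under custom names (a custom name should be meaningful, and that meaning usually lives in the item), we have to subclass ImagesPipeline and override part of it. In the built-in ImagesPipeline class, the method that sets the storage path and file name is file_path. Looking at its source as of this writing, it takes only four parameters: self, request, response, and info. There is no item, so we need some way to get the item's contents into file_path. That is why both get_media_requests and file_path are overridden.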
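
To make the difference concrete, here is a small standalone sketch that contrasts the two naming schemes without needing Scrapy installed. The function names default_style_path and custom_path are made up for this illustration, and the first one only approximates what the built-in pipeline does (SHA1 of the URL under full/).

```python
import hashlib


def default_style_path(url: str) -> str:
    """Approximate the built-in naming: SHA1 hex digest of the URL."""
    image_guid = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return "full/%s.jpg" % image_guid


def custom_path(imagegroup: str, imageindex) -> str:
    """Naming used in this post: one directory per group, index as file name."""
    return "full/%s/%s.jpg" % (imagegroup, imageindex)


url = "http://example.com/pic/1.jpg"  # placeholder URL
print(default_style_path(url))   # full/<40 hex characters>.jpg
print(custom_path("album-a", 1))
```

The hashed name is stable but says nothing about the image, while the custom name keeps the grouping visible on disk.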

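get_media_requests reads image_urls from the item and hands the URLs to the Scrapy engine as requests (in this pipeline only the first URL of each item is requested). As in other articles online, meta is attached to the request, which is how the image group name and index reach file_path().
file_path defines the storage path and file name of the image. There is not much more to say about it.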
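
The hand-off between the two methods can be sketched in plain Python. FakeRequest below is a stand-in for scrapy.http.Request so the example runs without Scrapy. It, the helper names, and the sample item values are all assumptions for illustration only.

```python
from dataclasses import dataclass, field


@dataclass
class FakeRequest:
    """Stand-in for scrapy.http.Request: just a URL plus a meta dict."""
    url: str
    meta: dict = field(default_factory=dict)


def build_request(item: dict) -> FakeRequest:
    # Same idea as get_media_requests: copy the item fields into meta.
    return FakeRequest(
        item["image_urls"][0],
        meta={"imagegroup": item["images"], "imageindex": item["index"]},
    )


def build_path(request: FakeRequest) -> str:
    # Same idea as file_path: read the fields back out of meta.
    return "full/%s/%s.jpg" % (
        request.meta["imagegroup"],
        request.meta["imageindex"],
    )


# Hypothetical item shaped like the ones this pipeline expects.
item = {
    "image_urls": ["http://example.com/pic/1.jpg"],
    "images": "album-a",
    "index": 1,
}
path = build_path(build_request(item))
print(path)  # full/album-a/1.jpg
```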

2. Full source

https://github.com/YangChina/...
