Scrapy files pipeline

Install Scrapy

pip install scrapy

Create a new project

(python36) E:\www>scrapy startproject fileDownload
New Scrapy project 'fileDownload', using template directory 'c:\users\brady\.conda\envs\python36\lib\site-packages\scrapy\templates\project', created in:
    E:\www\fileDownload

You can start your first spider with:
    cd fileDownload
    scrapy genspider example example.com

(python36) E:\www>

 

Edit the spider to extract the image URLs

# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from fileDownload.items import FiledownloadItem


class PexelsSpider(CrawlSpider):
    name = 'pexels'
    allowed_domains = ['www.pexels.com']
    start_urls = ['https://www.pexels.com/photo/white-concrete-building-2559175/']

    rules = (
        Rule(LinkExtractor(allow=r'/photo/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        print(response.url)
        url = response.xpath("//img[contains(@src,'photos')]/@src").extract()
        item = FiledownloadItem()
        try:
            item['file_urls'] = url
            # url is a list, so convert it before concatenating with a string
            print("scraped image list: " + str(url))
            yield item
        except Exception as e:
            print(str(e))
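As a rough illustration of what that XPath selects, here is a stdlib-only sketch (the sample HTML below is invented for demonstration) that collects the `src` of every `<img>` tag whose `src` contains "photos":

```python
# Stdlib-only sketch of the XPath //img[contains(@src,'photos')]/@src:
# collect the src attribute of every <img> whose src contains "photos".
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.srcs = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            src = dict(attrs).get("src") or ""
            if "photos" in src:
                self.srcs.append(src)


# Invented sample markup; only the first <img> matches.
sample = (
    '<img src="https://images.pexels.com/photos/2559175/a.jpg?w=500">'
    '<img src="/static/logo.png">'
)
parser = ImgSrcCollector()
parser.feed(sample)
print(parser.srcs)  # ['https://images.pexels.com/photos/2559175/a.jpg?w=500']
```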

 

Configure the item

 

class FiledownloadItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    file_urls = scrapy.Field()
    files = scrapy.Field()  # populated by FilesPipeline with the download results

 

  

settings.py

Enable the files pipeline:

 

ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 2,  # built-in files pipeline
}

 

FILES_STORE = ''  # storage path

 

In the item:

file_urls = scrapy.Field()

files = scrapy.Field()
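For reference, after a successful download FilesPipeline fills the `files` field with one dict per downloaded file. A sketch of the resulting item (all values invented):

```python
# Illustrative shape of an item after FilesPipeline has run (values invented).
downloaded_item = {
    "file_urls": ["https://images.pexels.com/photos/2559175/a.jpg"],
    "files": [
        {
            "url": "https://images.pexels.com/photos/2559175/a.jpg",
            "path": "full/0a1b2c.jpg",  # path relative to FILES_STORE
            "checksum": "d41d8cd98f00b204e9800998ecf8427e",  # MD5 of the body
        }
    ],
}
print(downloaded_item["files"][0]["path"])
```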

 

In the spider, assign the extracted URLs to file_urls so they are passed on to the pipeline.

 

 

 

Override the files pipeline to save each file under its original name

In pipelines.py, create your own pipeline that inherits from FilesPipeline:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html

from scrapy.pipelines.files import FilesPipeline


class FiledownloadPipeline(object):
    def process_item(self, item, spider):
        tmp = item['file_urls']
        item['file_urls'] = []

        # strip query strings so each image URL resolves to a single file
        for i in tmp:
            if "?" in i:
                item['file_urls'].append(i.split('?')[0])
            else:
                item['file_urls'].append(i)
        print(item)
        return item


class MyFilesPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None):
        # use the last segment of the URL as the saved file name
        file_path = request.url.split('/')[-1]
        print("downloading file " + file_path)
        return 'full/%s' % file_path
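The two transformations above can be exercised without Scrapy. A standalone sketch (the sample URL is invented):

```python
# Standalone sketch of the two transformations used by the pipelines above:
# 1. strip the query string so variants of one image collapse to one URL
# 2. use the last path segment as the saved file name under full/
def strip_query(url):
    return url.split("?")[0] if "?" in url else url


def original_name(url):
    return "full/%s" % url.split("/")[-1]


url = "https://images.pexels.com/photos/2559175/pexels-photo-2559175.jpeg?cs=srgb"
print(strip_query(url))
# https://images.pexels.com/photos/2559175/pexels-photo-2559175.jpeg
print(original_name(strip_query(url)))
# full/pexels-photo-2559175.jpeg
```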

In settings.py, enable your own pipelines instead:

ITEM_PIPELINES = {
    'fileDownload.pipelines.FiledownloadPipeline': 1,
    'fileDownload.pipelines.MyFilesPipeline': 2,
    #'scrapy.pipelines.files.FilesPipeline':2
}
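Scrapy runs pipelines in ascending priority order, so the URL-cleaning pipeline (priority 1) sees each item before the download pipeline (priority 2). A toy simulation of that ordering, with plain functions standing in for the two pipeline classes (not Scrapy's API):

```python
# Toy simulation of pipeline ordering: stages run in ascending priority.
def clean(item):  # stands in for FiledownloadPipeline.process_item (priority 1)
    item["file_urls"] = [u.split("?")[0] for u in item["file_urls"]]
    return item


def download(item):  # stands in for MyFilesPipeline's download step (priority 2)
    item["files"] = [u.split("/")[-1] for u in item["file_urls"]]
    return item


item = {"file_urls": ["https://example.com/photos/a.jpg?w=500"]}
for stage in (clean, download):  # priority 1 runs before priority 2
    item = stage(item)
print(item)
# {'file_urls': ['https://example.com/photos/a.jpg'], 'files': ['a.jpg']}
```

Because the cleaner runs first, the downloader only ever sees URLs with the query string already removed.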

 

GitHub repository

https://github.com/brady-wang/spider-fileDownload
