Install Scrapy
pip install scrapy
Create a new project
(python36) E:\www>scrapy startproject fileDownload
New Scrapy project 'fileDownload', using template directory 'c:\users\brady\.conda\envs\python36\lib\site-packages\scrapy\templates\project', created in:
    E:\www\fileDownload

You can start your first spider with:
    cd fileDownload
    scrapy genspider example example.com

(python36) E:\www>
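The generated project follows the standard Scrapy template layout, roughly:

fileDownload/
    scrapy.cfg
    fileDownload/
        __init__.py
        items.py
        middlewares.py
        pipelines.py
        settings.py
        spiders/
            __init__.py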
Edit the spider to extract the content
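If the spider file has not been created yet, one way to generate it (an assumed step, not shown in the original commands) is to use the crawl template of scrapy genspider:

cd fileDownload
scrapy genspider -t crawl pexels www.pexels.com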
# -*- coding: utf-8 -*-
import scrapy
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule

from fileDownload.items import FiledownloadItem


class PexelsSpider(CrawlSpider):
    name = 'pexels'
    allowed_domains = ['www.pexels.com']
    start_urls = ['https://www.pexels.com/photo/white-concrete-building-2559175/']

    rules = (
        Rule(LinkExtractor(allow=r'/photo/'), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        print(response.url)
        url = response.xpath("//img[contains(@src,'photos')]/@src").extract()
        item = FiledownloadItem()
        try:
            item['file_urls'] = url
            # url is a list; concatenating it to a str would raise TypeError
            # and skip the yield, so pass it as a separate print argument
            print("Scraped image list:", url)
            yield item
        except Exception as e:
            print(str(e))
Configure the item
import scrapy


class FiledownloadItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    file_urls = scrapy.Field()  # the URLs to download
    files = scrapy.Field()      # the FilesPipeline writes download results here
settings.py
Enable the files pipeline
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 2,  # the built-in files pipeline
}
FILES_STORE = ''  # storage path
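For example, a minimal sketch with an assumed storage path (the concrete path is illustrative only, not from the original):

FILES_STORE = 'E:/www/files'

Downloaded files then end up under FILES_STORE, in a full/ subdirectory by default.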
In the item, both fields must be defined:

file_urls = scrapy.Field()
files = scrapy.Field()
In the spider, assign the extracted URLs to file_urls so they are passed through to the pipeline.
Override the files pipeline to save files under their original names
In pipelines.py, create your own image pipeline that inherits from the files pipeline. By default, FilesPipeline names saved files after a hash of their URL, so file_path is overridden below to keep the original image name.
# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
from scrapy.pipelines.files import FilesPipeline


class FiledownloadPipeline(object):
    def process_item(self, item, spider):
        # strip query strings so the pipeline downloads the bare image URL
        tmp = item['file_urls']
        item['file_urls'] = []
        for i in tmp:
            if "?" in i:
                item['file_urls'].append(i.split('?')[0])
            else:
                item['file_urls'].append(i)
        print(item)
        return item


class MyFilesPipeline(FilesPipeline):
    def file_path(self, request, response=None, info=None):
        # keep the last URL segment (the original image name)
        # instead of the default URL hash
        file_path = request.url
        file_path = file_path.split('/')[-1]
        print("Downloading image " + file_path)
        return 'full/%s' % file_path
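Note that returning 'full/%s' keeps the default full/ subdirectory under FILES_STORE; only the file name itself changes from the default URL hash to the original name.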
In settings.py, enable your own pipelines instead:
ITEM_PIPELINES = {
    'fileDownload.pipelines.FiledownloadPipeline': 1,
    'fileDownload.pipelines.MyFilesPipeline': 2,
    # 'scrapy.pipelines.files.FilesPipeline': 2,
}
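The numbers are priorities: lower values run first, so FiledownloadPipeline strips the query strings before MyFilesPipeline downloads the files. The crawl can then be started from the project directory:

scrapy crawl pexels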
GitHub repository
https://github.com/brady-wang/spider-fileDownload