Learning the Python crawler framework Scrapy: downloading images

Documentation: http://scrapy-chs.readthedocs.io/zh_CN/latest/topics/images.html

Worked example. Goal: scrape the product images from the page http://www.hlhua.com/

  1. As the documentation describes, first define an item to hold the image data. For ImagesPipeline to take effect, the item must have a field named image_urls: items.py
    import scrapy

    class MyItem(scrapy.Item):
        image_urls = scrapy.Field()
        image_paths = scrapy.Field()
        images = scrapy.Field()
  2. Subclass ImagesPipeline to write your own image pipeline: pipelines.py
    import scrapy
    from scrapy.exceptions import DropItem
    from scrapy.pipelines.images import ImagesPipeline

    class MyImageDownloadPipeLine(ImagesPipeline):

        def get_media_requests(self, item, info):
            for image_url in item['image_urls']:
                yield scrapy.Request(image_url)

        def item_completed(self, results, item, info):
            image_paths = [x['path'] for ok, x in results if ok]
            if not image_paths:
                raise DropItem("Item contains no images")
            item['image_paths'] = image_paths
            return item
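
To make the list comprehension in item_completed easier to follow, here is a minimal sketch of the `results` argument it receives, assuming the structure described in the Scrapy documentation: a list of (success, detail) tuples, one per requested URL. The URLs, path, and checksum below are made-up sample values, and a plain Exception stands in for the Twisted Failure object that Scrapy would pass for a failed download.

```python
# Hypothetical sample of the `results` argument passed to item_completed.
results = [
    # Successful download: the second element is a dict describing the file.
    (True, {
        "url": "http://www.example.com/images/product1.jpg",
        "path": "full/0a79c461a4062ac383dc4fade7bc09f1384a3910.jpg",
        "checksum": "2b00042f7481c7b056c4b410d28f33cf",
    }),
    # Failed download: the second element is an error object, not a dict.
    (False, Exception("download failed")),
]

# Same expression as in item_completed: keep the path of each successful download.
image_paths = [detail["path"] for ok, detail in results if ok]
print(image_paths)
```

Only the successful entry contributes a path. If every download fails, the list is empty and the pipeline above raises DropItem.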

The overridden item_completed stores the downloaded file paths in the item's image_paths field once the downloads have finished.

  3. Edit settings.py to enable MyImageDownloadPipeLine: settings.py
    # coding=utf-8
    BOT_NAME = 'imagedemo'

    SPIDER_MODULES = ['imagedemo.spiders']
    NEWSPIDER_MODULE = 'imagedemo.spiders'

    # Enable the custom image pipeline
    ITEM_PIPELINES = {'imagedemo.pipelines.MyImageDownloadPipeLine': 1}
    # Directory where the downloaded image files are stored
    IMAGES_STORE = 'image'

    ROBOTSTXT_OBEY = True
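
The image pipeline reads a few more optional settings that are not part of the original project. The fragment below is a sketch of how they could be added to settings.py; the setting names come from the Scrapy documentation, while the values are illustrative assumptions. Note that the image pipeline needs the Pillow library to be installed.

```python
# Optional additions to settings.py; all values are illustrative assumptions.

# Skip re-downloading images that were fetched within the last 90 days
IMAGES_EXPIRES = 90

# Also generate thumbnails, stored under <IMAGES_STORE>/thumbs/<name>/
IMAGES_THUMBS = {
    'small': (50, 50),
    'big': (270, 270),
}

# Ignore images smaller than this size (in pixels)
IMAGES_MIN_HEIGHT = 110
IMAGES_MIN_WIDTH = 110
```
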
  4. Write the spider that implements the crawl logic: spider.py
    # coding=utf-8
    from scrapy.spiders import Spider
    from imagedemo.items import MyItem

    class ImageSpider(Spider):
        name = 'hlhua'
        start_urls = ['http://www.hlhua.com/']

        def parse(self, response):
            # inspect_response(response, self)  # debugging aid, left disabled
            for src in response.xpath("//img[@class='goodsimg']/@src").extract():
                item = MyItem()
                # urljoin makes a relative src absolute and leaves absolute URLs unchanged
                item['image_urls'] = [response.urljoin(src)]
                yield item
  5. Run scrapy crawl hlhua -o images.json. The images are downloaded into image/full/, and images.json is generated with a record of the image information.
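
The files in image/full/ get hash-like names rather than their original names. As far as the Scrapy documentation describes it, the default name is the SHA-1 hex digest of the image URL with a .jpg extension. The sketch below mimics that scheme with the standard library only; default_image_path is a hypothetical helper written for illustration, not a Scrapy function, and the URL is a made-up example.

```python
import hashlib


def default_image_path(url):
    """Mimic the default naming scheme: full/<SHA-1 hex digest of the URL>.jpg"""
    image_guid = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return "full/{}.jpg".format(image_guid)


# Hypothetical URL used only for illustration
path = default_image_path("http://www.example.com/images/product1.jpg")
print(path)
```

Because the name depends only on the URL, the same image URL always maps to the same file, which lets you match an entry in images.json to its file on disk.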

github: https://github.com/chenglp1215/scrapy_demo/tree/master/imagedemo
