A Summary of Image-Scraping Methods in Python

Date: 2021-02-16

1. The most common ways to scrape images

The most obvious way to download images is with the urllib library or the requests library. Both approaches are shown below:

1.1 urllib

Use urllib.request.urlretrieve, which downloads an image given its URL and the filename to save it under.

'''
Signature: request.urlretrieve(url, filename=None, reporthook=None, data=None)
Docstring:
Retrieve a URL into a temporary location on disk.

Requires a URL argument. If a filename is passed, it is used as
the temporary file location. The reporthook argument should be
a callable that accepts a block number, a read size, and the
total file size of the URL target. The data argument should be
valid URL encoded data.

If a filename is passed and the URL points to a local resource,
the result is a copy from local file to new file.

Returns a tuple containing the path to the newly created
data file as well as the resulting HTTPMessage object.
File:      ~/anaconda/lib/python3.6/urllib/request.py
Type:      function
'''

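The filename parameter specifies the local path to save to (if it is omitted, urllib creates a temporary file to hold the data).
The reporthook parameter is a callback that fires once when the connection to the server is established and again after each data block is transferred; it can be used to display the current download progress.
The data parameter is the data to POST to the server. The function returns a two-element tuple (filename, headers): filename is the local path the data was saved to, and headers is the server's response headers.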

Example usage:

from urllib import request
request.urlretrieve('https://img3.doubanio.com/view/photo/photo/public/p454345512.jpg', 'kids.jpg')
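
The reporthook parameter described above can drive a simple progress display. The following is a minimal sketch rather than part of the original article: the helper names progress_percent and show_progress are hypothetical, and a percentage is only available when the server sends a Content-Length header (otherwise urlretrieve reports a total size of -1).

```python
import sys


def progress_percent(block_num, block_size, total_size):
    """Return the download progress in percent, or None if the size is unknown."""
    if total_size <= 0:
        # urlretrieve passes -1 when there is no Content-Length header;
        # a size of 0 is also skipped to avoid dividing by zero.
        return None
    # The last block is usually only partly filled, so cap at the total size.
    downloaded = min(block_num * block_size, total_size)
    return downloaded * 100.0 / total_size


def show_progress(block_num, block_size, total_size):
    """Callback taking the (block number, block size, total size) arguments urlretrieve passes."""
    percent = progress_percent(block_num, block_size, total_size)
    if percent is None:
        sys.stdout.write('\rdownloaded about %d bytes' % (block_num * block_size))
    else:
        sys.stdout.write('\rdownloaded %.1f%%' % percent)
    sys.stdout.flush()
```

The callback would be passed as request.urlretrieve(url, 'kids.jpg', reporthook=show_progress).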

However, the request may well return a 403 (Forbidden) error, for example with http://www.qnong.com.cn/uploa.... Stack Overflow explains why: "This website is blocking the user-agent used by urllib, so you need to change it in your request."

Adding a User-Agent to urlretrieve is somewhat cumbersome. It can be done as follows:

from urllib import request

opener = request.build_opener()
headers = ('User-Agent', 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:53.0) Gecko/20100101 Firefox/53.0')
opener.addheaders = [headers]
request.install_opener(opener)
request.urlretrieve('http://www.qnong.com.cn/uploadfile/2016/0416/20160416101815887.jpg', './dog.jpg')
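
Note that install_opener changes the default opener for every later urllib call in the process. An alternative, sketched here under the assumption that a per-request header is enough, is to build a urllib.request.Request with its own headers and stream the response to disk. The helper names build_image_request and download_image are hypothetical.

```python
import shutil
from urllib import request

# Assumed browser-like User-Agent string; any value the site accepts would do.
USER_AGENT = ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:53.0) '
              'Gecko/20100101 Firefox/53.0')


def build_image_request(url):
    """Create a Request that carries the User-Agent for this one call only."""
    return request.Request(url, headers={'User-Agent': USER_AGENT})


def download_image(url, path):
    """Stream the response body of url into the local file at path."""
    with request.urlopen(build_image_request(url)) as resp, open(path, 'wb') as out:
        shutil.copyfileobj(resp, out)
    return path
```

Calling download_image(url, './dog.jpg') would then play the same role as the urlretrieve call above, without touching global state.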

1.2 requests

Fetch the image with requests.get(), setting the stream parameter to True so that the body is downloaded in chunks.

import requests

req = requests.get('http://www.qnong.com.cn/uploadfile/2016/0416/20160416101815887.jpg', stream=True)

with open('dog.jpg', 'wb') as wr:
    for chunk in req.iter_content(chunk_size=1024):
        if chunk:
            wr.write(chunk)
            wr.flush()

Adding a User-Agent with requests is easy: just pass the headers parameter.
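
A minimal sketch of the headers parameter in use follows. The function name download_image, its optional session argument, and the 10-second timeout are assumptions made for illustration, not something the original article prescribes.

```python
import requests

# Assumed browser-like User-Agent string.
HEADERS = {
    'User-Agent': ('Mozilla/5.0 (Macintosh; Intel Mac OS X 10.12; rv:53.0) '
                   'Gecko/20100101 Firefox/53.0'),
}


def download_image(url, path, session=None):
    """Download url to path in 1 KB chunks, sending a custom User-Agent."""
    # Use the given session (for connection reuse) or the requests module itself.
    client = session or requests
    resp = client.get(url, headers=HEADERS, stream=True, timeout=10)
    resp.raise_for_status()  # turn 4xx/5xx responses such as 403 into exceptions
    with open(path, 'wb') as wr:
        for chunk in resp.iter_content(chunk_size=1024):
            if chunk:  # skip empty keep-alive chunks
                wr.write(chunk)
    return path
```

Passing a requests.Session as session reuses the connection when many images come from the same host.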

2. Methods supported by Scrapy

2.1 ImagesPipeline

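Scrapy ships with ImagesPipeline and FilesPipeline for downloading images and files. The simplest way to use ImagesPipeline requires only some configuration in settings (ImagesPipeline also needs the Pillow library to be installed).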

# settings.py
ITEM_PIPELINES = {
    'scrapy.pipelines.images.ImagesPipeline': 500
}

IMAGES_STORE = 'pictures'  # directory where images are stored
IMAGES_MIN_HEIGHT = 400  # filter out images smaller than 600x400
IMAGES_MIN_WIDTH = 600

# items.py
import scrapy

class PictureItem(scrapy.Item):
    image_urls = scrapy.Field()

# myspider.py
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


from ..items import PictureItem

class PicSpider(CrawlSpider):
    name = 'pic'
    allowed_domains = ['qnong.com.cn']
    start_urls = ['http://www.qnong.com.cn/']

    rules = (
        Rule(LinkExtractor(allow=r'.*?', restrict_xpaths=('//a[@href]')), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        for img_url in response.xpath('//img/@src').extract():
            item = PictureItem()
            item['image_urls'] = [response.urljoin(img_url)]
            yield item

2.2 Custom Pipeline

By default, when images are downloaded with the ImagesPipeline component, each image is saved under a name derived from the SHA1 hash of its URL.

For example:
Image URL: http://www.example.com/image.jpg
SHA1 hash: 3afec3b4765f8f0a07b78f98c07b83f013567a0a
Resulting image name: 3afec3b4765f8f0a07b78f98c07b83f013567a0a.jpg
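
The naming rule above can be reproduced with the standard library. This is only a sketch: it assumes the hash is taken over the UTF-8 bytes of the URL and that images land in a full/ subdirectory of IMAGES_STORE, and the function name default_image_path is hypothetical.

```python
import hashlib


def default_image_path(url):
    """Build a 'full/<sha1-of-url>.jpg' path following the default naming rule."""
    digest = hashlib.sha1(url.encode('utf-8')).hexdigest()
    return 'full/%s.jpg' % digest


print(default_image_path('http://www.example.com/image.jpg'))
```

Because the name depends only on the URL, the same image URL always maps to the same file, but the name says nothing about the image's content.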

To use custom image filenames, override the file_path method of ImagesPipeline. See: https://doc.scrapy.org/en/lat....

# settings.py
ITEM_PIPELINES = {
    'qnong.pipelines.MyImagesPipeline': 500,
}

# items.py
import scrapy

class PictureItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()
    image_paths = scrapy.Field()

# myspider.py
from scrapy.linkextractors import LinkExtractor
from scrapy.spiders import CrawlSpider, Rule


from ..items import PictureItem

class PicSpider(CrawlSpider):
    name = 'pic'
    allowed_domains = ['qnong.com.cn']
    start_urls = ['http://www.qnong.com.cn/']

    rules = (
        Rule(LinkExtractor(allow=r'.*?', restrict_xpaths=('//a[@href]')), callback='parse_item', follow=True),
    )

    def parse_item(self, response):
        for img_url in response.xpath('//img/@src').extract():
            item = PictureItem()
            item['image_urls'] = [response.urljoin(img_url)]
            yield item

# pipelines.py
from scrapy.exceptions import DropItem
from scrapy.pipelines.images import ImagesPipeline
import scrapy

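class MyImagesPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Schedule one download request per image URL on the item.
        for img_url in item['image_urls']:
            yield scrapy.Request(img_url)

    def item_completed(self, results, item, info):
        # results is a list of (success, info) tuples; keep the paths of successful downloads.
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem('Item contains no images')
        item['image_paths'] = image_paths
        return item

    def file_path(self, request, response=None, info=None, *, item=None):
        # Name the file after the last URL segment instead of the SHA1 hash.
        # The keyword-only item argument is passed by newer Scrapy versions.
        image_guid = request.url.split('/')[-1]
        return 'full/%s' % image_guid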
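
One caveat with request.url.split('/')[-1] in the file_path override above: it keeps any query string (for example image.jpg?size=large) and returns an empty name for a URL that ends in a slash. A more defensive helper could look like the sketch below. The name filename_from_url and the fallback 'unnamed.jpg' are hypothetical.

```python
import posixpath
from urllib.parse import unquote, urlparse


def filename_from_url(url, default='unnamed.jpg'):
    """Return the last path segment of url, ignoring any query string or fragment."""
    path = unquote(urlparse(url).path)  # decode %20 and similar escapes
    name = posixpath.basename(path)
    return name or default  # fall back when the URL ends in '/'


print(filename_from_url('http://www.example.com/a/b/image.jpg?size=large'))
```

Inside file_path this would be used as 'full/%s' % filename_from_url(request.url). Images that share a base name would still overwrite one another, so the directory part of the URL may need to be included as well.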

2.3 How FilesPipeline and ImagesPipeline work

FilesPipeline

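In a spider, you scrape an item and put the URLs of the files to download into its file_urls field.
The item is returned from the spider and enters the item pipeline.
When the item reaches FilesPipeline, the URLs in the file_urls field are scheduled for download by Scrapy's standard scheduler and downloader (which means the scheduler and downloader middlewares are reused), but at a higher priority, so they are processed before other pages are scraped. The item stays "locked" at that pipeline stage until the files have finished downloading (or have failed for some reason).
Once the files are downloaded, another field (files) is populated with the results. It holds a list of dicts with information about each downloaded file, such as the download path, the original scraped URL (taken from the file_urls field), and the file's checksum. The entries in files keep the same order as the original file_urls field. If a file fails to download, an error is logged and that file does not appear in the files field.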
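
Enabling FilesPipeline mirrors the ImagesPipeline settings shown in section 2.1. The fragment below is a sketch: the directory name 'downloads' and the priority 500 are assumed values. The item class would declare file_urls and files fields in the same way PictureItem declares image_urls and images.

```python
# settings.py (sketch)
ITEM_PIPELINES = {
    'scrapy.pipelines.files.FilesPipeline': 500,
}

FILES_STORE = 'downloads'  # assumed directory name for downloaded files
```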

ImagesPipeline

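In a spider, you scrape an item and put the URLs of its images into the image_urls field.
The item is returned from the spider and enters the item pipeline.
When the item reaches ImagesPipeline, the URLs in the image_urls field are scheduled for download by Scrapy's standard scheduler and downloader (which means the scheduler and downloader middlewares are reused), but at a higher priority, so they are processed before other pages are scraped. The item stays "locked" at that pipeline stage until the images have finished downloading (or have failed for some reason).
Once the images are downloaded, another field (images) is populated with the results. It holds a list of dicts with information about each downloaded image, such as the download path, the original scraped URL (taken from the image_urls field), and the image's checksum. The entries in images keep the same order as the original image_urls field. If an image fails to download, an error is logged and that image does not appear in the images field.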
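
The per-download dictionaries described above are also what item_completed receives in its results argument (see section 2.2), paired with a success flag. The sketch below uses made-up data to show that shape: the URL, path, and checksum are placeholders, and split_results is a hypothetical helper.

```python
def split_results(results):
    """Separate (success, info) pairs into downloaded paths and a failure count."""
    paths = [info['path'] for ok, info in results if ok]
    failed = sum(1 for ok, _ in results if not ok)
    return paths, failed


# Illustrative data only: the URL, path, and checksum values are made up.
sample_results = [
    (True, {'url': 'http://www.example.com/a.jpg',
            'path': 'full/a.jpg',
            'checksum': '0' * 32}),
    (False, Exception('download failed')),  # Scrapy passes a Failure object here
]

paths, failed = split_results(sample_results)
print(paths, failed)
```

Only successful downloads contribute a path, which is why a failed image never shows up in the images field.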