Scraping 360 Images with Scrapy

# Today's Goal

**Scraping 360 Images with Scrapy**

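Today we are going to scrape the "beauty" channel of 360 Images (image.so.com). Inspecting the page shows that its content is loaded dynamically, so the first step is to find the pattern in the request URLs that return the image data.
After that, we use Scrapy's ImagesPipeline class to download the images.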
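The URL pattern can be sketched on its own before it goes into a spider. This is a minimal sketch that assumes the same endpoint and the same step of 30 per page as the spider in this post; `build_page_urls` is a hypothetical helper name, not part of Scrapy, and reading `sn` as an offset is my own interpretation of the pattern.

```python
# Hypothetical helper illustrating the pagination pattern; not part of Scrapy.
URL_TEMPLATE = 'http://image.so.com/zjl?ch=beauty&sn={}&listtype=new&temp=1'


def build_page_urls(pages, page_size=30):
    """Return one URL per page.

    `sn` is assumed to be the offset of the first image on the page,
    so it grows by `page_size` from one page to the next.
    """
    return [URL_TEMPLATE.format(page * page_size) for page in range(pages)]


if __name__ == '__main__':
    for url in build_page_urls(5):
        print(url)
```

Only the `sn` value changes between pages, which is why a single format string is enough.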

*Implementation*
so.py 
```
# -*- coding: utf-8 -*-
import scrapy
import json
from ..items import SoItem

class SoSpider(scrapy.Spider):
    name = 'so'
    allowed_domains = ['image.so.com']

    # Override start_requests to generate the page URLs ourselves
    def start_requests(self):
        url = 'http://image.so.com/zjl?ch=beauty&sn={}&listtype=new&temp=1'
        # Build the URLs for 5 pages and hand them to the scheduler
        for i in range(5):
            sn = i*30
            full_url = url.format(sn)
            yield scrapy.Request(
                url = full_url,
                callback = self.parse_image,
                dont_filter=False
            )

    def parse_image(self,response):
        # The endpoint returns JSON, not HTML
        html = json.loads(response.text)
        # Extract the image links
        for img in html['list']:
            item = SoItem()
            item['img_link'] = img['qhimg_url']

            yield item
```
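To see what `parse_image` does without running a crawl, here is a small sketch against a made-up payload. The sample JSON is an assumption that only mirrors the two keys the spider reads (`list` and `qhimg_url`); the real response presumably carries more fields. `extract_image_links` is a hypothetical helper, not something Scrapy provides.

```python
import json

# Made-up payload: only the keys the spider reads are included.
SAMPLE_RESPONSE = '''
{"list": [
    {"qhimg_url": "http://example.com/a.jpg"},
    {"qhimg_url": "http://example.com/b.jpg"}
]}
'''


def extract_image_links(text):
    """Parse a JSON response body and collect every image link in it."""
    data = json.loads(text)
    # Fall back to an empty list if the payload has no 'list' key,
    # and skip entries that carry no 'qhimg_url'.
    return [img['qhimg_url'] for img in data.get('list', []) if 'qhimg_url' in img]


if __name__ == '__main__':
    for link in extract_image_links(SAMPLE_RESPONSE):
        print(link)
```

Unlike the spider, this version tolerates a missing `list` key, which may be worth copying if the endpoint ever returns an empty page.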
items.py
```
import scrapy

class SoItem(scrapy.Item):
    # define the fields for your item here like:
    # Image link
    img_link = scrapy.Field()
```
pipelines.py
```
# Import Scrapy's image pipeline class
from scrapy.pipelines.images import ImagesPipeline
import scrapy

# 1. Subclass ImagesPipeline
# 2. Override its methods
class SoPipeline(ImagesPipeline):
    def get_media_requests(self, item, info):
        # Hand the image link to the scheduler
        yield scrapy.Request(url = item['img_link'],dont_filter=False)
```
settings.py
```
# Obey robots.txt rules
ROBOTSTXT_OBEY = False

CONCURRENT_REQUESTS = 10

DEFAULT_REQUEST_HEADERS = {
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
    'Accept-Language': 'en',
    'User-Agent': 'Mozilla/5.0',
}

ITEM_PIPELINES = {
    'So.pipelines.SoPipeline': 300,
}

IMAGES_STORE = '/home/ccc/image/'  # my save path
```
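Once the crawl finishes, it helps to know where the files end up. As far as I know, the default `ImagesPipeline` names each file after the SHA-1 hash of its URL and stores it in a `full/` subdirectory of `IMAGES_STORE`. The sketch below reproduces that naming with the standard library only; `default_image_path` is a hypothetical helper and the example URL is made up.

```python
import hashlib
import os


def default_image_path(url, images_store='/home/ccc/image/'):
    """Approximate where the default ImagesPipeline saves the image for `url`.

    Assumption: the file name is the SHA-1 hex digest of the URL plus '.jpg',
    placed under '<IMAGES_STORE>/full/'.
    """
    image_guid = hashlib.sha1(url.encode('utf-8')).hexdigest()
    return os.path.join(images_store, 'full', image_guid + '.jpg')


if __name__ == '__main__':
    # Made-up URL, only for illustration.
    print(default_image_path('http://example.com/a.jpg'))
```

Note that `ImagesPipeline` also needs the Pillow library installed, otherwise the pipeline will not be enabled.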