Scrapy Study Notes (3): Saving Data with Items and Pipelines

Date: 2019-11-09


I've been insanely busy lately...

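The previous post used MongoDB directly inside the spider. That isn't a great approach; in Scrapy, using items is the proper way.
Define what you need to save in an item, then handle the item in a pipeline, and the crawl flow becomes: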

Scrape --> collect the needed data according to the item definition --> process with a pipeline (storage, etc.)

Define the item: in items.py, declare the content to scrape.

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class GetquotesItem(scrapy.Item):
    # define the fields for your item here like:
    # Define the content we need to scrape:
    # 1. the quote text
    # 2. the author
    # 3. the tags
    content = scrapy.Field()
    author = scrapy.Field()
    tags = scrapy.Field()

We keep the database configuration in settings.py so it is easy to look up when connecting to the database:

MONGODB_HOST = 'localhost'
MONGODB_PORT = 27017
MONGODB_DBNAME = 'store_quotes2'
MONGODB_TABLE = 'quotes2'

Also, be sure to uncomment the pipeline setting in settings.py, otherwise the pipeline will have no effect:

#ITEM_PIPELINES = {
#    'getquotes.pipelines.SomePipeline': 300,
#}

Change it to:

ITEM_PIPELINES = {
    'getquotes.pipelines.GetquotesPipeline': 300,
}
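
The number 300 sets the pipeline's position in the chain: Scrapy passes each item through the enabled pipelines from the lowest value to the highest, and the values are conventionally picked from the 0-1000 range. As a rough sketch of what that means, suppose a second, purely hypothetical `DropEmptyQuotesPipeline` were registered ahead of the storage pipeline. That class name and the `pipeline_order` helper are assumptions for illustration only; neither is part of this project or of Scrapy.

```python
# Hypothetical settings.py fragment: DropEmptyQuotesPipeline is an assumed
# name used only to illustrate ordering; it is not defined in this project.
ITEM_PIPELINES = {
    'getquotes.pipelines.GetquotesPipeline': 300,
    'getquotes.pipelines.DropEmptyQuotesPipeline': 200,
}


def pipeline_order(pipelines):
    """Return pipeline paths in the order items pass through them
    (ascending by their integer value). Illustration only, not a Scrapy API."""
    return sorted(pipelines, key=pipelines.get)


if __name__ == '__main__':
    for path in pipeline_order(ITEM_PIPELINES):
        print(ITEM_PIPELINES[path], path)
```

With these values an item would reach the filtering pipeline (200) before the storage pipeline (300), so anything dropped early would never be written to the database.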

Now define how items are processed in pipelines.py:

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html

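# Load the project settings so the values defined in settings.py can be used.
# (The old `from scrapy.conf import settings` import is gone from recent
# Scrapy releases, so the settings are loaded explicitly here.)
from scrapy.utils.project import get_project_settings
import pymongo

settings = get_project_settings()

class GetquotesPipeline(object):

    # Connect to the database
    def __init__(self):

        # Read the database connection info
        host = settings['MONGODB_HOST']
        port = settings['MONGODB_PORT']
        dbname = settings['MONGODB_DBNAME']
        client = pymongo.MongoClient(host=host, port=port)

        # Select the database and the collection
        db = client[dbname]
        self.table = db[settings['MONGODB_TABLE']]

    # Process an item
    def process_item(self, item, spider):
        # Convert the item to a dict, then insert it into the database.
        # insert_one replaces Collection.insert, which PyMongo 4 removed.
        quote_info = dict(item)
        self.table.insert_one(quote_info)
        return item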
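
The pipeline above builds its connection in `__init__` from a module-level settings object and never closes the client. As a minimal sketch of an alternative, Scrapy's `from_crawler`, `open_spider` and `close_spider` hooks can be used to receive the settings and to tie the connection to the spider's lifetime. The class name `MongoQuotesPipeline` and the fallback defaults are assumptions made for this example, not something from the original project.

```python
class MongoQuotesPipeline:
    """Sketch of a MongoDB pipeline configured through from_crawler.

    The class name and the default values are assumptions for illustration.
    """

    def __init__(self, host, port, dbname, table_name):
        self.host = host
        self.port = port
        self.dbname = dbname
        self.table_name = table_name
        self.client = None
        self.table = None

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls this hook with the running crawler, whose settings
        # carry the values defined in settings.py.
        s = crawler.settings
        return cls(
            host=s.get('MONGODB_HOST', 'localhost'),
            port=s.get('MONGODB_PORT', 27017),
            dbname=s.get('MONGODB_DBNAME', 'store_quotes2'),
            table_name=s.get('MONGODB_TABLE', 'quotes2'),
        )

    def open_spider(self, spider):
        # Imported here so the class can be loaded without pymongo installed.
        import pymongo
        self.client = pymongo.MongoClient(host=self.host, port=self.port)
        self.table = self.client[self.dbname][self.table_name]

    def close_spider(self, spider):
        # Release the connection once the spider has finished.
        if self.client is not None:
            self.client.close()

    def process_item(self, item, spider):
        # Store a plain-dict copy of the item and pass the item along.
        self.table.insert_one(dict(item))
        return item
```

To try it, the class would have to live in pipelines.py and `ITEM_PIPELINES` would have to point at it instead of `GetquotesPipeline`.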

Accordingly, the code in myspider.py changes a little:

import scrapy
import pymongo

# Don't forget to import the item we defined
from getquotes.items import GetquotesItem

class myspider(scrapy.Spider):

    # Spider name
    name = "get_quotes"

    # Start URL
    start_urls = ['http://quotes.toscrape.com']

    '''
        # Configure the client; default address localhost, port 27017
        client = pymongo.MongoClient('localhost',27017)
        # Create a database named store_quotes
        db_name = client['store_quotes']
        # Create a collection
        quotes_list = db_name['quotes']
    '''
    def parse(self, response):

        # Select elements with CSS selectors; BeautifulSoup or similar works too if you prefer
        # First locate each whole quote block, then extract the author, quote text and tags inside it
        for quote in response.css('.quote'):
            '''
            # Store the scraped data in mongodb using insert
            yield self.quotes_list.insert({
                'author' : quote.css('small.author::text').extract_first(),
                'tags' : quote.css('div.tags a.tag::text').extract(),
                'content' : quote.css('span.text::text').extract_first()
            })
            '''
            item = GetquotesItem()
            item['author'] = quote.css('small.author::text').extract_first()
            item['content'] = quote.css('span.text::text').extract_first()
            item['tags'] = quote.css('div.tags a.tag::text').extract()
            yield item


        # Use xpath to get the href attribute of the next button
        next_href = response.xpath('//li[@class="next"]/a/@href').extract_first()
        # Check whether next_href exists
        if next_href is not None:

            # If the next-page href exists, build the next page's URL with urljoin:
            # http://quotes.toscrape.com/page/2/
            next_page = response.urljoin(next_href)

            # Call back into parse to handle the next page's URL
            yield scrapy.Request(next_page,callback=self.parse)
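
For reference, `response.urljoin` resolves the `href` against the URL of the current page, much as the standard library's `urllib.parse.urljoin` does. Here is a small sketch of that resolution, assuming the next button carries a site-relative link such as `/page/2/`. The href values and the helper name `next_page_url` are assumptions for illustration, not captured from a crawl.

```python
from urllib.parse import urljoin


def next_page_url(current_url, next_href):
    """Resolve the next-page href against the current page URL.

    Returns None when there is no next button (next_href is None).
    """
    if next_href is None:
        return None
    return urljoin(current_url, next_href)


if __name__ == '__main__':
    print(next_page_url('http://quotes.toscrape.com', '/page/2/'))
    print(next_page_url('http://quotes.toscrape.com/page/2/', '/page/3/'))
    print(next_page_url('http://quotes.toscrape.com/page/10/', None))
```

When the href is missing the helper returns None, which mirrors the `is not None` check in `parse`: no further request is scheduled and the crawl ends.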

You can see in Scrapy's output that the pipeline is enabled.