Persistent storage in Scrapy

Date: 2019-11-19

Tags: scrapy, persistent storage. Category: Python


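Persistent storage operations:

a. Disk files

  a) Based on a terminal command

    i. Make sure the parse method returns an iterable object (holding the parsed page content)

    ii. Use a terminal command to store the data in a specified disk file

      1. scrapy crawl <spider name> -o <file name>.<extension> (for example test.csv)

  b) Based on pipelines

    i. items: holds the parsed page data

    ii. pipelines: handles the persistence operations

    iii. Implementation flow:

      1. Store the parsed page data in an item object

      2. Use the yield keyword to submit the item to the pipeline file for processing

      3. Write the storage code in the pipeline file (pipelines.py)

      4. Enable the pipeline in the settings file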
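Going back to approach a): the command there (for this project, presumably `scrapy crawl qiushibai -o test.csv`) turns each record returned by parse into one row of the output file. The sketch below is a plain-Python analogue of that idea, not Scrapy's own exporter. `parse_stub`, `rows_to_csv` and the sample rows are made up for illustration; the point is only that an iterable of dict-like records maps naturally onto a header row plus one row per record.

```python
import csv
import io


def parse_stub():
    """Stand-in for a spider's parse(): a generator is an iterable of records."""
    # Hypothetical sample rows; a real spider would extract these from the response.
    yield {"author": "alice", "content": "first joke"}
    yield {"author": "bob", "content": "second joke"}


def rows_to_csv(rows):
    """Write dict records as CSV text: one header row, then one row per record."""
    rows = list(rows)
    if not rows:
        return ""
    buffer = io.StringIO()
    # Column names come from the keys of the first record (an assumption of this sketch).
    writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()), lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()


csv_text = rows_to_csv(parse_stub())
print(csv_text)
```

If parse returned a single string or None instead of an iterable of records, there would be nothing row-shaped to export, which is why step i. insists on an iterable.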

The code for approach b), the pipeline-based one, is as follows:

spiders/qiushibai.py

# -*- coding: utf-8 -*-
import scrapy
from qiubai.items import QiubaiItem


class QiushibaiSpider(scrapy.Spider):
    name = 'qiushibai'
    # allowed_domains = ['www.qiushibaike.com/text/']
    start_urls = ['http://www.qiushibaike.com/text//']

    def parse(self, response):
        # Using XPath to parse the target content is recommended (the framework has a built-in XPath interface)
        # Extract each joke's content and author
        div_list = response.xpath('//div[@id="content-left"]/div')
        # data_list = []
        for div in div_list:
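            # The content matched by xpath is stored in Selector objects
            # extract() returns the data held by those Selector objects as a list of strings
            # author = div.xpath("./div/a[2]/h2/text()").extract()[0]
            # extract_first() is like extract()[0], but returns None instead of raising IndexError when nothing matches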
            author = div.xpath("./div/a[2]/h2/text()").extract_first()
            content = div.xpath('.//div[@class="content"]/span/text()').extract_first()
            # print(author, '---------------')
            # print(content)
            # data = {
            #     "author": author,
            #     "content": content
            # }

            # Store the parsed values in the item object
            item = QiubaiItem()
            item["author"] = author
            item["content"] = content
            # Submit the item object to the pipeline
            yield item

qiubai/items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class QiubaiItem(scrapy.Item):
    # define the fields for your item here like:
    author = scrapy.Field()
    content = scrapy.Field()

qiubai/pipelines.py  # first enable the pipeline in settings.py (search for "pipeline"); the number 300 is its priority

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


class QiubaiPipeline(object):
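    fp = None

    # Called only once, when the spider starts running
    def open_spider(self, spider):
        print("Spider started")
        self.fp = open("./data_record.txt", "w", encoding="utf-8")

    # Receives each item submitted by the spider file and persists the page data stored in it
    # The item parameter is the received item object
    # Runs once for every item the spider submits to the pipeline
    def process_item(self, item, spider):
        # Read the stored values from the item; extract_first() may have returned None
        author = item["author"] or ""
        content = item["content"] or ""
        # Persist to disk
        self.fp.write(author + ":" + content + "\n\n\n")
        return item

    # Called only once, when the spider finishes
    def close_spider(self, spider):
        print("Spider finished")
        self.fp.close()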
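
The three pipeline methods are called in a fixed order: open_spider once, process_item once per submitted item, close_spider once. The sketch below mimics that order without Scrapy so the file output can be checked in isolation. `FilePipeline`, `run_pipeline`, the `path` constructor argument and the sample items are hypothetical stand-ins; in a real project Scrapy itself makes these calls.

```python
import os
import tempfile


class FilePipeline:
    """Same shape as the file pipeline, but with the output path passed in (an assumption)."""

    def __init__(self, path):
        self.path = path
        self.fp = None

    def open_spider(self, spider):
        # Runs once before any item is processed.
        self.fp = open(self.path, "w", encoding="utf-8")

    def process_item(self, item, spider):
        # Missing fields become empty strings so the concatenation cannot fail on None.
        author = item.get("author") or ""
        content = item.get("content") or ""
        self.fp.write(author + ":" + content + "\n\n\n")
        return item

    def close_spider(self, spider):
        # Runs once after the last item.
        self.fp.close()


def run_pipeline(pipeline, items, spider=None):
    """Call the pipeline hooks in the open / process-per-item / close order."""
    pipeline.open_spider(spider)
    try:
        return [pipeline.process_item(item, spider) for item in items]
    finally:
        pipeline.close_spider(spider)


items = [
    {"author": "alice", "content": "first joke"},
    {"author": None, "content": "no author found"},
]
path = os.path.join(tempfile.mkdtemp(), "data_record.txt")
processed = run_pipeline(FilePipeline(path), items)

with open(path, encoding="utf-8") as fp:
    written = fp.read()
print(written)
```

Returning the item from process_item matters when several pipelines are enabled, since the returned object is what the next pipeline receives.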

MySQL-based storage:

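    i. items: holds the parsed page data

    ii. pipelines: handles the persistence operations

    iii. Implementation flow:

      1. Store the parsed page data in an item object

      2. Use the yield keyword to submit the item to the pipeline file for processing

      3. Write the storage code in the pipeline file (pipelines.py)

      4. Enable the pipeline in the settings file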
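
Step 4 comes down to one entry in the project's settings.py. A minimal sketch, assuming the project package is named `qiubai` (as the spider's import suggests) and the pipeline class lives in `qiubai/pipelines.py`; adjust the dotted path if your layout differs.

```python
# settings.py (fragment)
# Keys are dotted paths to pipeline classes; values are priorities,
# conventionally in the 0-1000 range. Lower numbers run first.
ITEM_PIPELINES = {
    "qiubai.pipelines.QiubaiPipeline": 300,  # assumed module path for this project
}
```

Without this entry the pipeline class is never instantiated, so yielded items are not stored.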

pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html
import pymysql


class QiubaiPipeline(object):
    conn = None
    cursor = None

    # Called only once, when the spider starts running
    def open_spider(self, spider):
        print("Spider started")
        self.conn = pymysql.Connect(host="127.0.0.1", port=3306, user="root", password="1228", db="qiubai")

    # Receives each item submitted by the spider file and persists the page data stored in it
    # The item parameter is the received item object
    # Runs once for every item the spider submits to the pipeline
    def process_item(self, item, spider):
        # Read the stored values from the item
        author = item["author"]
        content = item["content"]
        # Persist to MySQL; the driver fills in the %s placeholders,
        # so quotes in the scraped text cannot break the statement
        sql = "insert into data values (%s, %s)"
        self.cursor = self.conn.cursor()
        try:
            self.cursor.execute(sql, (author, content))
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()

        return item

    # Called only once, when the spider finishes
    def close_spider(self, spider):
        print("Spider finished")
        self.conn.close()
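
The insert statement in this pipeline receives scraped text, which can easily contain quote characters. Passing the values as query parameters, rather than formatting them into the SQL string, keeps such text from breaking the statement. The sketch below shows the pattern with the standard-library `sqlite3` module as a stand-in for MySQL, so it runs without a database server. The two-column `data` table schema and the `save_item` helper are assumptions for illustration; note that sqlite3 uses `?` placeholders where pymysql uses `%s`.

```python
import sqlite3

# In-memory database as a stand-in for the MySQL server.
conn = sqlite3.connect(":memory:")
# Assumed schema: the original only shows a two-value insert into a table named "data".
conn.execute("CREATE TABLE data (author TEXT, content TEXT)")


def save_item(conn, item):
    """Insert one item with a parameterized query, committing or rolling back."""
    sql = "INSERT INTO data VALUES (?, ?)"
    try:
        # The driver binds the values, so quotes need no manual escaping.
        conn.execute(sql, (item["author"], item["content"]))
        conn.commit()
    except sqlite3.Error:
        conn.rollback()
        raise


save_item(conn, {"author": "O'Brien", "content": "it's a 'quoted' joke"})
rows = conn.execute("SELECT author, content FROM data").fetchall()
print(rows)
conn.close()
```

With string formatting, the apostrophe in the sample author name would end the SQL string literal early and the insert would fail.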