Persistent storage in the Scrapy framework

Date: 2019-12-09


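I. Persistent storage via terminal commands

Make sure the spider's parse method returns an iterable (usually a list of dicts). That return value can then be written to a file in the chosen format with a terminal command.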

Run the crawl with an output file to store the scraped data; the file extension selects the format:
    scrapy crawl spider_name -o xxx.json
    scrapy crawl spider_name -o xxx.xml
    scrapy crawl spider_name -o xxx.csv
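To make the effect of -o concrete, the sketch below writes a list of dicts, the shape parse is expected to return, to JSON and CSV text using only Python's standard library. It illustrates the kind of output to expect and is not how Scrapy's exporters are implemented; the sample rows and the function names export_json and export_csv are made up for illustration.

```python
import csv
import io
import json

# Hypothetical scraped rows, in the shape a parse method would return.
rows = [
    {"author": "alice", "content": "first post"},
    {"author": "bob", "content": "second post"},
]


def export_json(items):
    """Serialize the items as one JSON array."""
    return json.dumps(items, ensure_ascii=False)


def export_csv(items):
    """Serialize the items as CSV, with a header row taken from the first item."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(items[0].keys()))
    writer.writeheader()
    writer.writerows(items)
    return buffer.getvalue()


json_text = export_json(rows)
csv_text = export_csv(rows)
```

Every dict becomes one JSON object or one CSV row, which is why parse has to return structured records rather than loose strings.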

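II. Persistent storage via pipelines

Scrapy ships with an efficient, convenient persistence mechanism that we can use directly.

Before using it, we need to know what these two files are for:

            items.py: the data-structure template file; it defines the item's fields
            pipelines.py: the pipeline file; it receives the items and persists them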


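        Persistence workflow:

            1. After the spider scrapes the data, wrap it in an item object.
            2. Use the yield keyword to submit the item object to the pipeline.
            3. In the pipeline file, the process_item method receives each item submitted by the spider; write the storage code there to persist the item's data.
            4. Enable the pipeline in the settings.py configuration file.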
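The steps above can be tried out without Scrapy installed. The stand-in below mimics the order in which Scrapy calls the pipeline hooks (open_spider once, process_item for every item, close_spider once). ListPipeline, fake_parse and run_spider are hypothetical names, and plain dicts stand in for item objects.

```python
class ListPipeline:
    """Toy pipeline that 'persists' items into an in-memory list."""

    def open_spider(self, spider):
        self.stored = []                 # called once, before any item arrives

    def process_item(self, item, spider):
        self.stored.append(dict(item))   # called once per yielded item
        return item

    def close_spider(self, spider):
        self.closed = True               # called once, after the last item


def fake_parse():
    """Stand-in for a spider's parse method: wraps data and yields it."""
    for author, content in [("alice", "hello"), ("bob", "world")]:
        yield {"author": author, "content": content}


def run_spider(parse, pipeline):
    """Drive the hooks in the same order as a real crawl."""
    spider = object()                    # placeholder for the spider instance
    pipeline.open_spider(spider)
    for item in parse():
        pipeline.process_item(item, spider)
    pipeline.close_spider(spider)


pipeline = ListPipeline()
run_spider(fake_parse, pipeline)
```

In a real project step 4 still applies: the pipeline does nothing until it is listed in ITEM_PIPELINES.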

III. Examples

1. Persisting to a local file

⑴ Spider file: first_hh.py

# -*- coding: utf-8 -*-
import scrapy
from frist_boll.items import FristBollItem


class FirstHhSpider(scrapy.Spider):
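    # Name of the spider
    name = 'first_hh'
    # Allowed domains. Image URLs are sometimes hosted on a different domain
    # from the site being crawled, so this is commented out to keep them reachable.
    # allowed_domains = ['www.xxx.com']
    # List of start URLs; it may contain several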
    start_urls = ['https://www.qiushibaike.com/text/']

    def parse(self, response):
        div_list = response.xpath("//div[@id='content-left']/div")
        # all_data = []

        for div in div_list:
            # extract_first() returns the first match (or None if nothing matched);
            # extract() returns every match as a list
            author = div.xpath('./div[1]/a[2]/h2/text()').extract_first()
            content = div.xpath('./a[1]/div/span//text()').extract()

            # Store a string, not a list
            content = "".join(content)

            # Instantiate the item
            item = FristBollItem()
            # Put the parsed data into the item
            item['author'] = author
            item['content'] = content
            # Submit the item to the pipeline (pipelines.py)
            yield item

⑵ items.py file: declares the author and content fields that the spider fills in

⑶ Pipeline file: pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html


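class FristBollPipeline(object):
    f1 = None  # file handle, set in open_spider

    # process_item runs once per item, so the file is opened only once, here.
    # These methods are hooks that Scrapy calls by name, so each one must
    # accept the spider argument.
    def open_spider(self, spider):
        print("start")
        self.f1 = open('./趣事百科.csv', 'w', encoding='utf-8')

    # Called once for every item, which is why opening and closing the file
    # live in the two hooks that each run only once.
    def process_item(self, item, spider):
        # Persist the item submitted by the spider. extract_first() may have
        # returned None for author, so fall back to an empty string.
        self.f1.write((item['author'] or '') + '\n' + item['content'] + '\n')
        return item

    # The spider has finished: close the file
    def close_spider(self, spider):
        print("end")
        self.f1.close()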
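Because the output file is named .csv, one option is to let Python's csv module handle the quoting, so that commas and line breaks inside content do not break the rows. This is a sketch of an alternative way to write each item, not the author's code; CsvRowWriter is a hypothetical helper and io.StringIO stands in for the opened file.

```python
import csv
import io


class CsvRowWriter:
    """Writes one (author, content) row per item to an open text file."""

    def __init__(self, file_obj):
        self.writer = csv.writer(file_obj)
        self.writer.writerow(["author", "content"])  # header row

    def write_item(self, item):
        # extract_first() can return None, so substitute an empty string.
        self.writer.writerow([item.get("author") or "", item.get("content") or ""])


buffer = io.StringIO()   # stands in for open('out.csv', 'w', newline='')
rows = CsvRowWriter(buffer)
rows.write_item({"author": "alice", "content": "line one\nline two, with comma"})
rows.write_item({"author": None, "content": "anonymous"})

# Read the text back to confirm that each item is still exactly one row.
parsed = list(csv.reader(io.StringIO(buffer.getvalue())))
```

In a pipeline, the writer would be created in open_spider and write_item called from process_item.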

⑷ Configuration file: settings.py

# Enable the pipeline
ITEM_PIPELINES = {
    'frist_boll.pipelines.FristBollPipeline': 300,  # 300 is the priority; lower values run first
}

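2. Pipeline storage with MySQL

Scrape crawler job postings from BOSS Zhipin

First start MySQL and create the database and the table, including the columns you want to store. Scrapy only inserts data into MySQL; it does not create the schema.

Create the project: scrapy startproject boss

Enter the directory: cd boss

Create the spider: scrapy genspider boss_hh www.xxx.com

⑴ Spider file: boss_hh.py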

# -*- coding: utf-8 -*-
import scrapy
from boss.items import BossItem


class BossHhSpider(scrapy.Spider):
    name = 'boss_hh'
    # allowed_domains = ['www.xxx.com']
    start_urls = ['https://www.zhipin.com/job_detail/?query=%E7%88%AC%E8%99%AB&scity=101010100&industry=&position=']
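    # Results are paginated, and the first page's URL differs from the later ones,
    # so later pages use this template; page=%d is the page-number placeholder
    url = 'https://www.zhipin.com/c101010100/?query=python爬蟲&page=%d&ka=page-2'
    # Current page; crawling starts at page 1
    page = 1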

    def parse(self, response):
        li_list = response.xpath('//div[@class="job-list"]/ul/li')
        for li in li_list:
            job_name = li.xpath('.//div[@class="info-primary"]/h3/a/div/text()').extract_first()
            salary = li.xpath('.//div[@class="info-primary"]/h3/a/span/text()').extract_first()
            company = li.xpath('.//div[@class="company-text"]/h3/a/text()').extract_first()
            # Instantiate the item
            item = BossItem()
            # Put the parsed data into the item
            item['job_name'] = job_name
            item['salary'] = salary
            item['company'] = company
            # Submit to the pipeline for storage
            yield item
        if self.page <= 3:
            print("requesting the next page")
            # From page 2 onward, build the URL from the template
            self.page += 1
            # Fill in the page-number placeholder
            new_url = self.url % self.page

            # Send the request manually. callback names the method that parses
            # the response; passing self.parse again makes the crawl recursive.
            yield scrapy.Request(url=new_url, callback=self.parse)
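The paging logic can be checked in isolation. The sketch below replays the spider's counter and %d template and lists the URLs that would be requested; the host in URL_TEMPLATE is a placeholder and next_page_urls is a hypothetical helper.

```python
URL_TEMPLATE = "https://example.com/jobs/?query=python&page=%d"  # placeholder URL


def next_page_urls(start_page=1, last_checked_page=3):
    """Replay the spider's paging: each parse call bumps the page, then requests it."""
    page = start_page
    urls = []
    while page <= last_checked_page:   # same test as `if self.page <= 3`
        page += 1
        urls.append(URL_TEMPLATE % page)
    return urls


urls = next_page_urls()
```

With a start value of 1 and the condition page <= 3, the counter is incremented before each request, so pages 2, 3 and 4 are fetched in addition to the start URL.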

⑵ Items file: items.py

# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# https://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class BossItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()

    job_name = scrapy.Field()
    salary = scrapy.Field()
    company = scrapy.Field()

⑶ Pipeline file: pipelines.py

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymysql
from redis import Redis


class mysqlPipeline(object):
    conn = None
    cursor = None

    # Open the MySQL connection
    def open_spider(self, spider):
        self.conn = pymysql.connect(host='127.0.0.1', port=3306, user='root', password='', db='scrapy', charset='utf8')

    def process_item(self, item, spider):
        self.cursor = self.conn.cursor()
        try:
            # Write the data to MySQL; passing the values as parameters
            # lets pymysql escape them
            self.cursor.execute('insert into boss(job_name,salary,company) values(%s,%s,%s)', (item['job_name'], item['salary'], item['company']))
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
        return item  # important: hands the item to the next pipeline

    # Close the cursor first, then the MySQL connection
    def close_spider(self, spider):
        if self.cursor:
            self.cursor.close()
        self.conn.close()
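Building SQL with % string formatting breaks as soon as a value contains a quote character. A common remedy is to pass the values separately and let the driver escape them. The sketch below shows that pattern with the standard library's sqlite3 as a stand-in, so it runs without a MySQL server. sqlite3 uses ? placeholders, whereas pymysql's cursor.execute takes %s placeholders with the values as a second argument. The table mirrors the boss table the pipeline expects, and the sample item is made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")   # in-memory stand-in for the MySQL database
conn.execute("create table boss (job_name text, salary text, company text)")

# Hypothetical item whose values contain both kinds of quote character.
item = {"job_name": 'Python "crawler" engineer', "salary": "15k-25k", "company": "O'Reilly"}

try:
    # Values travel separately from the SQL text, so quotes need no manual escaping.
    conn.execute(
        "insert into boss (job_name, salary, company) values (?, ?, ?)",
        (item["job_name"], item["salary"], item["company"]),
    )
    conn.commit()
except sqlite3.Error:
    conn.rollback()

stored = conn.execute("select job_name, salary, company from boss").fetchall()
conn.close()
```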

⑷ Settings file: settings.py

ITEM_PIPELINES = {
   # 'boss.pipelines.BossPipeline': 300,
   'boss.pipelines.mysqlPipeline': 301,
}
# The first entry is the local-file pipeline; leave it commented out to run only the MySQL pipeline

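Note: in pipelines.py, the process_item method of the mysqlPipeline class ends with return item. That return value is handed to the next pipeline class. One pipeline file can hold both a local-disk pipeline and a MySQL pipeline, and the priorities in the ITEM_PIPELINES setting decide which of them stores first; return item passes the item from the pipeline that runs first to the one that runs after it. Always return the item, otherwise the later pipelines receive None instead of the item.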
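The effect of return item and of the priority numbers can be imitated in a few lines. In this sketch, run_pipelines is a hypothetical stand-in for Scrapy's pipeline handling: it sorts the classes by priority value and feeds each pipeline the return value of the previous one. The real setting uses dotted-path strings as keys; classes are used here only to keep the example self-contained. ForgetfulPipeline shows what the next stage receives when return item is left out.

```python
class FilePipeline:
    def process_item(self, item, spider):
        item["seen_by"].append("file")
        return item                      # hand the item to the next pipeline


class DbPipeline:
    def process_item(self, item, spider):
        item["seen_by"].append("db")
        return item


class ForgetfulPipeline:
    def process_item(self, item, spider):
        pass                             # no return: the next stage gets None


def run_pipelines(settings, item, spider=None):
    """Call process_item in ascending priority order, chaining the return values."""
    for cls, _priority in sorted(settings.items(), key=lambda pair: pair[1]):
        item = cls().process_item(item, spider)
    return item


ordered = run_pipelines({FilePipeline: 300, DbPipeline: 301}, {"seen_by": []})
reversed_order = run_pipelines({FilePipeline: 300, DbPipeline: 200}, {"seen_by": []})
lost = run_pipelines({ForgetfulPipeline: 100}, {"seen_by": []})
```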

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://doc.scrapy.org/en/latest/topics/item-pipeline.html

import pymysql
from redis import Redis


class BossPipeline(object):
    f1 = None

    def open_spider(self, spider):
        print("start ----------->")
        self.f1 = open('boss.txt', 'w', encoding='utf-8')

    def close_spider(self, spider):
        print("end ------------->")
        self.f1.close()

    def process_item(self, item, spider):
        self.f1.write(item['job_name'] + ":" + item['salary'] + ":" + item['company'] + '\n')
        return item  # hand the item on to mysqlPipeline


class mysqlPipeline(object):
    conn = None
    cursor = None

    def open_spider(self, spider):
        self.conn = pymysql.connect(host='127.0.0.1', port=3306, user='root', password='', db='scrapy', charset='utf8')

    def process_item(self, item, spider):
        self.cursor = self.conn.cursor()
        try:
            # Pass the values as parameters so that pymysql escapes them
            self.cursor.execute('insert into boss(job_name,salary,company) values(%s,%s,%s)', (item['job_name'], item['salary'], item['company']))
            self.conn.commit()
        except Exception as e:
            print(e)
            self.conn.rollback()
        return item

    def close_spider(self, spider):
        # Close the cursor before the connection
        if self.cursor:
            self.cursor.close()
        self.conn.close()

Pipeline file: storing both locally and in MySQL

3. Pipeline storage with Redis

Only the pipeline file needs to change.

import json

from redis import Redis

class redisPileLine(object):
    conn = None

    def open_spider(self, spider):
        self.conn = Redis(host='127.0.0.1', port=6379)
        print(self.conn)

    def process_item(self, item, spider):
        # print(item)
        dic = {
            'name': item['job_name'],
            'salary': item['salary'],
            'company': item['company']
        }
        # redis-py 3.0+ rejects dict values, so store the record as a JSON string
        self.conn.lpush('boss', json.dumps(dic, ensure_ascii=False))
        return item
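Redis stores strings or bytes, not Python dicts, so a record is usually serialized before lpush and decoded after it is read back. The sketch below shows that round trip with the json module only; the bytes value stands in for what a Redis client typically returns from a list read, and no Redis server is contacted. The sample record is made up.

```python
import json

# Hypothetical record in the shape the pipeline builds from an item.
record = {"name": "python爬蟲", "salary": "15k-25k", "company": "example"}

# What would be passed to lpush: a JSON string instead of the dict itself.
payload = json.dumps(record, ensure_ascii=False)

# Redis clients typically hand values back as bytes unless told to decode them.
raw_from_redis = payload.encode("utf-8")

# json.loads accepts UTF-8 bytes directly and restores the original dict.
restored = json.loads(raw_from_redis)
```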

Settings file:

ITEM_PIPELINES = {
    'boss.pipelines.redisPileLine': 300,
}

IV. Summary

If the scraped data must be stored once in a disk file and once in a database, how should Scrapy be set up?

Answer: the code in the pipeline file

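# This is a pipeline class; its process_item method implements the persistent storage.
class DoublekillPipeline(object):

    def process_item(self, item, spider):
        # Persistence code (method 1: write to a disk file)
        return item

# For another form of persistence, define one more pipeline class:
class DoublekillPipeline_db(object):

    def process_item(self, item, spider):
        # Persistence code (method 2: write to the database)
        return item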

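Code in settings.py that enables the pipelines:

# The setting is a dict: each key is a pipeline class to enable and each value is its priority.
ITEM_PIPELINES = {
    'doublekill.pipelines.DoublekillPipeline': 300,
    'doublekill.pipelines.DoublekillPipeline_db': 200,
}

# With the two entries above, the process_item method of both pipeline classes runs for every item, giving two different forms of persistence. The lower value runs first, so here DoublekillPipeline_db receives each item before DoublekillPipeline.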