Scrapy Crawling for Beginners, Part 4

Date: 2021-04-11


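The previous post did not cover how to start the crawler. Scrapy is started from the command line (cmd).
Here we use Scrapy's own cmdline module to launch it from a Python file.
Name the file point.py.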

# Import execute from scrapy.cmdline; it runs a Scrapy command as if typed on the command line
from scrapy.cmdline import execute
# The argument list is [scrapy, crawl, <spider name>]
execute(['scrapy', 'crawl', 'AiquerSpider'])

Put this file in the project root directory (the original post shows a screenshot of the project layout here).
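
As a rough sketch of why the file location matters: Scrapy finds the project through its scrapy.cfg file, so the launcher has to run from inside the project. The variant below is only an illustration, not part of the original project. The helper names crawl_argv and run are made up for this example, and it assumes the script sits next to scrapy.cfg.

```python
import os
import shlex


def crawl_argv(spider_name):
    """Return the argument list equivalent to typing `scrapy crawl <spider_name>`."""
    return shlex.split("scrapy crawl " + shlex.quote(spider_name))


def run(spider_name):
    # Switch to the directory holding this script (assumed to be the
    # project root, next to scrapy.cfg) so Scrapy can find the project.
    os.chdir(os.path.dirname(os.path.abspath(__file__)))
    # Imported here so crawl_argv can be used without Scrapy installed.
    from scrapy.cmdline import execute
    execute(crawl_argv(spider_name))


# To start the crawl, call: run("AiquerSpider")
```

The list passed to execute is simply the command line split into words, which is why ['scrapy', 'crawl', 'AiquerSpider'] in point.py behaves like typing scrapy crawl AiquerSpider in cmd.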

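If you have followed the steps in my previous posts, you can use this file to test your spider (uncomment the relevant parts of the code first). You will see a flood of mysterious blue links. Waaaaah!!!!! My right hand is burning!!!!!!!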

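Now let's save the data! I have been writing project requirements for days to the point of collapse, so I will skip any real data processing and just paste the code.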

# -*- coding: utf-8 -*-

# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: http://doc.scrapy.org/en/latest/topics/item-pipeline.html
import json

class AiquerPipeline(object):
    def __init__(self):
        # Open the output file
        self.file = open('data.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        # Serialize the item's fields as one JSON object per line
        line = json.dumps(dict(item), ensure_ascii=False) + "\n"
        # Write the line to the file
        self.file.write(line)
        # Return the item so any later pipeline can process it
        return item

    # Called when the spider is opened.
    def open_spider(self, spider):
        pass

    # Called when the spider is closed.
    def close_spider(self, spider):
        # Close the file so buffered data is flushed to disk
        self.file.close()
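
Because the pipeline writes one JSON object per line, data.json is a JSON Lines file rather than a single JSON document, so json.load on the whole file would fail once there is more than one item. A minimal sketch of reading it back, line by line, might look like this. The function names are made up for the example, and the "title" field in the usage note is hypothetical.

```python
import json


def parse_json_lines(lines):
    """Turn an iterable of JSON Lines text into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def load_items(path='data.json'):
    # Iterating over a text file yields one line at a time,
    # which matches how the pipeline wrote the items.
    with open(path, encoding='utf-8') as f:
        return parse_json_lines(f)
```

For example, parse_json_lines(['{"title": "a"}\n', '{"title": "b"}\n']) gives back two dicts, one per scraped item.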

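Before this pipeline will run, it has to be registered. Go back to settings.py, find the "Configure item pipelines" section, and uncomment the lines under it. We have no special requirements here, so the priority can stay as it is.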

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'AiQuer.pipelines.AiquerPipeline': 300,
}

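AiQuer.pipelines.AiquerPipeline is the class being registered. The 300 on the right is that pipeline's priority: values are conventionally kept in the 0 to 1000 range, and pipelines with smaller values run first.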
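
To make the ordering rule concrete, here is a small sketch that only mimics it with a plain sort. It is not Scrapy's own code, and CleanupPipeline is a hypothetical second pipeline added purely for illustration.

```python
# A settings-style dict; CleanupPipeline is hypothetical.
ITEM_PIPELINES = {
    'AiQuer.pipelines.AiquerPipeline': 300,
    'AiQuer.pipelines.CleanupPipeline': 100,
}


def run_order(pipelines):
    """Return pipeline paths sorted so that lower priority values come first."""
    return [path for path, _ in sorted(pipelines.items(), key=lambda kv: kv[1])]
```

With these values, run_order(ITEM_PIPELINES) lists CleanupPipeline before AiquerPipeline, so an item would be cleaned up before it is written to data.json.
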
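Since I did no real data processing, everything is saved straight to JSON, and it comes out as one very long dump that makes your eyes swim.
The next post covers how to save the data to a database.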