Using and running multiple spiders, items, and pipelines in the Scrapy framework

With Scrapy you can create a single project that contains multiple spiders, give each spider its own items and pipelines, and start all of them at once from a single launch script.

The code for this article has been uploaded to GitHub; the link is at the end of the article.

I. Create a Scrapy project with multiple spiders

scrapy startproject mymultispider
cd mymultispider
scrapy genspider myspd1 sina.com.cn
scrapy genspider myspd2 sina.com.cn
scrapy genspider myspd3 sina.com.cn

II. Running the spiders

1. To make the behaviour easy to observe, print some identifying information in each spider:

import scrapy
class Myspd1Spider(scrapy.Spider):
    name = 'myspd1'
    allowed_domains = ['sina.com.cn']
    start_urls = ['http://sina.com.cn/']

    def parse(self, response):
        print('myspd1')

The other spiders, myspd2 and myspd3, print their own names in the same way.

2. There are two ways to run multiple spiders. The first is simpler: create a crawl.py file in the project directory with the following content:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

process = CrawlerProcess(get_project_settings())

# 'myspd1' is the spider name
process.crawl('myspd1')
process.crawl('myspd2')
process.crawl('myspd3')

process.start()

To keep the output readable, you can restrict log output in settings.py:

LOG_LEVEL = 'ERROR'

Run this file (for example by right-clicking it in the IDE). The output is as follows:

[Screenshot: console output]

3. The second way is to modify the source of the crawl command, which can be found in the official GitHub repository: https://github.com/scrapy/scrapy/blob/master/scrapy/commands/crawl.py

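Create a mycmd directory at the same level as the spiders directory, create a file mycrawl.py inside it, copy the crawl source into that file, and change its run method to the following: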

def run(self, args, opts):
    # Get the names of all spiders in the project
    spd_loader_list = self.crawler_process.spider_loader.list()
    # Schedule every spider (args is only used if the loader finds none)
    for spname in spd_loader_list or args:
        self.crawler_process.crawl(spname, **opts.spargs)
        print("Starting spider: " + spname)
    self.crawler_process.start()
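
One detail of the loop above: the expression `spd_loader_list or args` evaluates to `spd_loader_list` whenever the loader finds at least one spider, so names passed on the command line are only used when that list is empty. The sketch below illustrates this with plain lists and shows a hypothetical helper, `select_spiders` (not part of Scrapy or of the original code), that would instead prefer explicitly requested names.

```python
def select_spiders(available, requested):
    """Return the requested spider names if any were given, else all available ones.

    Hypothetical helper for illustration only; it is not part of Scrapy.
    """
    if requested:
        unknown = [name for name in requested if name not in available]
        if unknown:
            raise ValueError("unknown spider(s): " + ", ".join(unknown))
        return list(requested)
    return list(available)


available = ["myspd1", "myspd2", "myspd3"]

# What the run method above does: a non-empty list is truthy,
# so `or` never falls through to the command-line names.
chosen_by_or = available or ["myspd2"]

# What the helper does: explicit names win, otherwise everything runs.
chosen_explicit = select_spiders(available, ["myspd2"])
chosen_default = select_spiders(available, [])
```

Inside run, such a helper could be called as `select_spiders(spd_loader_list, args)` if running a subset from the command line is wanted.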

Create an initialization file, __init__.py, in the same directory as mycrawl.py.
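
For Scrapy to discover a custom command, the package that contains it also has to be registered through the `COMMANDS_MODULE` setting. A minimal sketch for settings.py follows, assuming the project package is named mymultispider and mycmd sits directly inside it next to spiders; adjust the dotted path if your layout differs.

```python
# settings.py (fragment)
# Dotted path of the package that holds custom commands. Each module in it
# (here mycrawl.py) becomes a command named after the file: `scrapy mycrawl`.
COMMANDS_MODULE = "mymultispider.mycmd"
```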

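The resulting directory structure is as follows:

[Screenshot: project directory tree, with mycmd next to spiders]

Start the spiders with the command: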

scrapy mycrawl --nolog

The output is as follows:

[Screenshot: console output]

 

III. Specifying items

1. This part is straightforward: create the corresponding classes in items.py and import them in the spider.

items.py:

import scrapy


class MymultispiderItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass

class Myspd1spiderItem(scrapy.Item):
    name = scrapy.Field()

class Myspd2spiderItem(scrapy.Item):
    name = scrapy.Field()

class Myspd3spiderItem(scrapy.Item):
    name = scrapy.Field()

In the spider, for example myspd1:

# -*- coding: utf-8 -*-
import scrapy
from mymultispider.items import Myspd1spiderItem

class Myspd1Spider(scrapy.Spider):
    name = 'myspd1'
    allowed_domains = ['sina.com.cn']
    start_urls = ['http://sina.com.cn/']

    def parse(self, response):
        print('myspd1')
        item = Myspd1spiderItem()
        item['name'] = 'myspd1 pipelines'
        yield item

IV. Specifying pipelines

1. There are again two ways to do this. Method one: define multiple pipeline classes.

In pipelines.py:

class Myspd1spiderPipeline(object):
    def process_item(self, item, spider):
        print(item['name'])
        return item

class Myspd2spiderPipeline(object):
    def process_item(self, item, spider):
        print(item['name'])
        return item

class Myspd3spiderPipeline(object):
    def process_item(self, item, spider):
        print(item['name'])
        return item

1.1 Enable the pipelines in settings.py

ITEM_PIPELINES = {
   # 'mymultispider.pipelines.MymultispiderPipeline': 300,
   'mymultispider.pipelines.Myspd1spiderPipeline': 300,
   'mymultispider.pipelines.Myspd2spiderPipeline': 300,
   'mymultispider.pipelines.Myspd3spiderPipeline': 300,
}
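
The integer assigned to each pipeline controls ordering: when several pipelines are enabled for the same spider, Scrapy passes each item through them from the lowest value to the highest (values are conventionally kept in the 0 to 1000 range). The sketch below only imitates that ordering with a plain sort; the distinct values 100, 200 and 300 are made up for illustration and differ from the settings above, where all three pipelines share 300.

```python
# Hypothetical priorities, chosen only to make the ordering visible.
ITEM_PIPELINES = {
    "mymultispider.pipelines.Myspd3spiderPipeline": 300,
    "mymultispider.pipelines.Myspd1spiderPipeline": 100,
    "mymultispider.pipelines.Myspd2spiderPipeline": 200,
}

# Lower numbers run first, so sort the class paths by their value.
run_order = sorted(ITEM_PIPELINES, key=ITEM_PIPELINES.get)

for position, path in enumerate(run_order, start=1):
    # Print only the class name, i.e. the part after the last dot.
    print(position, path.rsplit(".", 1)[-1])
```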

1.2 Set the pipeline in the spider, for example myspd1

# -*- coding: utf-8 -*-
import scrapy
from mymultispider.items import Myspd1spiderItem

class Myspd1Spider(scrapy.Spider):
    name = 'myspd1'
    allowed_domains = ['sina.com.cn']
    start_urls = ['http://sina.com.cn/']
    custom_settings = {
        'ITEM_PIPELINES': {'mymultispider.pipelines.Myspd1spiderPipeline': 300},
    }

    def parse(self, response):
        print('myspd1')
        item = Myspd1spiderItem()
        item['name'] = 'myspd1 pipelines'
        yield item

The code that specifies the pipeline is:

custom_settings = {
    'ITEM_PIPELINES': {'mymultispider.pipelines.Myspd1spiderPipeline': 300},
}
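
This works because settings declared in a spider's `custom_settings` take precedence over the project-wide values in settings.py, so each spider ends up with only its own pipeline enabled. The following is a simplified model of that precedence using plain dictionaries, not Scrapy's actual settings implementation; `effective_setting` is a hypothetical helper introduced only for this illustration.

```python
# Project-wide values, as they would appear in settings.py.
PROJECT_SETTINGS = {
    "LOG_LEVEL": "ERROR",
    "ITEM_PIPELINES": {
        "mymultispider.pipelines.Myspd1spiderPipeline": 300,
        "mymultispider.pipelines.Myspd2spiderPipeline": 300,
        "mymultispider.pipelines.Myspd3spiderPipeline": 300,
    },
}

# Per-spider values, as declared in Myspd1Spider.custom_settings.
MYSPD1_CUSTOM_SETTINGS = {
    "ITEM_PIPELINES": {"mymultispider.pipelines.Myspd1spiderPipeline": 300},
}


def effective_setting(name, project, custom):
    """Look a setting up per spider first, then fall back to the project value."""
    return custom[name] if name in custom else project[name]


pipelines = effective_setting("ITEM_PIPELINES", PROJECT_SETTINGS, MYSPD1_CUSTOM_SETTINGS)
log_level = effective_setting("LOG_LEVEL", PROJECT_SETTINGS, MYSPD1_CUSTOM_SETTINGS)
```

In this model the spider-level ITEM_PIPELINES value replaces the project-level one as a whole, while settings the spider does not mention, such as LOG_LEVEL, still come from the project.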

1.3 Run crawl.py. The result is as follows:

[Screenshot: console output]

2. Method two: inside pipelines.py, check which spider produced the item.

2.1 In pipelines.py:

class MymultispiderPipeline(object):
    def process_item(self, item, spider):
        if spider.name == 'myspd1':
            print('myspd1 pipelines')
        elif spider.name == 'myspd2':
            print('myspd2 pipelines')
        elif spider.name == 'myspd3':
            print('myspd3 pipelines')
        return item
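
As the number of spiders grows, the if/elif chain can be replaced by a lookup table keyed on `spider.name`. A minimal sketch of that variation follows. The handler method names, the `seen` list, and the `SimpleNamespace` stand-in for a spider are assumptions made so the example runs without Scrapy; items from unknown spiders simply pass through untouched.

```python
from types import SimpleNamespace


class MymultispiderPipeline:
    """Route each item to a handler chosen by the name of the spider that produced it."""

    def __init__(self):
        self.seen = []  # records (spider name, item name) pairs for the demo
        # Map spider names to bound handler methods (hypothetical names).
        self.handlers = {
            "myspd1": self.handle_myspd1,
            "myspd2": self.handle_myspd2,
        }

    def process_item(self, item, spider):
        handler = self.handlers.get(spider.name)
        if handler is not None:
            handler(item)
        return item  # always return the item so later pipelines receive it

    def handle_myspd1(self, item):
        self.seen.append(("myspd1", item["name"]))

    def handle_myspd2(self, item):
        self.seen.append(("myspd2", item["name"]))


# Stand-in objects: a real run would pass Scrapy spider and item instances.
pipeline = MymultispiderPipeline()
first = pipeline.process_item({"name": "a"}, SimpleNamespace(name="myspd1"))
second = pipeline.process_item({"name": "b"}, SimpleNamespace(name="myspd2"))
other = pipeline.process_item({"name": "c"}, SimpleNamespace(name="unknown"))
```

Adding a spider then means adding one dictionary entry and one handler method rather than another branch.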

2.2 In settings.py, enable only the MymultispiderPipeline pipeline

ITEM_PIPELINES = {
   'mymultispider.pipelines.MymultispiderPipeline': 300,
   # 'mymultispider.pipelines.Myspd1spiderPipeline': 300,
   # 'mymultispider.pipelines.Myspd2spiderPipeline': 300,
   # 'mymultispider.pipelines.Myspd3spiderPipeline': 300,
}

2.3 In the spider, comment out the code that specifies a pipeline

# -*- coding: utf-8 -*-
import scrapy
from mymultispider.items import Myspd1spiderItem

class Myspd1Spider(scrapy.Spider):
    name = 'myspd1'
    allowed_domains = ['sina.com.cn']
    start_urls = ['http://sina.com.cn/']
    # custom_settings = {
    #     'ITEM_PIPELINES': {'mymultispider.pipelines.Myspd1spiderPipeline': 300},
    # }

    def parse(self, response):
        print('myspd1')
        item = Myspd1spiderItem()
        item['name'] = 'myspd1 pipelines'
        yield item

2.4 Run crawl.py. The result is as follows:

[Screenshot: console output]

 

Code repository: https://github.com/terroristhouse/crawler

 

Python tutorial series:

Link: https://pan.baidu.com/s/10eUCb1tD9GPuua5h_ERjHA

Extraction code: h0td

 
