Several ways to run multiple Scrapy spiders at the same time (custom Scrapy project commands)

  Looking back, the earlier experiments and examples all used a single spider. In real-world development, however, a project will almost certainly contain more than one. That raises two questions: first, how do you create multiple spiders in the same project? Second, once there are several spiders, how do you run them all?

  Note: this article builds on the previous articles and experiments. If you missed them, or anything is unclear, you can catch up here:

  The pitfalls I hit installing the Python crawler framework Scrapy, and some thoughts beyond programming

  Scrapy crawler growth diary: creating a project, extracting data, and saving the data as JSON

  Scrapy crawler growth diary: writing the crawled content into a MySQL database

  How to keep your Scrapy crawler from getting banned

  I. Creating spiders

  1. Create multiple spiders: scrapy genspider spidername domain

scrapy genspider CnblogsHomeSpider cnblogs.com

  The command above creates a spider named CnblogsHomeSpider whose start_urls is http://www.cnblogs.com/
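
  For reference, the file that genspider produces from the default basic template looks roughly like this (the exact class name and template contents vary a little between Scrapy versions):

import scrapy

class CnblogshomespiderSpider(scrapy.Spider):
    name = "CnblogsHomeSpider"
    allowed_domains = ["cnblogs.com"]
    start_urls = ['http://www.cnblogs.com/']

    def parse(self, response):
        # genspider leaves the parse callback empty; the extraction logic goes here
        pass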

  2. Check which spiders the project contains: scrapy list

[root@bogon cnblogs]# scrapy list
CnblogsHomeSpider
CnblogsSpider

  This shows that my project has two spiders, one named CnblogsHomeSpider and the other named CnblogsSpider.

  For more on Scrapy commands, see: http://doc.scrapy.org/en/latest/topics/commands.html

  II. Getting several spiders to run at the same time

  Now that the project has two spiders, how do we get both of them running at once? You might suggest writing a shell script that calls them one after another, or a Python script that runs them in turn, and so on. Indeed, on stackoverflow.com I saw quite a few people who had implemented it exactly that way. The official documentation, however, describes it as follows.

  1. Run Scrapy from a script

import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider(scrapy.Spider):
    # Your spider definition
    ...

process = CrawlerProcess({
    'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
})

process.crawl(MySpider)
process.start() # the script will block here until the crawling is finished

  The key here is scrapy.crawler.CrawlerProcess, which lets you run a spider from inside a script. More examples can be found at: https://github.com/scrapinghub/testspiders
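
  Inside an existing Scrapy project you can also pass in the project settings and refer to a spider by its name instead of importing the class. A minimal sketch, assuming it is run from the root of the cnblogs project so that the project settings can be located:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# load the project's settings.py so pipelines, middlewares, etc. stay active
process = CrawlerProcess(get_project_settings())
process.crawl('CnblogsHomeSpider')  # look the spider up by name
process.start()                     # blocks until the crawl is finished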

  2. Running multiple spiders in the same process

  • Via CrawlerProcess
import scrapy
from scrapy.crawler import CrawlerProcess

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

process = CrawlerProcess()
process.crawl(MySpider1)
process.crawl(MySpider2)
process.start() # the script will block here until all crawling jobs are finished
  • Via CrawlerRunner
import scrapy
from twisted.internet import reactor
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
runner = CrawlerRunner()
runner.crawl(MySpider1)
runner.crawl(MySpider2)
d = runner.join()
d.addBoth(lambda _: reactor.stop())

reactor.run() # the script will block here until all crawling jobs are finished
  • Via CrawlerRunner, chaining deferreds to run the spiders sequentially
import scrapy
from twisted.internet import reactor, defer
from scrapy.crawler import CrawlerRunner
from scrapy.utils.log import configure_logging

class MySpider1(scrapy.Spider):
    # Your first spider definition
    ...

class MySpider2(scrapy.Spider):
    # Your second spider definition
    ...

configure_logging()
runner = CrawlerRunner()

@defer.inlineCallbacks
def crawl():
    yield runner.crawl(MySpider1)
    yield runner.crawl(MySpider2)
    reactor.stop()

crawl()
reactor.run() # the script will block here until the last crawl call is finished

  These are the approaches the official documentation offers for running spiders from inside a script.
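
  As a rough rule of thumb, CrawlerProcess starts and manages the Twisted reactor for you, which makes it the most convenient choice for a standalone script, whereas CrawlerRunner leaves reactor management to the caller and is better suited when the crawl has to be embedded in an application that already runs its own Twisted reactor.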

  III. Running the spiders via a custom Scrapy command

  For creating custom project commands, see: http://doc.scrapy.org/en/master/topics/commands.html?highlight=commands_module#custom-project-commands

  1. Create a commands directory

mkdir commands

  Note: the commands directory sits at the same level as the spiders directory
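
  For reference, assuming the project is named cnblogs as in the earlier articles, after the following steps the directory layout should look roughly like this:

cnblogs/
├── scrapy.cfg
└── cnblogs/
    ├── __init__.py
    ├── items.py
    ├── pipelines.py
    ├── settings.py
    ├── commands/
    │   ├── __init__.py
    │   └── crawlall.py
    └── spiders/
        ├── __init__.py
        └── ...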

  2. Add a file named crawlall.py under commands

  The idea is to adapt Scrapy's built-in crawl command so that it executes all spiders at once. The source of crawl can be viewed here: https://github.com/scrapy/scrapy/blob/master/scrapy/commands/crawl.py

from scrapy.commands import ScrapyCommand
from scrapy.exceptions import UsageError
from scrapy.utils.conf import arglist_to_dict

class Command(ScrapyCommand):

    requires_project = True

    def syntax(self):
        return '[options]'

    def short_desc(self):
        return 'Runs all of the spiders'

    def add_options(self, parser):
        ScrapyCommand.add_options(self, parser)
        parser.add_option("-a", dest="spargs", action="append", default=[], metavar="NAME=VALUE",
                          help="set spider argument (may be repeated)")
        parser.add_option("-o", "--output", metavar="FILE",
                          help="dump scraped items into FILE (use - for stdout)")
        parser.add_option("-t", "--output-format", metavar="FORMAT",
                          help="format to use for dumping items with -o")

    def process_options(self, args, opts):
        ScrapyCommand.process_options(self, args, opts)
        try:
            opts.spargs = arglist_to_dict(opts.spargs)
        except ValueError:
            raise UsageError("Invalid -a value, use -a NAME=VALUE", print_help=False)

    def run(self, args, opts):
        # schedule every spider in the project, or only the ones named on the command line
        spider_loader = self.crawler_process.spider_loader
        for spidername in args or spider_loader.list():
            print("*********crawlall spidername************ " + spidername)
            self.crawler_process.crawl(spidername, **opts.spargs)
        # run all scheduled crawls in this single process
        self.crawler_process.start()

  The core of it is self.crawler_process.spider_loader.list(), which returns every spider in the project, and self.crawler_process.crawl, which schedules each one. Since run() iterates over args or spider_loader.list(), naming spiders on the command line restricts the run to just those, while passing no arguments runs them all.

  3. Add an __init__.py file under the commands directory

touch __init__.py

  Note: this step absolutely must not be skipped. I spent a whole day tripped up by exactly this problem. Sigh... I suppose that's what I get for switching into this field midway.

  If you leave it out, you will get an exception like this:

Traceback (most recent call last):
  File "/usr/local/bin/scrapy", line 9, in <module>
    load_entry_point('Scrapy==1.0.0rc2', 'console_scripts', 'scrapy')()
  File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/cmdline.py", line 122, in execute
    cmds = _get_commands_dict(settings, inproject)
  File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/cmdline.py", line 50, in _get_commands_dict
    cmds.update(_get_commands_from_module(cmds_module, inproject))
  File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/cmdline.py", line 29, in _get_commands_from_module
    for cmd in _iter_command_classes(module):
  File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/cmdline.py", line 20, in _iter_command_classes
    for module in walk_modules(module_name):
  File "/usr/local/lib/python2.7/site-packages/Scrapy-1.0.0rc2-py2.7.egg/scrapy/utils/misc.py", line 63, in walk_modules
    mod = import_module(path)
  File "/usr/local/lib/python2.7/importlib/__init__.py", line 37, in import_module
    __import__(name)
ImportError: No module named commands

  At first I could not find the cause no matter how hard I looked. It cost me an entire day, until I got help from other users on http://stackoverflow.com/. Thanks once again to the almighty Internet; how much better things would be without that wall! But I digress, back to the topic.

  4. Create a setup.py in the same directory as settings.py (this step can be dropped with no effect; I am not sure what specific purpose the official documentation has in mind by including it.)

from setuptools import setup, find_packages

setup(
    name='scrapy-mymodule',
    entry_points={
        'scrapy.commands': [
            'crawlall=cnblogs.commands:crawlall',
        ],
    },
)

  This file defines a crawlall command, where cnblogs.commands is the package containing the command file and crawlall is the command name.
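
  As an aside, that setup.py only takes effect if the package is actually installed (for example with pip install -e . from the directory containing setup.py), because Scrapy discovers scrapy.commands entry points through setuptools. For a command used only inside this project, the COMMANDS_MODULE setting in the next step is enough on its own, which is presumably why skipping the setup.py step makes no difference.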

  5. Add the following configuration to settings.py:

COMMANDS_MODULE = 'cnblogs.commands'

  6. Run the command: scrapy crawlall
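
  If everything is wired up correctly, the command schedules every spider in the project and the console output should contain lines along these lines (the surrounding Scrapy log output will vary):

[root@bogon cnblogs]# scrapy crawlall
*********crawlall spidername************ CnblogsHomeSpider
*********crawlall spidername************ CnblogsSpider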

  Finally, the source code has been updated here: https://github.com/jackgitgz/CnblogsSpider
