How to start Scrapy spiders from a script

As is well known, running scrapy crawl yourspidername on the command line starts the spider named yourspidername in the project. In a Python script, you can invoke the cmdline module to run that same command line:

$ cat yourspider1start.py
import sys
import os
import subprocess
from scrapy import cmdline

# Method 1: pass the command line as an argument list
cmdline.execute('scrapy crawl yourspidername'.split())

# Method 2: set sys.argv and let cmdline pick it up
sys.argv = ['scrapy', 'crawl', 'down_info_spider']
cmdline.execute()

# Method 3: spawn a child process to run the external command.
# os.system only returns the command's exit status; 0 means success.
os.system('scrapy crawl down_info_spider')

# Method 4: run the command in a child process via subprocess
subprocess.Popen('scrapy crawl down_info_spider'.split())

Of methods 3 and 4, subprocess is the recommended choice. As the Python documentation puts it:

The subprocess module intends to replace several other, older modules and functions, such as:

os.system
os.spawn*
os.popen*
popen2.*
commands.*

The poll() method of the Popen object returned in method 4 lets you check whether the child process has finished.
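
For example, here is a minimal sketch of waiting for the crawl to finish by polling the child process (reusing the down_info_spider name from the snippet above):

import subprocess
import time

proc = subprocess.Popen('scrapy crawl down_info_spider'.split())
while proc.poll() is None:  # poll() returns None while the child is still running
    time.sleep(1)
print('spider finished with exit code', proc.returncode)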

We can also launch all the spiders from a shell script, starting a batch every 2 seconds:

$ cat startspiders.sh
#!/usr/bin/env bash
# Usage: ./startspiders.sh <number-of-iterations>
count=0
while [ $count -lt $1 ];
do
  sleep 2
  nohup python yourspider1start.py >/dev/null 2>&1 &
  nohup python yourspider2start.py >/dev/null 2>&1 &
  let count+=1
done

All of the approaches above ultimately just run the scrapy command line. How can we start spiders programmatically, by calling Scrapy's own APIs?

The official documentation offers two Scrapy utilities:

  1. scrapy.crawler.CrawlerRunner, which runs crawlers inside an already set up Twisted reactor
  2. scrapy.crawler.CrawlerProcess, a subclass of CrawlerRunner

Scrapy is built on the Twisted asynchronous networking library; CrawlerRunner and CrawlerProcess help us start Scrapy from inside a Twisted reactor.

Using CrawlerRunner directly gives finer-grained control over the crawl, but you must register the callback that shuts down the Twisted reactor yourself. If you do not plan to run any other Twisted reactor in your application, the subclass CrawlerProcess is the better fit.

Below are simple usage examples adapted from the documentation:

# encoding: utf-8
__author__ = 'fengshenjie'
from twisted.internet import reactor
from scrapy.utils.project import get_project_settings

def run1_single_spider():
    '''Running spiders outside a project:
    only the spider runs; the project pipelines are not used.'''
    from scrapy.crawler import CrawlerProcess
    from scrapy_test1.spiders import myspider1
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })

    process.crawl(myspider1)
    process.start()  # the script will block here until the crawling is finished

def run2_inside_scrapy():
    '''Uses the project settings, so the pipelines are enabled.'''
    from scrapy.crawler import CrawlerProcess
    process = CrawlerProcess(get_project_settings())
    process.crawl('spidername')  # the name attribute of a spider in the Scrapy project
    process.start()

def spider_closing(arg):
    print('spider close')
    reactor.stop()

def run3_crawlerRunner():
    '''If your application already uses Twisted, CrawlerRunner is recommended over CrawlerProcess.
    Note that you will also have to shut down the Twisted reactor yourself after the spider is finished. This can be achieved by adding callbacks to the deferred returned by the CrawlerRunner.crawl method.
    '''
    from scrapy.crawler import CrawlerRunner
    runner = CrawlerRunner(get_project_settings())

    # 'spidername' is the name of one of the spiders of the project.
    d = runner.crawl('spidername')
    
    # stop reactor when spider closes
    # d.addBoth(lambda _: reactor.stop())
    d.addBoth(spider_closing)  # equivalent to the lambda above

    reactor.run()  # the script will block here until the crawling is finished

def run4_multiple_spider():
    from scrapy.crawler import CrawlerProcess
    process = CrawlerProcess()

    from scrapy_test1.spiders import myspider1, myspider2
    for s in [myspider1, myspider2]:
        process.crawl(s)
    process.start()

def run5_multiplespider():
    '''using CrawlerRunner'''
    from twisted.internet import reactor
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging

    configure_logging()
    runner = CrawlerRunner()
    from scrapy_test1.spiders import myspider1, myspider2
    for s in [myspider1, myspider2]:
        runner.crawl(s)

    d = runner.join()
    d.addBoth(lambda _: reactor.stop())

    reactor.run()  # the script will block here until all crawling jobs are finished

def run6_multiplespider():
    '''Run the spiders sequentially by chaining the deferreds.'''
    from twisted.internet import reactor, defer
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging
    configure_logging()
    runner = CrawlerRunner()

    @defer.inlineCallbacks
    def crawl():
        from scrapy_test1.spiders import myspider1, myspider2
        for s in [myspider1, myspider2]:
            yield runner.crawl(s)
        reactor.stop()

    crawl()
    reactor.run()  # the script will block here until the last crawl call is finished


if __name__ == '__main__':
    # run4_multiple_spider()
    # run5_multiplespider()
    run6_multiplespider()

References

  1. Running Scrapy spiders programmatically, based on Scrapy 1.0