As everyone knows, running scrapy crawl yourspidername on the command line starts the spider named yourspidername in the project. In a Python script, you can call the cmdline module to launch that same command line:
$ cat yourspider1start.py
import sys
import os
from scrapy import cmdline

# Method 1
cmdline.execute('scrapy crawl yourspidername'.split())

# Method 2
sys.argv = ['scrapy', 'crawl', 'down_info_spider']
cmdline.execute()

# Method 3: spawn a child process to run an external program.
# It only returns the program's exit status; 0 means success.
os.system('scrapy crawl down_info_spider')

# Method 4
import subprocess
subprocess.Popen('scrapy crawl down_info_spider'.split())
Of methods 3 and 4, subprocess is the recommended one:

The subprocess module intends to replace several other, older modules and functions, such as:
os.system
os.spawn*
os.popen*
popen2.*
commands.*

The poll() method of the object returned by subprocess.Popen can be used to check whether the child process has finished; see the sketch below.
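A minimal sketch of that poll-based check, reusing the down_info_spider name from the listing above (any spider name in your project would work the same way):

import subprocess
import time

# Start the spider in a child process; passing an argument list avoids shell parsing.
proc = subprocess.Popen(['scrapy', 'crawl', 'down_info_spider'])

# poll() returns None while the child is still running,
# and its exit code (0 on success) once it has exited.
while proc.poll() is None:
    time.sleep(2)

print('spider finished, return code:', proc.returncode)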
We can also start all the spiders directly from a shell script, two seconds apart:
$ cat startspiders.sh
#!/usr/bin/env bash
count=0
while [ $count -lt $1 ]; do
    sleep 2
    nohup python yourspider1start.py >/dev/null 2>&1 &
    nohup python yourspider2start.py >/dev/null 2>&1 &
    let count+=1
done
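The loop count is passed as the script's first argument ($1), so invoking it as, say, bash startspiders.sh 3 backgrounds three instances of each spider script, spaced two seconds apart.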
All of the methods above essentially just launch the scrapy command line. How can we start spiders programmatically, by calling Scrapy's internal APIs instead?
The official documentation provides two Scrapy utilities for this: CrawlerRunner and CrawlerProcess.

The Scrapy framework is built on the Twisted asynchronous networking library, and CrawlerRunner and CrawlerProcess let us start Scrapy from inside the Twisted reactor.

Using CrawlerRunner directly gives finer-grained control over the crawl, but you have to register the callback that stops the Twisted reactor after the spider finishes yourself. If you don't intend to run any other Twisted reactor code in your application, its subclass CrawlerProcess is the more convenient choice.
Below are usage examples based on the ones given in the documentation:
# encoding: utf-8
__author__ = 'fengshenjie'
from twisted.internet import reactor
from scrapy.utils.project import get_project_settings


def run1_single_spider():
    '''Running spiders outside projects: only the spider is invoked,
    items never reach the project pipelines.'''
    from scrapy.crawler import CrawlerProcess
    from scrapy_test1.spiders import myspider1
    process = CrawlerProcess({
        'USER_AGENT': 'Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 5.1)'
    })
    process.crawl(myspider1)
    process.start()  # the script will block here until the crawling is finished


def run2_inside_scrapy():
    '''The project pipelines are enabled.'''
    from scrapy.crawler import CrawlerProcess
    process = CrawlerProcess(get_project_settings())
    process.crawl('spidername')  # the "name" attribute of a spider in the Scrapy project
    process.start()


def spider_closing(arg):
    print('spider close')
    reactor.stop()


def run3_crawlerRunner():
    '''If your application already uses Twisted, use CrawlerRunner rather than
    CrawlerProcess. Note that you will also have to shut down the Twisted reactor
    yourself after the spider is finished. This can be achieved by adding callbacks
    to the deferred returned by the CrawlerRunner.crawl method.'''
    from scrapy.crawler import CrawlerRunner
    runner = CrawlerRunner(get_project_settings())
    # 'spidername' is the name of one of the spiders of the project.
    d = runner.crawl('spidername')
    # stop the reactor when the spider closes
    # d.addBoth(lambda _: reactor.stop())
    d.addBoth(spider_closing)  # equivalent to the line above
    reactor.run()  # the script will block here until the crawling is finished


def run4_multiple_spider():
    from scrapy.crawler import CrawlerProcess
    process = CrawlerProcess()
    from scrapy_test1.spiders import myspider1, myspider2
    for s in [myspider1, myspider2]:
        process.crawl(s)
    process.start()


def run5_multiplespider():
    '''using CrawlerRunner'''
    from twisted.internet import reactor
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging

    configure_logging()
    runner = CrawlerRunner()
    from scrapy_test1.spiders import myspider1, myspider2
    for s in [myspider1, myspider2]:
        runner.crawl(s)
    d = runner.join()
    d.addBoth(lambda _: reactor.stop())
    reactor.run()  # the script will block here until all crawling jobs are finished


def run6_multiplespider():
    '''Run the spiders sequentially by chaining deferreds.'''
    from twisted.internet import reactor, defer
    from scrapy.crawler import CrawlerRunner
    from scrapy.utils.log import configure_logging

    configure_logging()
    runner = CrawlerRunner()

    @defer.inlineCallbacks
    def crawl():
        from scrapy_test1.spiders import myspider1, myspider2
        for s in [myspider1, myspider2]:
            yield runner.crawl(s)
        reactor.stop()

    crawl()
    reactor.run()  # the script will block here until the last crawl call is finished


if __name__ == '__main__':
    # run4_multiple_spider()
    # run5_multiplespider()
    run6_multiplespider()
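Both process.crawl() and runner.crawl() also forward extra keyword arguments to the spider's constructor, which is the programmatic equivalent of scrapy crawl spidername -a key=value. A minimal sketch, assuming the myspider1 class from the examples above simply stores a hypothetical start_url argument:

from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings
from scrapy_test1.spiders import myspider1

process = CrawlerProcess(get_project_settings())
# Keyword arguments are passed to the spider's __init__, just like -a on the command line.
process.crawl(myspider1, start_url='http://example.com')
process.start()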