The source code analyzed here is version 0.24.6, url: https://github.com/DiamondStudio/scrapy/blob/0.24.6
As shown in the Scrapy source tree on GitHub, the sub-packages are:
commands, contracts, contrib, contrib_exp, core, http, selector, settings, templates, tests, utils, xlib
The modules are:
_monkeypatches.py, cmdline.py, conf.py, conftest.py, crawler.py, dupefilter.py, exceptions.py,
extension.py, interface.py, item.py, link.py, linkextractor.py, log.py, logformatter.py, mail.py,
middleware.py, project.py, resolver.py, responsetypes.py, shell.py, signalmanager.py, signals.py,
spider.py, spidermanager.py, squeue.py, stats.py, statscol.py, telnet.py, webservice.py
Let's start the analysis with the most important modules.
spider.py defines the spider base class, BaseSpider (in 0.24 the class is exposed as Spider, with BaseSpider kept as a deprecated alias). Each spider instance can be bound to exactly one crawler attribute. So what does a crawler provide?
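To keep the picture concrete, here is a minimal sketch of a spider built on this base class. The spider name and URL are invented, and the code assumes the 0.24-era public API (scrapy.spider.Spider, Selector, Request); it is an illustration, not a listing from the source tree.

# A minimal spider sketch against the 0.24-era base class. The name and
# start URL are made up; Selector/Request reflect that era's public API.
from urlparse import urljoin           # Python 2, which 0.24 targets

from scrapy.spider import Spider       # BaseSpider survives as a deprecated alias
from scrapy.http import Request
from scrapy.selector import Selector


class ExampleSpider(Spider):
    name = "example"                   # SpiderManager indexes spiders by this name
    start_urls = ["http://example.com/"]

    def parse(self, response):
        # Called by the engine for every downloaded response.
        sel = Selector(response)
        for href in sel.xpath("//a/@href").extract():
            yield Request(urljoin(response.url, href), callback=self.parse)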
crawler.py defines the classes Crawler and CrawlerProcess.
Crawler depends on SignalManager, ExtensionManager and ExecutionEngine, as well as the settings STATS_CLASS, SPIDER_MANAGER_CLASS and LOG_FORMATTER.
CrawlerProcess runs multiple Crawlers sequentially within a single process and starts the crawl; it depends on twisted.internet.reactor and twisted.internet.defer. This class comes up again with cmdline.py in section 1.2.
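To make the division of labour tangible, here is a rough sketch of driving a single Crawler from a script, following the 0.24-era "run Scrapy from a script" recipe from memory; treat the exact calls (configure/crawl/start) and the imported spider as assumptions. CrawlerProcess normally does this reactor bookkeeping for us.

# Rough sketch (0.24-era style) of what CrawlerProcess automates: build a
# Crawler, wire reactor shutdown to a signal, and start the crawl.
# Method names follow the old "run Scrapy from a script" recipe; treat
# them as assumptions rather than a verified listing.
from twisted.internet import reactor

from scrapy.crawler import Crawler
from scrapy import signals
from scrapy.utils.project import get_project_settings

from myproject.spiders.example import ExampleSpider   # hypothetical project spider

settings = get_project_settings()
crawler = Crawler(settings)            # pulls in SignalManager, stats, spider manager, ...
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(ExampleSpider())
crawler.start()
reactor.run()                          # CrawlerProcess.start() normally hides this step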
spidermanager.py defines the class SpiderManager, which creates and manages all of the website-specific spiders.
class SpiderManager(object):

    implements(ISpiderManager)

    def __init__(self, spider_modules):
        self.spider_modules = spider_modules
        self._spiders = {}
        for name in self.spider_modules:
            for module in walk_modules(name):
                self._load_spiders(module)

    def _load_spiders(self, module):
        for spcls in iter_spider_classes(module):
            self._spiders[spcls.name] = spcls
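A short usage sketch of the loading logic above: the manager is constructed with the SPIDER_MODULES package names, walks those modules, and indexes every spider class by its name attribute. The project path and spider name below are invented, and create()/list() reflect the ISpiderManager interface as I recall it in 0.24, so treat them as assumptions.

# Illustration only: the module path and spider name are invented, and
# create()/list() are recalled from the 0.24 ISpiderManager interface.
from scrapy.spidermanager import SpiderManager

manager = SpiderManager(["myproject.spiders"])   # typically the SPIDER_MODULES setting
print(manager.list())                            # names collected by _load_spiders()
spider = manager.create("example")               # instantiate the class registered as "example"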
cmdline.py defines the public function execute(argv=None, settings=None).
execute() is the entry point of the scrapy command-line tool, as the installed script shows:
XiaoKL$ cat `which scrapy`
#!/usr/bin/python

# -*- coding: utf-8 -*-
import re
import sys

from scrapy.cmdline import execute

if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
    sys.exit(execute())
So execute() is a good starting point for digging into the Scrapy source. Here is the function:
def execute(argv=None, settings=None):
    if argv is None:
        argv = sys.argv

    # --- backwards compatibility for scrapy.conf.settings singleton ---
    if settings is None and 'scrapy.conf' in sys.modules:
        from scrapy import conf
        if hasattr(conf, 'settings'):
            settings = conf.settings
    # ------------------------------------------------------------------

    if settings is None:
        settings = get_project_settings()
    check_deprecated_settings(settings)

    # --- backwards compatibility for scrapy.conf.settings singleton ---
    import warnings
    from scrapy.exceptions import ScrapyDeprecationWarning
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", ScrapyDeprecationWarning)
        from scrapy import conf
        conf.settings = settings
    # ------------------------------------------------------------------

    inproject = inside_project()
    cmds = _get_commands_dict(settings, inproject)
    cmdname = _pop_command_name(argv)
    parser = optparse.OptionParser(formatter=optparse.TitledHelpFormatter(), \
        conflict_handler='resolve')
    if not cmdname:
        _print_commands(settings, inproject)
        sys.exit(0)
    elif cmdname not in cmds:
        _print_unknown_command(settings, cmdname, inproject)
        sys.exit(2)

    cmd = cmds[cmdname]
    parser.usage = "scrapy %s %s" % (cmdname, cmd.syntax())
    parser.description = cmd.long_desc()
    settings.setdict(cmd.default_settings, priority='command')
    cmd.settings = settings
    cmd.add_options(parser)
    opts, args = parser.parse_args(args=argv[1:])
    _run_print_help(parser, cmd.process_options, args, opts)

    cmd.crawler_process = CrawlerProcess(settings)
    _run_print_help(parser, _run_command, cmd, args, opts)
    sys.exit(cmd.exitcode)
execute() mainly does the following: discovers and loads the scrapy command modules, parses the command-line arguments, obtains the settings, and creates a CrawlerProcess object.
The CrawlerProcess object, the settings and the parsed command-line options are all handed to the ScrapyCommand (or subclass) instance.
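Since the installed script is nothing more than a thin wrapper, the same entry point can be driven from our own code. A minimal sketch (the spider name is made up):

# Minimal sketch: invoking the same entry point the `scrapy` script uses.
# "example" is a made-up spider name; note that execute() ends by calling
# sys.exit(cmd.exitcode), so this call does not return on completion.
from scrapy.cmdline import execute

execute(["scrapy", "crawl", "example"])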
Naturally, the next thing to look at is the module that defines ScrapyCommand: command.py.
The concrete subclasses of ScrapyCommand live in the scrapy.commands sub-package.
_run_print_help() eventually calls cmd.run() to execute the command, as shown below:
def _run_print_help(parser, func, *a, **kw):
    try:
        func(*a, **kw)
    except UsageError as e:
        if str(e):
            parser.error(str(e))
        if e.print_help:
            parser.print_help()
        sys.exit(2)
Here func is _run_command, whose implementation essentially boils down to calling cmd.run():
def _run_command(cmd, args, opts):
    if opts.profile or opts.lsprof:
        _run_command_profiled(cmd, args, opts)
    else:
        cmd.run(args, opts)
The decoupling between cmdline and the individual commands is a design worth borrowing in our own projects.
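As a design note, the pattern travels well outside Scrapy: a dispatcher that only knows a command registry and a small base-class protocol (add_options/run). The sketch below is our own stand-alone illustration, not Scrapy code:

# Our own stand-alone illustration of the cmdline/commands decoupling;
# not Scrapy code.
import optparse
import sys


class Command(object):
    """Base protocol, mirroring ScrapyCommand's add_options()/run() split."""

    def add_options(self, parser):
        pass

    def run(self, args, opts):
        raise NotImplementedError


class HelloCommand(Command):

    def add_options(self, parser):
        parser.add_option("--name", default="world")

    def run(self, args, opts):
        print("hello %s" % opts.name)


COMMANDS = {"hello": HelloCommand()}   # the dispatcher only ever sees this dict


def execute(argv):
    if len(argv) < 2 or argv[1] not in COMMANDS:
        sys.exit("usage: %s <%s> [options]" % (argv[0], "|".join(COMMANDS)))
    cmd = COMMANDS[argv[1]]
    parser = optparse.OptionParser()
    cmd.add_options(parser)
    opts, args = parser.parse_args(argv[2:])
    cmd.run(args, opts)


if __name__ == "__main__":
    execute(sys.argv)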
command.py: defines the class ScrapyCommand, the base class of all Scrapy commands. A quick look at the interface/methods that ScrapyCommand provides:
class ScrapyCommand(object):

    requires_project = False
    crawler_process = None

    # default settings to be used for this command instead of global defaults
    default_settings = {}

    exitcode = 0

    def __init__(self):
        self.settings = None  # set in scrapy.cmdline

    def set_crawler(self, crawler):
        assert not hasattr(self, '_crawler'), "crawler already set"
        self._crawler = crawler

    @property
    def crawler(self):
        warnings.warn("Command's default `crawler` is deprecated and will be removed. "
                      "Use `create_crawler` method to instatiate crawlers.",
                      ScrapyDeprecationWarning)

        if not hasattr(self, '_crawler'):
            crawler = self.crawler_process.create_crawler()

            old_start = crawler.start
            self.crawler_process.started = False

            def wrapped_start():
                if self.crawler_process.started:
                    old_start()
                else:
                    self.crawler_process.started = True
                    self.crawler_process.start()
            crawler.start = wrapped_start

            self.set_crawler(crawler)

        return self._crawler

    def syntax(self):
        ...

    def short_desc(self):
        ...

    def long_desc(self):
        ...

    def help(self):
        ...

    def add_options(self, parser):
        ...

    def process_options(self, args, opts):
        ...

    def run(self, args, opts):
        ...
Class attributes of ScrapyCommand:
requires_project: whether the command must be run inside a Scrapy project.
crawler_process: the CrawlerProcess object; it is set in execute() in cmdline.py.
Among the methods of ScrapyCommand, the ones worth noting are:
crawler(self): a property that lazily creates the Crawler object.
run(self, args, opts): must be overridden by subclasses; a minimal subclass sketch follows below.
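Before jumping to a real command, here is a hedged sketch of what a minimal subclass could look like in 0.24. The greeting command, its settings override and the idea of wiring it in through the COMMANDS_MODULE setting are illustrative assumptions, not code from the Scrapy tree:

# Hedged sketch of a minimal ScrapyCommand subclass, 0.24-style. The greeting
# command, its default_settings override and the COMMANDS_MODULE wiring are
# illustrative assumptions, not taken from the Scrapy source.
from scrapy.command import ScrapyCommand


class Command(ScrapyCommand):

    requires_project = False
    default_settings = {"LOG_ENABLED": False}   # merged with priority='command' in execute()

    def syntax(self):
        return "[options] <name>"

    def short_desc(self):
        return "Print a greeting (illustration only)"

    def run(self, args, opts):
        name = args[0] if args else "world"
        print("hello %s" % name)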
For a concrete ScrapyCommand subclass, see Python.Scrapy.14-scrapy-source-code-analysis-part-4.
To Be Continued:
Modules to analyze next: signals.py, signalmanager.py, project.py, conf.py. (Python.Scrapy.12-scrapy-source-code-analysis-part-2)