Python.Scrapy.11-scrapy-source-code-analysis-part-1

Scrapy Source Code Analysis Series - 1: spider, spidermanager, crawler, cmdline, command

The source version analyzed is 0.24.6, url: https://github.com/DiamondStudio/scrapy/blob/0.24.6

As shown in the Scrapy source tree on GitHub, the sub-packages are:

commands, contracts, contrib, contrib_exp, core, http, selector, settings, templates, tests, utils, xlib

The modules are:

_monkeypatches.py, cmdline.py, conf.py, conftest.py, crawler.py, dupefilter.py, exceptions.py, extension.py, interface.py, item.py, link.py, linkextractor.py, log.py, logformatter.py, mail.py, middleware.py, project.py, resolver.py, responsetypes.py, shell.py, signalmanager.py, signals.py, spider.py, spidermanager.py, squeue.py, stats.py, statscol.py, telnet.py, webservice.py

Let's start the analysis with the most important modules.

0. Third-party libraries and frameworks that Scrapy depends on

twisted

1. Modules: spider, spidermanager, crawler, cmdline, command

1.1 spider.py spidermanager.py crawler.py

spider.py defines the spider base class, BaseSpider. Each spider instance can have only one crawler attribute. So what does a Crawler actually provide?

crawler.py defines the classes Crawler and CrawlerProcess.

The Crawler class depends on SignalManager, ExtensionManager and ExecutionEngine, as well as the settings STATS_CLASS, SPIDER_MANAGER_CLASS and LOG_FORMATTER.

The CrawlerProcess class runs multiple Crawlers sequentially in a single process and starts the actual crawling. It depends on twisted.internet.reactor and twisted.internet.defer. This class comes up again in section 1.2 when we look at cmdline.py.
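To see how these pieces fit together, here is a minimal sketch of driving a single Crawler from a script, modelled on the "run Scrapy from a script" pattern of the 0.24 docs. MySpider and the myproject package are hypothetical, and the wiring is illustrative rather than copied from the source:

from twisted.internet import reactor
from scrapy import log, signals
from scrapy.crawler import Crawler
from scrapy.utils.project import get_project_settings
from myproject.spiders.my_spider import MySpider  # hypothetical project spider

settings = get_project_settings()
crawler = Crawler(settings)           # wires up SignalManager, ExtensionManager, ExecutionEngine
crawler.signals.connect(reactor.stop, signal=signals.spider_closed)
crawler.configure()
crawler.crawl(MySpider())             # schedule the spider on this crawler
crawler.start()                       # start the ExecutionEngine
log.start()
reactor.run()                         # the twisted reactor drives the whole crawl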

 

spidermanager.py defines the SpiderManager class, which creates and manages all of the website-specific spiders.

# imports used by this excerpt
from zope.interface import implements

from scrapy.interfaces import ISpiderManager
from scrapy.utils.misc import walk_modules
from scrapy.utils.spider import iter_spider_classes


class SpiderManager(object):

    implements(ISpiderManager)

    def __init__(self, spider_modules):
        self.spider_modules = spider_modules
        self._spiders = {}
        for name in self.spider_modules:
            for module in walk_modules(name):
                self._load_spiders(module)

    def _load_spiders(self, module):
        for spcls in iter_spider_classes(module):
            self._spiders[spcls.name] = spcls
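Besides the loading logic shown above, SpiderManager also exposes lookup methods such as list() and create() (omitted from the excerpt). A rough usage sketch, where 'myproject.spiders' and the spider name 'example' are made-up values:

from scrapy.spidermanager import SpiderManager

sm = SpiderManager(['myproject.spiders'])  # same list as the SPIDER_MODULES setting
print(sm.list())                           # names of all spiders found via walk_modules()
spider = sm.create('example')              # instantiate the spider registered as 'example'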

 

 

 

1.2 cmdline.py command.py

cmdline.py defines the public function execute(argv=None, settings=None).

The execute function is the entry point of the scrapy command-line tool, as shown below:

XiaoKL$ cat `which scrapy`
#!/usr/bin/python
# -*- coding: utf-8 -*-
import re
import sys

from scrapy.cmdline import execute

if __name__ == '__main__':
    sys.argv[0] = re.sub(r'(-script\.pyw|\.exe)?$', '', sys.argv[0])
    sys.exit(execute())

 

So execute() is a natural entry point for analyzing the Scrapy source. Here is the execute() function:

def execute(argv=None, settings=None):
    if argv is None:
        argv = sys.argv

    # --- backwards compatibility for scrapy.conf.settings singleton ---
    if settings is None and 'scrapy.conf' in sys.modules:
        from scrapy import conf
        if hasattr(conf, 'settings'):
            settings = conf.settings
    # ------------------------------------------------------------------

    if settings is None:
        settings = get_project_settings()
    check_deprecated_settings(settings)

    # --- backwards compatibility for scrapy.conf.settings singleton ---
    import warnings
    from scrapy.exceptions import ScrapyDeprecationWarning
    with warnings.catch_warnings():
        warnings.simplefilter("ignore", ScrapyDeprecationWarning)
        from scrapy import conf
        conf.settings = settings
    # ------------------------------------------------------------------

    inproject = inside_project()
    cmds = _get_commands_dict(settings, inproject)
    cmdname = _pop_command_name(argv)
    parser = optparse.OptionParser(formatter=optparse.TitledHelpFormatter(), \
        conflict_handler='resolve')
    if not cmdname:
        _print_commands(settings, inproject)
        sys.exit(0)
    elif cmdname not in cmds:
        _print_unknown_command(settings, cmdname, inproject)
        sys.exit(2)

    cmd = cmds[cmdname]
    parser.usage = "scrapy %s %s" % (cmdname, cmd.syntax())
    parser.description = cmd.long_desc()
    settings.setdict(cmd.default_settings, priority='command')
    cmd.settings = settings
    cmd.add_options(parser)
    opts, args = parser.parse_args(args=argv[1:])
    _run_print_help(parser, cmd.process_options, args, opts)

    cmd.crawler_process = CrawlerProcess(settings)
    _run_print_help(parser, _run_command, cmd, args, opts)
    sys.exit(cmd.exitcode)

 

execute() mainly does the following: parses the command line and loads the scrapy command modules, parses the command-line options, obtains the settings, and creates a CrawlerProcess object.

The CrawlerProcess object, the settings and the parsed command-line options are all assigned to the ScrapyCommand (or subclass) object.

Naturally we need to look at the module that defines the ScrapyCommand class: command.py.

The ScrapyCommand subclasses are defined in the scrapy.commands subpackage.
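As a side note, execute() can also be invoked directly with an explicit argv. The snippet below is a hypothetical illustration equivalent to running `scrapy list` inside a project; note that execute() always finishes with sys.exit(cmd.exitcode):

from scrapy.cmdline import execute

try:
    execute(['scrapy', 'list'])        # argv[0] is the program name, 'list' is the command
except SystemExit as exc:
    print("scrapy list exited with code %s" % exc.code)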

 

The _run_print_help() function eventually ends up calling cmd.run() to execute the command, as shown below:

def _run_print_help(parser, func, *a, **kw):
    try:
        func(*a, **kw)
    except UsageError as e:
        if str(e):
            parser.error(str(e))
        if e.print_help:
            parser.print_help()
        sys.exit(2)

 

Here func is _run_command, whose implementation essentially just calls cmd.run():

def _run_command(cmd, args, opts):
    if opts.profile or opts.lsprof:
        _run_command_profiled(cmd, args, opts)
    else:
        cmd.run(args, opts)

 

This decoupling of cmdline from the individual commands is a design worth borrowing for our own tools.
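At its core this is a small command-pattern dispatcher: each command registers under a name, and the front end only talks to the common base-class interface. A generic sketch of the same idea (not Scrapy code; all names are made up):

class Command(object):
    """Base class: the dispatcher only ever sees this interface."""
    def add_options(self, parser):
        pass

    def run(self, args):
        raise NotImplementedError


class GreetCommand(Command):
    def run(self, args):
        print("hello %s" % " ".join(args))


COMMANDS = {'greet': GreetCommand}    # name -> command class, like _get_commands_dict()


def dispatch(argv):
    name, rest = argv[0], argv[1:]
    cmd = COMMANDS[name]()            # look up and instantiate, like cmds[cmdname]
    cmd.run(rest)                     # the front end never touches concrete command logic


dispatch(['greet', 'world'])          # prints "hello world"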

 

command.py defines the ScrapyCommand class, which serves as the base class for all Scrapy commands. A quick look at the interface/methods that ScrapyCommand provides:

class ScrapyCommand(object):

    requires_project = False
    crawler_process = None

    # default settings to be used for this command instead of global defaults
    default_settings = {}

    exitcode = 0

    def __init__(self):
        self.settings = None  # set in scrapy.cmdline

    def set_crawler(self, crawler):
        assert not hasattr(self, '_crawler'), "crawler already set"
        self._crawler = crawler

    @property
    def crawler(self):
        warnings.warn("Command's default `crawler` is deprecated and will be removed. "
            "Use `create_crawler` method to instatiate crawlers.",
            ScrapyDeprecationWarning)

        if not hasattr(self, '_crawler'):
            crawler = self.crawler_process.create_crawler()

            old_start = crawler.start
            self.crawler_process.started = False

            def wrapped_start():
                if self.crawler_process.started:
                    old_start()
                else:
                    self.crawler_process.started = True
                    self.crawler_process.start()

            crawler.start = wrapped_start

            self.set_crawler(crawler)

        return self._crawler

    def syntax(self):
        pass  # body omitted in this excerpt

    def short_desc(self):
        pass  # body omitted in this excerpt

    def long_desc(self):
        pass  # body omitted in this excerpt

    def help(self):
        pass  # body omitted in this excerpt

    def add_options(self, parser):
        pass  # body omitted in this excerpt

    def process_options(self, args, opts):
        pass  # body omitted in this excerpt

    def run(self, args, opts):
        pass  # body omitted in this excerpt; subclasses must override this

 

Class attributes of ScrapyCommand:

requires_project: whether the command must be run inside a Scrapy project
crawler_process: the CrawlerProcess object, set in the execute() function of cmdline.py

Among the methods of ScrapyCommand, the ones to focus on are:

def crawler(self): lazily creates the Crawler object.
def run(self, args, opts): must be overridden and implemented by subclasses.
 

Next, let's look at a concrete ScrapyCommand subclass (see Python.Scrapy.14-scrapy-source-code-analysis-part-4).
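Before that, here is a minimal sketch of what such a subclass could look like; HelloCommand is a hypothetical example, and only the base class and the overridden hooks follow the interface shown above:

from scrapy.command import ScrapyCommand    # base class module in the 0.24 layout


class HelloCommand(ScrapyCommand):          # hypothetical command, not part of Scrapy

    requires_project = False
    default_settings = {'LOG_ENABLED': False}

    def syntax(self):
        return "[options]"

    def short_desc(self):
        return "Print a greeting (illustrative example)"

    def run(self, args, opts):
        print("hello from a custom command")
        self.exitcode = 0                   # returned to the shell via sys.exit(cmd.exitcode)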

 

To Be Continued

Coming up next: analysis of the modules signals.py, signalmanager.py, project.py and conf.py, in Python.Scrapy.12-scrapy-source-code-analysis-part-2.
