(Repost) Digging a Little Deeper into Scrapy

The more I use Scrapy, the more convenient it feels, so here are some more notes on it.

1. Scrapy is built on the Twisted framework (http://twistedmatrix.com/trac/). Once PyBrain is out of the way, I'd like to dig deeper into Twisted as well.

    Twisted is an event-driven networking engine written in Python and licensed under the open source MIT license.
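
    A tiny sketch of Twisted's event-driven style (not Scrapy code; just to illustrate what "event-driven" means here):

    from twisted.internet import reactor

    def say_hello():
        # Invoked by the reactor (the event loop) one second after it starts.
        print('hello from the reactor')
        reactor.stop()

    # Register the callback, then hand control over to the event loop.
    reactor.callLater(1, say_hello)
    reactor.run()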

2. The previous post left out quite a few notes, so here is a supplement:

* `scrapy startproject xxx` creates a new project named xxx
* `scrapy crawl xxx` starts the crawl; must be run inside a project
* `scrapy shell url` opens the url in the Scrapy shell; extremely handy
* `scrapy runspider <spider_file.py>` runs a spider without needing a project (see the sketch below)
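
A minimal sketch of a self-contained spider file that could be run this way (the file name and URL are made up):

    # standalone_spider.py -- run with: scrapy runspider standalone_spider.py
    from scrapy.spider import BaseSpider

    class StandaloneSpider(BaseSpider):
        name = 'standalone'
        start_urls = ['http://www.example.com']

        def parse(self, response):
            # Just log the page that was fetched.
            self.log('Fetched %s' % response.url)
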
3. `scrapy crawl xxx -a category=xxx` passes arguments to the spider (had I known this earlier, my JD.com crawler wouldn't have turned into such a mess, sigh). Pick the argument up in the spider's `__init__`, e.g. `def __init__(self, category=None):` (see the sketch below).
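
    A minimal sketch of a spider receiving the `category` argument (the class name and URL are made up):

    from scrapy.spider import BaseSpider

    class CategorySpider(BaseSpider):
        name = 'category_spider'

        def __init__(self, category=None, *args, **kwargs):
            super(CategorySpider, self).__init__(*args, **kwargs)
            # `-a category=books` on the command line ends up here.
            self.start_urls = ['http://www.example.com/%s' % category]

        def parse(self, response):
            self.log('Crawling category page %s' % response.url)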

4. The first Request objects are generated by the `make_requests_from_url` function, with `callback=self.parse` by default (a sketch of customising this is below).
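
    For illustration, a sketch of overriding `make_requests_from_url` to give the initial requests a custom callback (names are made up):

    from scrapy.http import Request
    from scrapy.spider import BaseSpider

    class FirstRequestSpider(BaseSpider):
        name = 'first_request'
        start_urls = ['http://www.example.com']

        def make_requests_from_url(self, url):
            # Every URL in start_urls passes through here when the spider starts.
            return Request(url, callback=self.parse_front_page)

        def parse_front_page(self, response):
            self.log('First response: %s' % response.url)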

5. Besides BaseSpider, there are many spider classes you can subclass directly, for example `class scrapy.contrib.spiders.CrawlSpider`:

    This is the most commonly used spider for crawling regular websites, as it provides a convenient mechanism for following links by defining a set of rules. It may not be the best suited for your particular web sites or project, but it’s generic enough for several cases, so you can start from it and override it as needed for more custom functionality, or just implement your own spider.

    Compared with BaseSpider, this one adds a `rules` attribute; through these Rules we can choose which kinds of URLs get crawled. Example code:

    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
    from scrapy.selector import HtmlXPathSelector
    from scrapy.item import Item, Field

    # A bare Item has no fields, so declare the ones used below.
    class MyItem(Item):
        id = Field()
        name = Field()
        description = Field()

    class MySpider(CrawlSpider):
        name = 'example.com'
        allowed_domains = ['example.com']
        start_urls = ['http://www.example.com']

        rules = (
            # Extract links matching 'category.php' (but not matching 'subsection.php')
            # and follow links from them (since no callback means follow=True by default).
            Rule(SgmlLinkExtractor(allow=(r'category\.php', ), deny=(r'subsection\.php', ))),

            # Extract links matching 'item.php' and parse them with the spider's method parse_item
            Rule(SgmlLinkExtractor(allow=(r'item\.php', )), callback='parse_item'),
        )

        def parse_item(self, response):
            self.log('Hi, this is an item page! %s' % response.url)

            hxs = HtmlXPathSelector(response)
            item = MyItem()
            item['id'] = hxs.select('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
            item['name'] = hxs.select('//td[@id="item_name"]/text()').extract()
            item['description'] = hxs.select('//td[@id="item_description"]/text()').extract()
            return item

    XMLFeedSpider: a spider designed for parsing XML feeds by iterating over their nodes. Example:

    from scrapy import log
    from scrapy.contrib.spiders import XMLFeedSpider
    from myproject.items import TestItem

    class MySpider(XMLFeedSpider):
        name = 'example.com'
        allowed_domains = ['example.com']
        start_urls = ['http://www.example.com/feed.xml']
        iterator = 'iternodes' # This is actually unnecessary, since it's the default value
        itertag = 'item'
    
        def parse_node(self, response, node):
            log.msg('Hi, this is a <%s> node!: %s' % (self.itertag, ''.join(node.extract())))
    
            item = TestItem()  # TestItem (imported above) is assumed to declare the fields used below
            item['id'] = node.select('@id').extract()
            item['name'] = node.select('name').extract()
            item['description'] = node.select('description').extract()
            return item

    There are also CSVFeedSpider, SitemapSpider, and various other spiders for different needs under scrapy.contrib.spiders; a rough SitemapSpider sketch follows.
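
    A SitemapSpider sketch (URLs and rules are made up):

    from scrapy.contrib.spiders import SitemapSpider

    class MySitemapSpider(SitemapSpider):
        name = 'sitemap_example'
        # Sitemaps to start from; a robots.txt URL also works here.
        sitemap_urls = ['http://www.example.com/sitemap.xml']
        # Route sitemap entries to callbacks by URL pattern.
        sitemap_rules = [
            ('/product/', 'parse_product'),
        ]

        def parse_product(self, response):
            self.log('Product page: %s' % response.url)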

6. Scrapy also provides a server version, scrapyd, which makes it easy to upload and manage crawl jobs (see the sketch below).
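
    For instance, a sketch of scheduling a spider through scrapyd's HTTP API, assuming scrapyd is running locally on its default port and the project has already been deployed (project and spider names are made up):

    import urllib
    import urllib2

    # scrapyd exposes a small JSON API; POSTing to schedule.json queues a crawl job.
    data = urllib.urlencode({'project': 'myproject', 'spider': 'example.com'})
    response = urllib2.urlopen('http://localhost:6800/schedule.json', data)
    print(response.read())  # e.g. {"status": "ok", "jobid": "..."}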
