I'm appreciating Scrapy's convenience more and more, so below I continue my notes on Scrapy.

Scrapy is built on top of the Twisted framework (http://twistedmatrix.com/trac/); once I'm done with PyBrain I'd like to dig deeper into Twisted as well.
Twisted is an event-driven networking engine written in Python and licensed under the open source MIT license.
1. The previous post left out quite a few details, so here are some additions.
* `scrapy startproject xxx` creates a new project named xxx
* `scrapy crawl xxx` starts crawling; it must be run inside a project
* `scrapy shell url` opens the url in the Scrapy shell, which is very handy
* `scrapy runspider <spider_file.py>` runs a spider without a project (a minimal sketch follows this list)
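As a quick illustration of `scrapy runspider`, here is a minimal sketch of a standalone spider file; the file name, spider name and URL are placeholders, not from the original post:

```python
# standalone_spider.py -- run with: scrapy runspider standalone_spider.py
from scrapy.spider import BaseSpider


class StandaloneSpider(BaseSpider):
    name = 'standalone'
    start_urls = ['http://www.example.com']

    def parse(self, response):
        # Simply log the URL of each fetched page.
        self.log('Visited %s' % response.url)
```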
`scrapy crawl xxx -a category=xxx` passes an argument to the spider (had I known this earlier, my JD crawler wouldn't have turned into such a mess, sigh). Declare `def __init__(self, category=None):` to receive the argument in the spider's init function.

The first Request objects are generated by the make_requests_from_url function, with callback=self.parse.
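Putting these two points together, here is a minimal sketch of a spider that receives `category` via `-a` and then lets the default `make_requests_from_url` / `parse` flow do the rest; the spider name and URL pattern are assumptions, not from the original post:

```python
from scrapy.spider import BaseSpider


class CategorySpider(BaseSpider):
    name = 'category_spider'
    allowed_domains = ['example.com']

    def __init__(self, category=None, *args, **kwargs):
        super(CategorySpider, self).__init__(*args, **kwargs)
        # The value passed on the command line with -a category=xxx arrives here.
        self.category = category
        # Each of these URLs is turned into a Request by make_requests_from_url,
        # with callback=self.parse.
        self.start_urls = ['http://www.example.com/%s' % category]

    def parse(self, response):
        self.log('Crawling category %s: %s' % (self.category, response.url))
```

It would then be run as `scrapy crawl category_spider -a category=books` (spider and category names here are illustrative).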
Besides BaseSpider, there are many other spiders you can subclass directly, for example class scrapy.contrib.spiders.CrawlSpider:
This is the most commonly used spider for crawling regular websites, as it provides a convenient mechanism for following links by defining a set of rules. It may not be the best suited for your particular web sites or project, but it’s generic enough for several cases, so you can start from it and override it as needed for more custom functionality, or just implement your own spider.
Compared with BaseSpider it adds a rules attribute; through these Rules we can choose which URL patterns to follow. Sample code:
```python
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from scrapy.item import Item


class MySpider(CrawlSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com']

    rules = (
        # Extract links matching 'category.php' (but not matching 'subsection.php')
        # and follow links from them (since no callback means follow=True by default).
        Rule(SgmlLinkExtractor(allow=('category\.php', ), deny=('subsection\.php', ))),

        # Extract links matching 'item.php' and parse them with the spider's method parse_item
        Rule(SgmlLinkExtractor(allow=('item\.php', )), callback='parse_item'),
    )

    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)

        hxs = HtmlXPathSelector(response)
        item = Item()
        item['id'] = hxs.select('//td[@id="item_id"]/text()').re(r'ID: (\d+)')
        item['name'] = hxs.select('//td[@id="item_name"]/text()').extract()
        item['description'] = hxs.select('//td[@id="item_description"]/text()').extract()
        return item
```
XMLFeedSpider:

```python
from scrapy import log
from scrapy.contrib.spiders import XMLFeedSpider
from myproject.items import TestItem


class MySpider(XMLFeedSpider):
    name = 'example.com'
    allowed_domains = ['example.com']
    start_urls = ['http://www.example.com/feed.xml']
    iterator = 'iternodes'  # This is actually unnecessary, since it's the default value
    itertag = 'item'

    def parse_node(self, response, node):
        log.msg('Hi, this is a <%s> node!: %s' % (self.itertag, ''.join(node.extract())))

        item = TestItem()
        item['id'] = node.select('@id').extract()
        item['name'] = node.select('name').extract()
        item['description'] = node.select('description').extract()
        return item
```
There are also CSVFeedSpider, SitemapSpider and various other spiders for different needs in scrapy.contrib.spiders.
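For instance, a SitemapSpider only needs the sitemap URLs and a mapping from URL patterns to callbacks. A minimal sketch, with placeholder URLs and names not taken from the original post:

```python
from scrapy.contrib.spiders import SitemapSpider


class MySitemapSpider(SitemapSpider):
    name = 'sitemap_example'
    # Sitemaps to download and scan for page URLs.
    sitemap_urls = ['http://www.example.com/sitemap.xml']
    # Map URLs found in the sitemap to callbacks by regex.
    sitemap_rules = [
        ('/product/', 'parse_product'),
    ]

    def parse_product(self, response):
        self.log('Product page: %s' % response.url)
```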
Scrapy also provides a server version, scrapyd, which makes it easy to upload and manage crawl jobs.
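A rough idea of how scrapyd is used, assuming it is installed and a project named myproject with a spider named example.com has already been deployed (both names are placeholders):

```bash
# Start the scrapyd server; by default it listens on port 6800.
scrapyd

# Schedule a run of the deployed spider through scrapyd's JSON API.
curl http://localhost:6800/schedule.json -d project=myproject -d spider=example.com
```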