Official documentation: http://doc.scrapy.org/en/latest/
GitHub examples: https://github.com/search?utf8=%E2%9C%93&q=scrapy
(1) Basics -- scrapy.spider.Spider
    dizzy@dizzy-pc:~$ scrapy shell "http://www.baidu.com/"
    2014-08-21 04:09:11+0800 [scrapy] INFO: Scrapy 0.24.4 started (bot: scrapybot)
    2014-08-21 04:09:11+0800 [scrapy] INFO: Optional features available: ssl, http11, django
    2014-08-21 04:09:11+0800 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0}
    2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
    2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
    2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
    2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled item pipelines:
    2014-08-21 04:09:11+0800 [scrapy] DEBUG: Telnet console listening on
    2014-08-21 04:09:11+0800 [scrapy] DEBUG: Web service listening on
    2014-08-21 04:09:11+0800 [default] INFO: Spider opened
    2014-08-21 04:09:12+0800 [default] DEBUG: Crawled (200) <GET http://www.baidu.com/> (referer: None)
    [s] Available Scrapy objects:
    [s]   crawler    <scrapy.crawler.Crawler object at 0xa483cec>
    [s]   item       {}
    [s]   request    <GET http://www.baidu.com/>
    [s]   response   <200 http://www.baidu.com/>
    [s]   settings   <scrapy.settings.Settings object at 0xa0de78c>
    [s]   spider     <Spider 'default' at 0xa78086c>
    [s] Useful shortcuts:
    [s]   shelp()            Shell help (print this help)
    [s]   fetch(req_or_url)  Fetch request (or URL) and update local objects
    [s]   view(response)     View response in a browser
    >>> # response.body holds the entire content that was returned
    >>> # response.xpath('//ul/li') lets you try out any XPath expression
More important, if you type response.selector you will access a selector object you can use to
query the response, and convenient shortcuts like response.xpath() and response.css() mapping to
response.selector.xpath() and response.selector.css().
    scrapy shell 'http://scrapy.org' --nolog    # the --nolog option suppresses log output
    from scrapy import Spider
    from scrapy_test.items import DmozItem


    class DmozSpider(Spider):
        name = 'dmoz'
        allowed_domains = ['dmoz.org']
        start_urls = [
            'http://www.dmoz.org/Computers/Programming/Languages/Python/Books/',
            'http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/',
        ]

        def parse(self, response):
            # Each <li> inside a <ul> becomes one item.
            for sel in response.xpath('//ul/li'):
                item = DmozItem()
                item['title'] = sel.xpath('a/text()').extract()
                item['link'] = sel.xpath('a/@href').extract()
                item['desc'] = sel.xpath('text()').extract()
                yield item
You can use the following command to save the scraped items to a file. The format can be json, xml, or csv.
    scrapy crawl dmoz -o 'a.json' -t 'json'
    scrapy genspider baidu baidu.com

    # -*- coding: utf-8 -*-
    import scrapy


    class BaiduSpider(scrapy.Spider):
        name = "baidu"
        allowed_domains = ["baidu.com"]
        start_urls = (
            'http://www.baidu.com/',
        )

        def parse(self, response):
            pass
(2) Advanced -- scrapy.contrib.spiders.CrawlSpider
class scrapy.contrib.spiders.CrawlSpider

This is the most commonly used spider for crawling regular websites, as it provides a convenient mechanism for following links by defining a set of rules. It may not be the best suited for your particular web sites or project, but it’s generic enough for several cases, so you can start from it and override it as needed for more custom functionality, or just implement your own spider.

Apart from the attributes inherited from Spider (that you must specify), this class supports a new attribute:

rules
    Which is a list of one (or more) Rule objects. Each Rule defines a certain behaviour for crawling the site. Rules objects are described below. If multiple rules match the same link, the first one will be used, according to the order they’re defined in this attribute.

This spider also exposes an overrideable method:

parse_start_url(response)
    This method is called for the start_urls responses. It allows to parse the initial responses and must return either a Item object, a Request object, or an iterable containing any of them.
    # coding=utf-8
    import scrapy
    from scrapy.contrib.spiders import CrawlSpider, Rule
    from scrapy.contrib.linkextractors import LinkExtractor


    class TestItem(scrapy.Item):
        # A bare scrapy.Item() declares no fields, so item['id'] = ... would raise KeyError.
        id = scrapy.Field()
        name = scrapy.Field()
        description = scrapy.Field()


    class TestSpider(CrawlSpider):
        name = 'test'
        allowed_domains = ['example.com']
        start_urls = ['http://www.example.com/']
        rules = (  # a tuple of Rule objects
            # Follow category pages but skip subsection pages (no callback: links are only followed).
            Rule(LinkExtractor(allow=(r'category\.php', ), deny=(r'subsection\.php', ))),
            # Hand item pages to parse_item.
            Rule(LinkExtractor(allow=(r'item\.php', )), callback='parse_item'),
        )

        def parse_item(self, response):
            self.log('item page : %s' % response.url)
            item = TestItem()
            item['id'] = response.xpath('//td[@id="item_id"]/text()').re(r'ID:(\d+)')
            item['name'] = response.xpath('//td[@id="item_name"]/text()').extract()
            item['description'] = response.xpath('//td[@id="item_description"]/text()').extract()
            return item
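The documentation quoted above says that when several rules match the same link, the first one in the list wins. The following is only a plain-Python sketch of that "first match wins" idea, not Scrapy's actual implementation; the rule table and the `pick_callback` helper are hypothetical names made up for illustration.

```python
import re

# Hypothetical rule table: (compiled pattern, callback name), in priority order.
RULES = [
    (re.compile(r"item\.php"), "parse_item"),
    (re.compile(r"\.php"), "parse_generic"),
]


def pick_callback(url, rules=RULES):
    """Return the callback name of the first rule whose pattern matches url."""
    for pattern, callback in rules:
        if pattern.search(url):
            return callback
    # No rule matched: the link would not be handled at all.
    return None


if __name__ == "__main__":
    for url in ("http://www.example.com/item.php?id=1",
                "http://www.example.com/category.php",
                "http://www.example.com/index.html"):
        print(url, "->", pick_callback(url))
```

Both patterns match the item.php URL, but only the first rule's callback is chosen, which is why more specific rules should usually come before more general ones.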
There are other spiders as well, such as XMLFeedSpider; I will look into them when I have time.
    class scrapy.contrib.spiders.XMLFeedSpider
    class scrapy.contrib.spiders.CSVFeedSpider
    class scrapy.contrib.spiders.SitemapSpider
(3) Selectors
>>> from scrapy.selector import Selector
>>> from scrapy.http import HtmlResponse
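.css() and .xpath() can be combined flexibly to pick out the target data quickly.
Note: selectors deserve careful study. Besides xpath() and css(), I still need to get more familiar with regular expressions.
When selecting by class, prefer css() for the selection, and then use xpath() to pick out the element's attributes.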
(4) Item Pipeline
After an item has been scraped by a spider, it is sent to the Item Pipeline, which processes it through several components that are executed sequentially.
Typical uses of item pipelines are:
• cleansing HTML data
• validating scraped data (checking that the items contain certain fields)
• checking for duplicates (and dropping them)
• storing the scraped item in a database
    from scrapy.exceptions import DropItem


    class PricePipeline(object):
        vat_factor = 1.5

        def process_item(self, item, spider):
            if item['price']:
                # Add VAT only when the scraped price excludes it.
                if item['price_excludes_vat']:
                    item['price'] *= self.vat_factor
                # Return the item so later pipeline components receive it.
                return item
            else:
                raise DropItem('Missing price in %s' % item)
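    import json


    class JsonWriterPipeline(object):
        def __init__(self):
            # Text mode, because json.dumps() returns a string.
            self.file = open('json.jl', 'w')

        def process_item(self, item, spider):
            # Write one JSON object per line (JSON Lines).
            line = json.dumps(dict(item)) + '\n'
            self.file.write(line)
            return item

        def close_spider(self, spider):
            # Close the output file when the spider finishes.
            self.file.close()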
    from scrapy.exceptions import DropItem


    class Duplicates(object):
        def __init__(self):
            self.ids_seen = set()

        def process_item(self, item, spider):
            if item['id'] in self.ids_seen:
                raise DropItem('Duplicate item found : %s' % item)
            else:
                self.ids_seen.add(item['id'])
                return item
Writing the data to a database should be just as simple: inside the process_item method, store the item in the database.
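A minimal sketch of that idea, using SQLite from the standard library, could look like this. The table name, the two columns (id, name), and the default in-memory database path are all assumptions made for the example; a real project would match its own item fields and database.

```python
import sqlite3


class SQLitePipeline(object):
    """Stores each item in a SQLite table (hypothetical schema: id, name)."""

    def __init__(self, db_path=":memory:"):
        self.db_path = db_path
        self.conn = None

    def open_spider(self, spider):
        # Called when the spider starts: open the connection and create the table.
        self.conn = sqlite3.connect(self.db_path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS items (id TEXT PRIMARY KEY, name TEXT)"
        )

    def process_item(self, item, spider):
        # Parameterized query: values are passed separately from the SQL text.
        self.conn.execute(
            "INSERT OR REPLACE INTO items (id, name) VALUES (?, ?)",
            (item["id"], item["name"]),
        )
        self.conn.commit()
        return item

    def close_spider(self, spider):
        # Called when the spider finishes: release the connection.
        self.conn.close()
```

Committing after every item keeps the sketch simple; for large crawls, batching the commits would likely be faster.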
