The most basic part of a crawler is downloading web pages; the most important part is filtering, i.e. extracting the information we need.
Scrapy provides exactly this functionality:
First, we need to define the Items:
"Items are containers that will be loaded with the scraped data; they work like simple Python dicts but provide additional protection against populating undeclared fields, to prevent typos."
Quoted from the official site. Roughly, Items are the data structures used to store the scraped data; compared with a plain Python dict they provide extra protection against populating undeclared fields. (The concrete protection mechanism is worth studying further; a quick check of it follows the items.py example below.)
Example:
project/items.py
import scrapy

class DmozItem(scrapy.Item):
    title = scrapy.Field()
    link = scrapy.Field()
    desc = scrapy.Field()
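As a quick check of the protection mentioned above, here is a minimal sketch (assuming the DmozItem class defined in items.py and an assumed import path of project.items): declared fields behave like dict keys, while assigning an undeclared field raises a KeyError.

from project.items import DmozItem  # assumed import path; adjust to your project name

item = DmozItem(title="Example book")
print(item['title'])                    # declared field, works like a dict: "Example book"
item['link'] = "http://example.com"     # assigning a declared field is fine
try:
    item['price'] = 10                  # 'price' was never declared on DmozItem
except KeyError as err:
    print(err)                          # "DmozItem does not support field: price"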
Next we need to write a Spider that crawls the pages, selects the information, and puts it into the Items.
Example:
import scrapy

class DmozSpider(scrapy.Spider):
    name = "dmoz"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/"
    ]

    def parse(self, response):
        filename = response.url.split("/")[-2]
        with open(filename, 'wb') as f:
            f.write(response.body)
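Note that the parse() above only writes each downloaded page to a local file; it does not yet select anything or populate the Items defined earlier. A rough sketch of that next step is shown below; the spider name, the import path, and the XPath expressions are assumptions about the project layout and the DMOZ page structure, not something taken from the code above.

import scrapy
from project.items import DmozItem  # assumed import path; adjust to your project name

class DmozItemSpider(scrapy.Spider):  # hypothetical spider name
    name = "dmoz_items"
    allowed_domains = ["dmoz.org"]
    start_urls = [
        "http://www.dmoz.org/Computers/Programming/Languages/Python/Books/",
    ]

    def parse(self, response):
        # Assumed page layout: each directory entry is an <li> containing a link.
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item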
Notes:
name: the Spider's name; it must be unique within the project (the reason will become clear shortly).
allowed_domains: the allowed domains, i.e. it controls whether pages on other domains are crawled; normally just set it to the domains of the URLs in start_urls.
Start the spider:
scrapy crawl dmoz
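Once parse() yields Items instead of writing files, the same command can export them with Scrapy's built-in feed export, for example to JSON (the filename items.json is just an example):

scrapy crawl dmoz -o items.json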
If everything works normally, the output looks like this:
2014-01-23 18:13:07-0400 [scrapy] INFO: Scrapy started (bot: tutorial)
2014-01-23 18:13:07-0400 [scrapy] INFO: Optional features available: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Overridden settings: {}
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled extensions: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled downloader middlewares: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled spider middlewares: ...
2014-01-23 18:13:07-0400 [scrapy] INFO: Enabled item pipelines: ...
2014-01-23 18:13:07-0400 [dmoz] INFO: Spider opened
2014-01-23 18:13:08-0400 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/> (referer: None)
2014-01-23 18:13:09-0400 [dmoz] DEBUG: Crawled (200) <GET http://www.dmoz.org/Computers/Programming/Languages/Python/Books/> (referer: None)
2014-01-23 18:13:09-0400 [dmoz] INFO: Closing spider (finished)
To be continued.