This article is the author's original work; please credit the source when reposting (silvasong: http://my.oschina.net/sojie/admin/edit-blog?blog=653199).
The previous articles gave a brief analysis of Scrapy's source code; here I will show how to use Scrapy through a simple example.
Once you have decided which site to crawl, the first task is to analyze the site's structure and pick the entry URLs. In most cases the site's home page is used as the starting link.
While analyzing Yihaodian I found that it provides an all-products category page (http://www.yhd.com/marketing/allproduct.html) from which every product category can be collected; following each category link then leads to the products under that category.
Development environment:
Ubuntu, Python 2.7, Scrapy
Scrapy runs on Windows, Mac, and Linux; for convenience I chose Ubuntu here. Scrapy is built on Python, so Python itself must be installed first; the last step is installing Scrapy.
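With Python in place, Scrapy can typically be installed from PyPI with the command below (on Ubuntu this may also require build dependencies such as the libxml2/libxslt development packages):

pip install scrapy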
With the environment set up, the following sections walk through the implementation step by step:
1. First, create a crawler project with the command scrapy startproject yhd.
Running the command above generates a file structure like the one below (the tutorial project name used in the Scrapy docs is replaced by yhd here).
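For reference, the layout produced by scrapy startproject looks roughly like this (the exact files can vary slightly between Scrapy versions):

yhd/
    scrapy.cfg
    yhd/
        __init__.py
        items.py
        pipelines.py
        settings.py
        spiders/
            __init__.py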
scrapy.cfg    Scrapy configuration file; the defaults can be left unchanged.
items.py      defines the data structures used to store the scraped data.
pipelines.py  Scrapy pipelines, used to persist the data.
spiders/      the folder containing the spiders you write yourself.
settings.py   project settings.
2. Write items.py. Here I define YhdItem, a subclass of scrapy.Item that declares the fields to be crawled.
import scrapy


class YhdItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()        # product name
    price = scrapy.Field()        # product price
    link = scrapy.Field()         # product URL
    category = scrapy.Field()     # product category
    product_id = scrapy.Field()   # product ID
    img_link = scrapy.Field()     # image URL
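An Item behaves much like a dict, so the fields declared above can be filled and read with ordinary dict syntax; a quick illustrative sketch (the sample values and product URL are made up):

from yhd.items import YhdItem

item = YhdItem(title=u'sample product', price=u'9.90')  # keyword init, like a dict
item['link'] = 'http://item.yhd.com/item/123'           # hypothetical product URL
print(dict(item))                                       # plain dict of the stored fields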
3. Write pipelines.py.
To persist the data in MongoDB, I wrote a MongoPipeline.
import pymongo

from yhd.items import YhdItem


class MongoPipeline(object):
    collection_name = 'product'

    def __init__(self, mongo_uri, mongo_db):
        self.mongo_uri = mongo_uri
        self.mongo_db = mongo_db

    @classmethod
    def from_crawler(cls, crawler):
        # read the Mongo connection settings from settings.py
        return cls(
            mongo_uri=crawler.settings.get('MONGO_URI'),
            mongo_db=crawler.settings.get('MONGO_DB')
        )

    def open_spider(self, spider):
        # connect to Mongo and select the database by URI
        self.client = pymongo.MongoClient(self.mongo_uri)
        self.db = self.client[self.mongo_db]

    def close_spider(self, spider):
        self.client.close()

    def process_item(self, item, spider):
        # process_item writes the scraped data into Mongo
        if isinstance(item, YhdItem):
            self.db[self.collection_name].insert(dict(item))
        else:
            self.db['product_price'].insert(dict(item))
        return item
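from_crawler lets the pipeline read MONGO_URI and MONGO_DB from the project settings during a crawl; it can also be instantiated directly for a quick sanity check, as in this sketch (assuming a local mongod on the default port):

from yhd.items import YhdItem
from yhd.pipelines import MongoPipeline

pipeline = MongoPipeline(mongo_uri='127.0.0.1', mongo_db='yhd')
pipeline.open_spider(spider=None)                          # spider is not used here
pipeline.process_item(YhdItem(title=u'test'), spider=None) # inserts into the product collection
pipeline.close_spider(spider=None)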
4. Write spider.py under the spiders folder.
In spider.py, regular expressions are used to match the URLs to crawl, and XPath is used to extract the data from the HTML.
from scrapy.http import Request
from scrapy.linkextractors import LinkExtractor as le
from scrapy.spiders import CrawlSpider, Rule

from yhd.items import YhdItem


class YHDSpider(CrawlSpider):
    name = 'yhd'
    allowed_domains = ['yhd.com']
    start_urls = [
        'http://www.yhd.com/marketing/allproduct.html'  # seed URL
    ]
    # match the URLs to crawl with regular expressions
    rules = [
        Rule(le(allow=('http://www.yhd.com/marketing/allproduct.html')), follow=True),
        Rule(le(allow=('^http://list.yhd.com/c.*//$')), follow=True),
        Rule(le(allow=('^http://list.yhd.com/c.*/b/a\d*-s1-v4-p\d+-price-d0-f0d-m1-rt0-pid-mid0-k/$')), follow=True),
        Rule(le(allow=('^http://item.yhd.com/item/\d+$')), callback='parse_product')
    ]

    def parse_product(self, response):
        item = YhdItem()  # create a YhdItem instance
        # extract the fields from the HTML with XPath
        item['title'] = response.xpath('//h1[@id="productMainName"]/text()').extract()
        price_str = response.xpath('//a[@class="ico_sina"]/@href').extract()[0]
        item['price'] = price_str
        item['link'] = response.url
        pmld = response.url.split('/')[-1]
        price_url = 'http://gps.yhd.com/restful/detail?mcsite=1&provinceId=12&pmId=' + pmld
        item['category'] = response.xpath('//div[@class="crumb clearfix"]/a[contains(@onclick,"detail_BreadcrumbNav_cat")]/text()').extract()
        item['product_id'] = response.xpath('//p[@id="pro_code"]/text()').extract()
        item['img_link'] = response.xpath('//img[@id="J_prodImg"]/@src').extract()[0]
        # the price is loaded asynchronously, so fetch it separately by product ID
        request = Request(price_url, callback=self.parse_price)
        request.meta['item'] = item
        yield request

    def parse_price(self, response):
        item = response.meta['item']
        item['price'] = response.body
        return item

    def _process_request(self, request):
        return request
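While tuning the XPath expressions, a single item-page callback can be exercised on its own with Scrapy's built-in parse command (the product URL below is only a hypothetical example):

scrapy parse --spider=yhd -c parse_product 'http://item.yhd.com/item/1234567'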
5. Write the settings file, settings.py.
# -*- coding: utf-8 -*-

# Scrapy settings for yhd project
#
# For simplicity, this file contains only settings considered important or
# commonly used. You can find more settings consulting the documentation:
#
#     http://doc.scrapy.org/en/latest/topics/settings.html
#     http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#     http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html

BOT_NAME = 'yhd'

SPIDER_MODULES = ['yhd.spiders']   # spider modules
NEWSPIDER_MODULE = 'yhd.spiders'

# Crawl responsibly by identifying yourself (and your website) on the user-agent
#USER_AGENT = 'yhd (+http://www.yourdomain.com)'

# Configure maximum concurrent requests performed by Scrapy (default: 16)
#CONCURRENT_REQUESTS=32

# Configure a delay for requests for the same website (default: 0)
# See http://scrapy.readthedocs.org/en/latest/topics/settings.html#download-delay
# See also autothrottle settings and docs
#DOWNLOAD_DELAY=3
# The download delay setting will honor only one of:
#CONCURRENT_REQUESTS_PER_DOMAIN=16
#CONCURRENT_REQUESTS_PER_IP=16

# Disable cookies (enabled by default)
#COOKIES_ENABLED=False

# Disable Telnet Console (enabled by default)
#TELNETCONSOLE_ENABLED=False

# Override the default request headers:
#DEFAULT_REQUEST_HEADERS = {
#    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
#    'Accept-Language': 'en',
#}

# Enable or disable spider middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/spider-middleware.html
#SPIDER_MIDDLEWARES = {
#    'yhd.middlewares.MyCustomSpiderMiddleware': 543,
#}

# Enable or disable downloader middlewares
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html
#DOWNLOADER_MIDDLEWARES = {
#    'yhd.middlewares.MyCustomDownloaderMiddleware': 543,
#}

# Enable or disable extensions
# See http://scrapy.readthedocs.org/en/latest/topics/extensions.html
#EXTENSIONS = {
#    'scrapy.telnet.TelnetConsole': None,
#}

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'yhd.pipelines.MongoPipeline': 300,   # enable the MongoPipeline
}

MONGO_URI = '127.0.0.1'   # Mongo connection settings
MONGO_DB = 'yhd'

# Enable and configure the AutoThrottle extension (disabled by default)
# See http://doc.scrapy.org/en/latest/topics/autothrottle.html
# NOTE: AutoThrottle will honour the standard settings for concurrency and delay
#AUTOTHROTTLE_ENABLED=True
# The initial download delay
#AUTOTHROTTLE_START_DELAY=5
# The maximum download delay to be set in case of high latencies
#AUTOTHROTTLE_MAX_DELAY=60
# Enable showing throttling stats for every response received:
#AUTOTHROTTLE_DEBUG=False

# Enable and configure HTTP caching (disabled by default)
# See http://scrapy.readthedocs.org/en/latest/topics/downloader-middleware.html#httpcache-middleware-settings
#HTTPCACHE_ENABLED=True
#HTTPCACHE_EXPIRATION_SECS=0
#HTTPCACHE_DIR='httpcache'
#HTTPCACHE_IGNORE_HTTP_CODES=[]
#HTTPCACHE_STORAGE='scrapy.extensions.httpcache.FilesystemCacheStorage'
Once the code is written, the crawler can be started with the command scrapy crawl yhd.
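After a run you can check what landed in MongoDB; a minimal sketch with pymongo, assuming the MONGO_URI and MONGO_DB values from settings.py above:

import pymongo

client = pymongo.MongoClient('127.0.0.1')   # MONGO_URI
db = client['yhd']                          # MONGO_DB

print(db['product'].count())                # number of products stored so far
for doc in db['product'].find().limit(3):   # peek at a few documents
    print(doc)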
The complete source code can be downloaded from my GitHub: https://github.com/silvasong/yhd_scrapy