One exercise a day, one blog post a day.
Scrapy is a fast, high-level screen-scraping and web-crawling framework written in Python, used to crawl websites and extract structured data from their pages. Scrapy is broadly applicable: data mining, monitoring, and automated testing, among other uses.
1. Pick the target site: Douban Movie Top 250, http://movie.douban.com/top250
2. Create the Scrapy project: scrapy startproject doubanmovie
3. Configure settings.py:
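After the command runs, `scrapy startproject` generates a project skeleton. With the default template it typically looks like this (the exact files can vary slightly between Scrapy versions):

```
doubanmovie/
    scrapy.cfg            # deploy/run configuration
    doubanmovie/
        __init__.py
        items.py          # item definitions
        pipelines.py
        settings.py       # project settings
        spiders/          # spider modules live here
            __init__.py
```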
```python
BOT_NAME = 'doubanmovie'

SPIDER_MODULES = ['doubanmovie.spiders']
NEWSPIDER_MODULE = 'doubanmovie.spiders'

USER_AGENT = 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_3) AppleWebKit/536.5 (KHTML, like Gecko) Chrome/19.0.1084.54 Safari/536.5'

# Write the scraped data to douban.csv
FEED_URI = u'file:///G:/program/doubanmovie/douban.csv'
FEED_FORMAT = 'CSV'
```
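Once the spider (created below) exists, it is run from the project root; with FEED_URI / FEED_FORMAT set, Scrapy writes the exported rows straight to the CSV. The output file can also be given on the command line with `-o` instead of the FEED_* settings:

```shell
scrapy crawl douban
# or, equivalently, without the FEED_* settings:
scrapy crawl douban -o douban.csv
```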
4. Define the data in items.py:
```python
from scrapy import Item, Field


class DoubanmovieItem(Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = Field()      # movie title
    movieInfo = Field()  # movie details
    star = Field()       # rating
    quote = Field()      # signature quote
```
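A Scrapy `Item` behaves like a dict with a fixed set of keys. As a rough illustration of how one populated item maps onto a row of the CSV that FEED_URI points to (using a plain dict stand-in so no Scrapy install is needed; the sample values are made up):

```python
import csv
import io

# Plain-dict stand-in for a populated DoubanmovieItem (keys match items.py)
item = {
    'title': '肖申克的救贖 / The Shawshank Redemption',
    'movieInfo': '導演: Frank Darabont;1994;美國',
    'star': '9.6',
    'quote': '希望讓人自由。',
}

# One header row, then one row per item -- the shape Scrapy's CSV export produces
buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=['title', 'movieInfo', 'star', 'quote'])
writer.writeheader()
writer.writerow(item)
print(buf.getvalue())
```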
5. Create the spider doubanspider.py:
```python
from scrapy.http import Request
from scrapy.selector import Selector
from scrapy.spiders import CrawlSpider

from doubanmovie.items import DoubanmovieItem


class Douban(CrawlSpider):
    name = "douban"
    start_urls = ['http://movie.douban.com/top250']
    url = 'http://movie.douban.com/top250'

    def parse(self, response):
        item = DoubanmovieItem()
        selector = Selector(response)
        movies = selector.xpath('//div[@class="info"]')
        for each_movie in movies:
            # The title may be split across several <span> elements
            title = each_movie.xpath('div[@class="hd"]/a/span/text()').extract()
            full_title = ''.join(title)
            movie_info = each_movie.xpath('div[@class="bd"]/p/text()').extract()
            star = each_movie.xpath('div[@class="bd"]/div[@class="star"]/span/em/text()').extract()[0]
            quote = each_movie.xpath('div[@class="bd"]/p[@class="quote"]/span/text()').extract()
            # quote may be empty, so check before indexing
            quote = quote[0] if quote else ''
            item['title'] = full_title
            item['movieInfo'] = ';'.join(movie_info)
            item['star'] = star
            item['quote'] = quote
            yield item
        next_link = selector.xpath('//span[@class="next"]/link/@href').extract()
        # Page 10 is the last page and has no "next" link
        if next_link:
            yield Request(self.url + next_link[0], callback=self.parse)
```
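The per-movie logic inside parse() -- joining the title fragments and guarding against a missing quote -- can be sketched on its own without Scrapy (`build_item` is a hypothetical helper written for this illustration, not part of the spider):

```python
def build_item(title_parts, movie_info, star, quote_list):
    """Assemble one item dict the way parse() does for each movie block."""
    return {
        'title': ''.join(title_parts),                 # title arrives as <span> fragments
        'movieInfo': ';'.join(movie_info),
        'star': star,
        'quote': quote_list[0] if quote_list else '',  # quote may be missing entirely
    }


# A movie with a quote, and one without
item1 = build_item(['肖申克的救贖', ' / The Shawshank Redemption'],
                   ['1994 / 美國'], '9.6', ['希望讓人自由。'])
item2 = build_item(['某電影'], ['2000'], '8.0', [])
print(item1['title'])
print(repr(item2['quote']))  # empty string instead of an IndexError
```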
6. Crawl results: if the exported CSV shows garbled characters, open it in Excel and re-save the file with "utf-8" encoding.
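Instead of re-saving by hand, the CSV can also be re-encoded with a UTF-8 BOM (the `utf-8-sig` codec), which lets Excel detect the encoding automatically. A small sketch; the file paths are placeholders:

```python
def add_bom_for_excel(src_path, dst_path):
    """Re-encode a UTF-8 CSV as UTF-8-with-BOM so Excel opens it cleanly."""
    with open(src_path, encoding='utf-8') as src:
        text = src.read()
    with open(dst_path, 'w', encoding='utf-8-sig') as dst:
        dst.write(text)


# Example (placeholder paths):
# add_bom_for_excel('douban.csv', 'douban_excel.csv')
```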