First, create a new Scrapy project. If you don't know how to create one, see the earlier article on crawling the Douban top movies.
The directory structure is as follows:
Since I am only crawling questions, the item contains just a single title field. After a quick look at the item definition, we go straight to the zhihu_spider.py code.
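The original post does not show items.py, but with only a title to carry, it would be close to this minimal sketch (my reconstruction, not the author's file):

```python
# items.py -- minimal item definition with a single title field,
# matching the `from zhihu.items import ZhihuItem` import used by the spider
import scrapy


class ZhihuItem(scrapy.Item):
    title = scrapy.Field()
```

zhihu_spider.py: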
```python
# -*- coding: utf-8 -*-
from scrapy.spiders import BaseSpider
from scrapy.selector import HtmlXPathSelector
from zhihu.items import ZhihuItem
import scrapy
import sys

reload(sys)
sys.setdefaultencoding("utf-8")  # make sure the default encoding is utf-8


class ZhihuSpider(BaseSpider):
    """Crawl question titles under a Zhihu topic, page by page."""
    name = "zhihu"
    allowed_domains = ["zhihu.com"]
    # one start URL per listing page of the topic
    start_urls = ["http://www.zhihu.com/topic/19550517/questions?page=" + str(page)
                  for page in range(1, 21500)]

    def parse(self, response):
        hax = HtmlXPathSelector(response)
        # each question title is the <a> text inside a <div><h2>
        for seq in hax.xpath('//div/h2'):
            item = ZhihuItem()
            item['title'] = seq.xpath('a/text()').extract()
            yield item
```
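The next step reads the questions from apart.txt, but the post does not show how the scraped titles get written into that file. One simple way is an item pipeline; the sketch below is my own assumption (the class name and file handling are not from the original post):

```python
# pipelines.py -- hypothetical pipeline that appends every scraped title
# to apart.txt (this step is assumed; it is not shown in the original post)

class ZhihuTitlePipeline(object):
    def open_spider(self, spider):
        self.f = open('apart.txt', 'a')

    def close_spider(self, spider):
        self.f.close()

    def process_item(self, item, spider):
        # extract() returns a list of unicode strings; write one title per line
        for title in item['title']:
            self.f.write(title.encode('utf-8') + '\n')
        return item
```

Enable it through ITEM_PIPELINES in settings.py and run scrapy crawl zhihu; every question title then ends up on its own line in apart.txt.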
Then read these questions back from the apart.txt file, segment them into words, and count each word.
Two libraries are used here: redis and the jieba word segmenter. For redis you need the Redis database installed locally or on a server; for how to use Redis, check the official documentation.
Redis website: http://redis.io/
The redis-py library for Python: https://github.com/andymccurdy/redis-py
Neither library ships with Python, so you need to install them yourself.
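Before counting, it helps to see what jieba.cut actually produces; here is a quick check on a made-up question title (the sample sentence and its segmentation are my own, not output from the crawl):

```python
# -*- coding: utf-8 -*-
# jieba.cut() returns a generator of unicode tokens
import jieba

sample = u"北京哪里租房比较便宜"
print "/".join(jieba.cut(sample))
# prints the title split into words, roughly: 北京/哪里/租房/比较/便宜
```

Now the counting script: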
```python
# -*- coding: utf-8 -*-
import jieba
import redis


def getwords(doc):
    """Segment every line of the file and count each word in Redis."""
    f = open(doc, 'r')
    for line in f.readlines():
        words = jieba.cut(line)
        for word in words:
            # increment the counter if the word has been seen before, otherwise start at 1
            if r.exists(word):
                r.incr(word)
            else:
                r.set(word, 1)
    f.close()


r = redis.Redis(host='127.0.0.1', port=6379, db=3)
getwords('apart.txt')
```

Finally, take the segmented words out of the Redis database and sort them:

```python
# -*- coding: utf-8 -*-
import redis

r = redis.Redis('127.0.0.1', 6379, db=3)
print "size:", r.dbsize()

keys = r.keys()
fc = {}
for key in keys:
    value = int(r.get(key))
    # only keep words whose count falls between 100 and 3000,
    # dropping rare words as well as overly common ones
    if 100 <= value <= 3000:
        fc[key] = value

# sort by count, highest first
dic = sorted(fc.iteritems(), key=lambda kv: kv[1], reverse=True)
for word, count in dic:
    print word, count
```
Let's look at the final result (partial):