Previously I used Scrapy to build a very simple crawler that scraped the information for every programming book on Douban (since we don't need to follow all the links on each page, there is no need for BFS or DFS; the crawler simply fetches the next page in sequence).
This time the same thing is implemented with Python's built-in modules urllib and urllib2, again with Douban's romance-film listings as the crawl target. The method and process are essentially identical: send a request for each page, read the HTML source from the response, match out the fields we want with regular expressions, then grab the link to the next page and keep crawling.
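Before the full class, here is a minimal sketch of that fetch-match-follow loop (Python 2, matching the urllib2 code below; the regexes here are simplified and purely illustrative):

# Minimal sketch of the loop described above (Python 2).
import re
import urllib2

url = 'http://movie.douban.com/tag/%E7%88%B1%E6%83%85?start=0&type=T'
while url:
    body = urllib2.urlopen(url).read()                      # fetch the page source
    titles = re.findall('title="(.*?)">', body, re.S)       # pull fields out with a regex
    nxt = re.findall('<link.*?rel="next".*?href="(.*?)"/>', body, re.S)
    url = nxt[0] if nxt else None                           # follow the next-page link; stop when absent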
The page to crawl:
Page source:
title and link:
desc:
Once the pages and fields to scrape are settled, we can start writing code. A single class does all the work, and the scraped items end up in movie.json. The code is as follows:
# -*- coding:utf-8 -*-
import urllib
import urllib2
import re
import json

class FilmSpider:
    def __init__(self, init_url_):
        self.start_url = init_url_
        self.contain = []
        self.curPage = 1

    def Parse(self):
        nexturl = self.start_url
        while nexturl is not None:
            nexturl = self.parsePage(nexturl)
            self.curPage += 1
        f = open('movie.json', 'w')  # 'w' so reruns don't append to stale output
        # JsonLinesItemExporter style: one JSON object per line
        """
        for each_item in self.contain:
            tar = json.dumps(each_item, ensure_ascii=False)
            f.write(tar + '\n')
        """
        # JsonItemExporter style: all items collected in one JSON array
        first_item = True
        f.write('[')
        for each_item in self.contain:
            if first_item:
                first_item = False
            else:
                f.write(',\n')
            tar = json.dumps(each_item, ensure_ascii=False)
            f.write(tar)
        f.write(']')
        f.close()

    def parsePage(self, cururl):
        print('sola is crawling page %d: ' % self.curPage + cururl)
        response = urllib2.urlopen(cururl)
        body = response.read()
        # list of the strings inside each <tr class="item"></tr>
        myItems = re.findall('<tr class="item">(.*?)</tr>', body, re.S)
        for item in myItems:
            # list of (link, title) tuples; in practice it holds exactly one
            info = re.findall('<a.*?class="nbg".*?href="(.*?)".*?title="(.*?)">', item, re.S)
            # list with the description inside <p class="pl">; likewise a single element
            desc = re.findall('<p.*?class="pl">(.*?)</p>', item, re.S)
            newItem = {}
            newItem['link'] = info[0][0]
            newItem['title'] = info[0][1]
            newItem['desc'] = desc[0]
            self.contain.append(newItem)
            print(newItem['title'])
            print(newItem['link'])
            print(newItem['desc'])
            print('\n\n------------------------------------------------------------------------------------------------\n\n')
        # link to the next page, if there is one
        Next = re.findall('<span.*?class="next">.*?<link.*?rel="next".*?href="(.*?)"/>', body, re.S)
        if Next:
            return Next[0]
        return None

#-------------------------Main--------------------
#program: spider for Love in Douban
#author : Patrick
#-------------------------------------------------
print('-----------------Start crawl----------------------')
initurl = 'http://movie.douban.com/tag/%E7%88%B1%E6%83%85?start=0&type=T'
solaSpider = FilmSpider(initurl)
solaSpider.Parse()
print('-----------------End crawl------------------------')
The code shows two ways of writing the JSON file, corresponding to the two export styles in the Scrapy framework (JsonLinesItemExporter and JsonItemExporter, both of which inherit from BaseItemExporter):
1. JsonLinesItemExporter: each item is converted to JSON and written out on its own line. The corresponding Scrapy code is as follows:
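(The original snippet isn't preserved here; below is a minimal pipeline sketch following the pattern from the Scrapy exporter docs. The class name and output file name are illustrative, and older Scrapy versions import from scrapy.contrib.exporter rather than scrapy.exporters.)

# Sketch of a Scrapy pipeline using JsonLinesItemExporter (illustrative names).
from scrapy.exporters import JsonLinesItemExporter

class JsonLinesExportPipeline(object):
    def open_spider(self, spider):
        self.file = open('movie.json', 'wb')
        self.exporter = JsonLinesItemExporter(self.file, ensure_ascii=False)
        self.exporter.start_exporting()

    def process_item(self, item, spider):
        self.exporter.export_item(item)   # one JSON object per line
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()
        self.file.close()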
2. JsonItemExporter: all the JSON-converted items are collected into a single list. The corresponding Scrapy code is as follows:
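(Again a hedged sketch rather than the original snippet; same illustrative names as above.)

# Sketch of the same pipeline with JsonItemExporter, which wraps all items in one JSON array.
from scrapy.exporters import JsonItemExporter

class JsonExportPipeline(object):
    def open_spider(self, spider):
        self.file = open('movie.json', 'wb')
        self.exporter = JsonItemExporter(self.file, ensure_ascii=False)
        self.exporter.start_exporting()   # writes the opening '['

    def process_item(self, item, spider):
        self.exporter.export_item(item)   # items separated by commas
        return item

    def close_spider(self, spider):
        self.exporter.finish_exporting()  # writes the closing ']'
        self.file.close()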
All done; the resulting movie.json looks like this:
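(The screenshot of the output isn't reproduced here; each entry in the array has the shape below, with placeholder values.)

[
  {"link": "http://movie.douban.com/subject/.../", "title": "...", "desc": "..."},
  ...
]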
Below is a multi-threaded version of the crawler. It uses 5 threads and is much faster than the single-threaded version above.
# -*- coding:utf-8 -*-
import urllib
import urllib2
import re
import json
import thread
import threading

class FilmSpider:
    def __init__(self, init_url_):
        self.start_url = init_url_
        self.contain = []
        self.curPage = 0

    def Parse(self):
        # start 5 worker threads, one per seed URL
        i = 0
        th = []
        while i < 5:
            pid = threading.Thread(target=self.parsePage, args=(self.start_url[i], i))
            pid.start()
            th.append(pid)
            i = i + 1
        # wait for all workers to finish before writing the output
        i = 0
        while i < 5:
            th[i].join()
            i = i + 1
        f = open('movie.json', 'w')  # 'w' so reruns don't append to stale output
        # JsonLinesItemExporter style: one JSON object per line
        """
        for each_item in self.contain:
            tar = json.dumps(each_item, ensure_ascii=False)
            f.write(tar + '\n')
        """
        # JsonItemExporter style: all items collected in one JSON array
        first_item = True
        f.write('[')
        for each_item in self.contain:
            if first_item:
                first_item = False
            else:
                f.write(',\n')
            tar = json.dumps(each_item, ensure_ascii=False)
            f.write(tar)
        f.write(']')
        f.close()

    def parsePage(self, cururl, ID):
        print('sola is crawling page %d: ' % ID + cururl)
        response = urllib2.urlopen(cururl)
        body = response.read()
        # list of the strings inside each <tr class="item"></tr>
        myItems = re.findall('<tr class="item">(.*?)</tr>', body, re.S)
        for item in myItems:
            # list of (link, title) tuples; in practice it holds exactly one
            info = re.findall('<a.*?class="nbg".*?href="(.*?)".*?title="(.*?)">', item, re.S)
            # list with the description inside <p class="pl">; likewise a single element
            desc = re.findall('<p.*?class="pl">(.*?)</p>', item, re.S)
            newItem = {}
            newItem['link'] = info[0][0]
            newItem['title'] = info[0][1]
            newItem['desc'] = desc[0]
            self.contain.append(newItem)
            """
            print(newItem['title'])
            print(newItem['link'])
            print(newItem['desc'])
            print('\n\n------------------------------------------------------------------------------------------------\n\n')
            """
        # each thread computes its own next page: ID, ID+5, ID+10, ...
        Next = 'http://movie.douban.com/tag/%E7%88%B1%E6%83%85?start='
        Next += str((ID + 5) * 20) + '&type=T'
        if ID + 5 > 274:
            thread.exit()  # past the last page (index 274): terminate this thread
        self.parsePage(Next, ID + 5)

#-------------------------Main--------------------
#program: spider for Love in Douban
#author : Patrick
#-------------------------------------------------
print('-----------------Start crawl----------------------')
initurl = ['http://movie.douban.com/tag/%E7%88%B1%E6%83%85?start=0&type=T',
           'http://movie.douban.com/tag/%E7%88%B1%E6%83%85?start=20&type=T',
           'http://movie.douban.com/tag/%E7%88%B1%E6%83%85?start=40&type=T',
           'http://movie.douban.com/tag/%E7%88%B1%E6%83%85?start=60&type=T',
           'http://movie.douban.com/tag/%E7%88%B1%E6%83%85?start=80&type=T']
solaSpider = FilmSpider(initurl)
solaSpider.Parse()
print('-----------------End crawl------------------------')
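One design note: thread i starts at offset i*20 and then walks pages i, i+5, i+10, ... by computing the next start offset itself, so the five threads cover all pages without overlapping. Appending to self.contain from several threads works here because CPython's GIL makes list.append atomic; a lock would be the safer general choice. Since the five seed URLs differ only in their start offset, they could also be generated instead of hard-coded; a small sketch producing the same URLs as above:

# Build the five seed URLs from the shared pattern instead of listing them by hand.
base = 'http://movie.douban.com/tag/%E7%88%B1%E6%83%85?start=%d&type=T'
initurl = [base % (i * 20) for i in range(5)]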