Today I wrote a Python crawler that automatically scrapes jokes from Qiushibaike (糗事百科). Since Qiushibaike does not require login, scraping it is fairly simple. The program prints one joke each time you press Enter. The code is based on http://cuiqingcai.com/990.html, but that author's code seems to have a few problems, so I made my own modifications and got it running. The code is below:
# -*- coding:utf-8 -*-
__author__ = 'Jz'

import urllib2
import re

# Qiushibaike crawler class
class QSBK:
    # initialization
    def __init__(self):
        self.pageIndex = 1
        self.user_agent = 'Mozilla/5.0 (Windows NT 6.1; WOW64)'
        self.headers = {'User-Agent': self.user_agent}
        # each element of self.joke holds the jokes of one page
        self.joke = []
        # flag controlling whether the crawler keeps running
        self.enable = False

    def getPage(self, pageIndex):
        try:
            URL = 'http://www.qiushibaike.com/hot/page/' + str(pageIndex)
            request = urllib2.Request(url = URL, headers = self.headers)
            response = urllib2.urlopen(request)
            pageContent = response.read().decode('utf-8')
            return pageContent
        except urllib2.URLError, e:
            if hasattr(e, 'reason'):
                print '段子抓取失敗,失敗緣由:', e.reason
            return None

    def getJokeList(self, pageIndex):
        pageContent = self.getPage(pageIndex)
        if not pageContent:
            print '段子獲取失敗...'
            return None
        # the content of the third group is used to check whether the joke has an attached image
        pattern = re.compile(r'<div.*?class="author">.*?<a.*?>.*?<img.*?/>\n(.*?)\n</a>.*?</div>.*?<div class="content">\n\n(.*?)\n<!--.*?-->.*?</div>'
                             + r'(.*?)class="stats">.*?<span.*?class="stats-vote"><i.*?class="number">(.*?)</i>', re.S)
        jokes = re.findall(pattern, pageContent)
        pageJokes = []
        # filter out jokes that carry images
        for joke in jokes:
            hasImg = re.search('img', joke[2])
            # joke[0] is the publisher, joke[1] is the joke text, joke[3] is the upvote count
            if not hasImg:
                pageJokes.append([joke[0].strip(), joke[1].strip(), joke[3].strip()])
        return pageJokes

    def loadPage(self):
        if self.enable == True:
            # if fewer than two pages are currently cached, load a new page
            if len(self.joke) < 2:
                pageJokes = self.getJokeList(self.pageIndex)
                if pageJokes:
                    self.joke.append(pageJokes)
                    self.pageIndex += 1

    # print one joke each time Enter is pressed
    def getOneJoke(self, pageJokes, page):
        jokes = pageJokes
        for joke in jokes:
            userInput = raw_input('請輸入回車鍵或Q/q: ')
            self.loadPage()
            if userInput == 'Q' or userInput == 'q':
                self.enable = False
                print '退出爬蟲...'
                return
            print u'段子內容:%s\n第%d頁\t發佈人:%s\t贊:%s' % (joke[1], page, joke[0], joke[2])

    def start(self):
        print '正在從糗事百科抓取段子,按回車鍵查看新段子,按Q/q退出...'
        self.enable = True
        self.loadPage()
        page = 0
        while self.enable:
            if len(self.joke) > 0:
                pageJokes = self.joke[0]
                page += 1
                # delete the page of jokes that has already been read
                del self.joke[0]
                self.getOneJoke(pageJokes, page)

spider = QSBK()
spider.start()
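To make the group numbering in getJokeList easier to follow, here is a minimal offline sketch (mine, not from the referenced post) that runs the same pattern against a hand-written HTML fragment. The fragment is fabricated purely for illustration and only mimics the structure the pattern expects; the real Qiushibaike markup may differ or may have changed since this was written.

# -*- coding:utf-8 -*-
import re

# fabricated fragment shaped like what the pattern expects (illustration only)
sample = ('<div class="author">\n'
          '<a href="#"><img src="avatar.jpg" alt="" />\n'
          'some_user\n'
          '</a>\n'
          '</div>\n'
          '<div class="content">\n'
          '\n'
          'This is the joke text\n'
          '<!-- more -->\n'
          '</div>\n'
          '<div class="stats"><span class="stats-vote"><i class="number">123</i>')

pattern = re.compile(r'<div.*?class="author">.*?<a.*?>.*?<img.*?/>\n(.*?)\n</a>.*?</div>.*?<div class="content">\n\n(.*?)\n<!--.*?-->.*?</div>'
                     + r'(.*?)class="stats">.*?<span.*?class="stats-vote"><i.*?class="number">(.*?)</i>', re.S)

for joke in re.findall(pattern, sample):
    print 'author :', joke[0].strip()                   # group 1: publisher
    print 'content:', joke[1].strip()                   # group 2: joke text
    print 'has img:', bool(re.search('img', joke[2]))   # group 3: only used to detect an image
    print 'votes  :', joke[3].strip()                   # group 4: upvote count

Running this prints the publisher, the joke text, an image flag of False, and the vote count, which is exactly how getJokeList builds each entry: any joke whose third group contains an img tag is dropped.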
The comments are included above. A few points worth noting:
1. A User-Agent header must be added to disguise the request as coming from a browser; otherwise the page content cannot be fetched.
2. When writing the regular expression, the surrounding markup has to be captured as well so that jokes with attached images can be detected and filtered out (this is what the third group and the img check do, as illustrated in the sketch after the code listing above).
3. When the joke is formatted for output in the getOneJoke function, the string literal must be prefixed with u, otherwise the following error is raised:
Traceback (most recent call last):
  File "D:\coding_file\python_file\TestPython\src\Test\QSBK.py", line 84, in <module>
    spider.start()
  File "D:\coding_file\python_file\TestPython\src\Test\QSBK.py", line 81, in start
    self.getOneJoke(pageJokes, page)
  File "D:\coding_file\python_file\TestPython\src\Test\QSBK.py", line 68, in getOneJoke
    print '段子內容:%s\n第%d頁\t發佈人:%s\t贊:%s' % (joke[1], page, joke[0], joke[2])
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe7 in position 3: ordinal not in range(128)
This happens because the page content is decoded with decode('utf-8') in getPage, so joke[0] and the other fields are unicode objects. When a plain byte-string literal containing Chinese characters is %-formatted with unicode arguments, Python 2 implicitly decodes the byte string with its default ASCII codec, which fails on the non-ASCII bytes. Prefixing the format string with u makes it a unicode literal, so no implicit decoding is needed and the formatting succeeds. A minimal reproduction follows.
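A small sketch of the failure and the fix, assuming Python 2 and a console encoding that can display Chinese; the hypothetical variable author stands in for a field such as joke[0] that has already been decoded to unicode:

# -*- coding:utf-8 -*-

# author stands in for joke[0]: a unicode object, as produced by
# response.read().decode('utf-8') in getPage
author = u'某用戶'

try:
    # byte-string literal with Chinese characters %-formatted with a unicode value:
    # Python 2 implicitly decodes the byte string using the ASCII codec and fails
    print '發佈人:%s' % author
except UnicodeDecodeError, e:
    print 'UnicodeDecodeError:', e

# with the u prefix the format string is already unicode, so no implicit decode happens
print u'發佈人:%s' % author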