Scraping code snippets from CodeSnippet
<body>
  <div id="container">
    <div class="content bor round">
      <ul>
        <li class="con-logo bbor">
          <a href="http://www.codesnippet.cn/index.html" title="分享你的世界"></a>
        </li>
        <li class="con-code bbor">
          <pre class="brush:php;"> <!-- code block --> </pre>
        </li>
        <li class="con-btn bbor">
          <ul>
            <li><a href="http://www.codesnippet.cn/pcode.html" class="button">發佈代碼片斷</a></li>
            <li><a href="http://www.codesnippet.cn/list.html" class="button">片斷列表</a></li>
          </ul>
          <br class="clearfloat" />
        </li>
        <li class="con-motto bbor">
          <div>一個線程若是是我的英雄主義,那麼多線程就是集體主義,你再也不是一個獨行俠,而是一個指揮家。</div>
        </li>
        <li class="con-count bbor">
          <div> 共有<span> {15106} </span>個代碼片斷 </div>
        </li>
        <li class="con-copyright">
          <div>京ICP備13038605號 <script src="http://s14.cnzz.com/stat.php?id=4720394&web_id=4720394" language="JavaScript"></script> </div>
        </li>
      </ul>
    </div>
  </div>
</body>
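The content we want to scrape sits inside the <li class="con-code bbor"> element.
So we use BeautifulSoup's find() method to locate that tag and then read its text content.
First, import the two modules the crawler needs: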
from urllib2 import urlopen
from bs4 import BeautifulSoup

# Scrape the code snippet shown on http://www.codesnippet.cn/index.html
def GrapIndex():
    html = "http://www.codesnippet.cn/index.html"
    bsObj = BeautifulSoup(urlopen(html), 'html.parser')
    return bsObj.find("li", {"class":"con-code bbor"}).get_text()
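To see what the find("li", {"class": "con-code bbor"}).get_text() lookup in GrapIndex() does without touching the network, here is a minimal offline sketch. It is not BeautifulSoup: it is a stand-in built on the Python 3 standard library's html.parser, and the names ClassTextExtractor, VOID_TAGS, and SAMPLE are hypothetical, made up for this illustration. It only assumes the target element is matched by tag name plus the exact class attribute string.

```python
from html.parser import HTMLParser  # Python 3 standard library

# Tags that never have a closing tag, so they must not change nesting depth
VOID_TAGS = {"br", "img", "hr", "input", "meta", "link"}


class ClassTextExtractor(HTMLParser):
    """Collect the text inside the first tag whose class attribute matches."""

    def __init__(self, tag, class_value):
        super().__init__()
        self.tag = tag
        self.class_value = class_value
        self.depth = 0      # greater than 0 while inside the matching element
        self.done = False   # True once the matching element has been closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done:
            return
        if self.depth > 0:
            if tag not in VOID_TAGS:
                self.depth += 1
        elif tag == self.tag and dict(attrs).get("class") == self.class_value:
            self.depth = 1

    def handle_endtag(self, tag):
        if self.depth > 0 and tag not in VOID_TAGS:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0:
            self.chunks.append(data)


# Hypothetical, trimmed-down version of the page structure shown earlier
SAMPLE = (
    '<ul><li class="con-logo bbor">logo</li>'
    '<li class="con-code bbor"><pre class="brush:php;">echo "hi";</pre></li></ul>'
)

parser = ClassTextExtractor("li", "con-code bbor")
parser.feed(SAMPLE)
parser.close()
text = "".join(parser.chunks)
print(text)
```

The sketch skips the first li because its class string differs, then gathers every piece of text nested under the matching li, which is roughly the behaviour the scraper relies on.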
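Once we have the data we want, the next step would normally be to write it to a database. Since the data scraped here is simple, writing it to a file is enough.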
def SaveResult():
    codeFile=open("code.txt", "a")  # append mode
    # GrapIndex() returns a string, so this loop writes it one character at a time
    for list in GrapIndex():
        codeFile.write(list)
    codeFile.close()
Running this raises the following error:
UnicodeEncodeError: 'ascii' codec can't encode character u'\u751f' in position 0: ordinal not in range(128)
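Python 2.7 falls back to ASCII when it has to encode text implicitly. When the text contains characters outside the ASCII range, it raises this exception (ordinal not in range(128)). The fix used here is to change the default encoding to UTF-8: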
import sys
reload(sys)
sys.setdefaultencoding('utf-8')
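As an alternative to changing the interpreter-wide default, the text can be encoded explicitly at the point where it is written. The sketch below is only an illustration under assumptions: save_text is a hypothetical helper that is not part of the original script, and the file is written to a temporary directory instead of the working directory. It reproduces the ASCII failure for the character in the error message and then stores the same character as UTF-8 with io.open, which is available in both Python 2.7 and Python 3. Note that reload(sys) and sys.setdefaultencoding do not exist in Python 3, so the explicit form is also the one that carries over.

```python
# -*- coding: utf-8 -*-
import io
import os
import tempfile

text = u"\u751f"  # the character named in the error message

# Encoding with the ASCII codec is what fails in the traceback
try:
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError:
    ascii_failed = True


def save_text(path, content):
    """Append Unicode text to path, encoding it as UTF-8 explicitly.

    Hypothetical helper for illustration only.
    """
    with io.open(path, "a", encoding="utf-8") as code_file:
        code_file.write(content)


# Write to a temporary directory so the example leaves no file behind in the cwd
path = os.path.join(tempfile.mkdtemp(), "code.txt")
save_text(path, text)

# Read the raw bytes back to confirm the file holds UTF-8
with io.open(path, "rb") as raw:
    stored_bytes = raw.read()
```

With this approach the encoding choice lives next to the file it applies to, and no global interpreter setting is modified.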
Putting it all together, the complete script is:

from urllib2 import urlopen
from bs4 import BeautifulSoup
import os
import sys

# Work around the UnicodeEncodeError under Python 2.7
reload(sys)
sys.setdefaultencoding('utf-8')

def GrapIndex():
    html = "http://www.codesnippet.cn/index.html"
    bsObj = BeautifulSoup(urlopen(html), 'html.parser')
    return bsObj.find("li", {"class":"con-code bbor"}).get_text()

def SaveResult():
    codeFile=open("code.txt", "a")
    for list in GrapIndex():
        codeFile.write(list)
    codeFile.close()

if __name__ == '__main__':
    # Fetch the page and append the snippet nine times
    for i in range(0,9):
        SaveResult()