Official Chinese documentation for BeautifulSoup: http://www.crummy.com/software/BeautifulSoup/bs4/doc/index.zh.html
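1. Read the URL with urlopen, then pass the result to BeautifulSoup. BeautifulSoup tries to guess the page's encoding, which works fine for UTF-8 pages; when the page is in GBK or a similar encoding and the guess is wrong, it prints:
WARNING:root:Some characters could not be decoded, and were replaced with REPLACEMENT CHARACTER.
In other words, BeautifulSoup cannot make sense of the page's bytes, so it cannot parse the page properly.
For example, http://www.sina.com.cn/ uses gb2312 (why can't it just use UTF-8? What a waste of my time!!)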
fg = urllib.request.urlopen('http://www.sina.com.cn/')
soup = BeautifulSoup(fg)
This produces the WARNING shown above.
If you replace Sina with Baidu, it works normally. As for how to read Sina, click here.
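The warning can be reproduced without touching the network. The following is only a minimal sketch, assuming the failure amounts to GB2312 bytes being decoded as UTF-8 with replacement; the sample string and variable names are made up for illustration and do not come from the Sina page.

```python
# Hypothetical sample: a short Chinese title, encoded the way a GB2312 page would be.
text = '新浪首页'
raw = text.encode('gb2312')

# Decoding with the wrong codec: bytes that are not valid UTF-8
# are replaced with U+FFFD (REPLACEMENT CHARACTER).
wrong = raw.decode('utf-8', errors='replace')

# Decoding with the right codec recovers the original text.
right = raw.decode('gb2312')

print(repr(wrong))
print(right)
```

The first print shows the garbled result containing replacement characters, which is the situation the WARNING describes; the second prints the original title.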
2. Override the encoding BeautifulSoup uses
c.BeautifulSoup(page, from_encoding='gb2312')
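Here is a small offline sketch of the same call, assuming bs4 is installed and using Python's built-in 'html.parser'. The `page` bytes are a hypothetical stand-in for what `urlopen(...).read()` would return from a GB2312 site, not real Sina content.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in for the raw bytes of a GB2312-encoded page.
html = '<html><head><title>新浪首页</title></head><body><p>新闻</p></body></html>'
page = html.encode('gb2312')

# Tell BeautifulSoup which encoding to use instead of letting it guess.
soup = BeautifulSoup(page, 'html.parser', from_encoding='gb2312')

print(soup.original_encoding)  # the encoding BeautifulSoup ended up using
print(soup.title.string)
```

If a page contains characters outside GB2312, passing from_encoding='gb18030' may be worth trying, since GB18030 is a superset of both GB2312 and GBK.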