I wanted to convert an HTML file to plain text by calling BeautifulSoup from Python 3.
This very simple code, which just opens a local file, kept failing:
from bs4 import BeautifulSoup
file = open('index.html')
soup = BeautifulSoup(file, 'lxml')
print(soup)
It raised the following error:
UnicodeDecodeError: 'gbk' codec can't decode byte 0xff in position 0: illegal multibyte sequence
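Doesn't BeautifulSoup claim to handle all kinds of encodings? Why would it still fail to parse the file?
I searched through a lot of BeautifulSoup material without finding a solution. Then I noticed what happens if the code is written like this: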
from bs4 import BeautifulSoup
file = open('index.html')
str1 = file.read()  # the error is raised on this line!
soup = BeautifulSoup(str1, 'lxml')
print(soup)
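So that's it! The problem is in reading the file, not in BeautifulSoup's parsing. Without an encoding argument, open() decodes the file in text mode using the platform default (GBK here), so the failure happens before BeautifulSoup ever sees the data.
After looking into why the file read fails, here is the fix, again in four lines of code: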
from bs4 import BeautifulSoup
file = open('index.html', 'r', encoding='utf-16-le')
soup = BeautifulSoup(file, 'lxml')
print(soup)
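A side note on why 'utf-16-le' works here: a byte 0xff at position 0 is consistent with a UTF-16 little-endian byte order mark (FF FE), although the post does not confirm that this is what the file contains. The sketch below uses only the standard library to pick an encoding from a BOM before decoding. The helper name decode_with_bom and the 'utf-8' fallback are assumptions made for illustration; neither is part of BeautifulSoup nor of the original code.

```python
import codecs

# Hypothetical helper: choose a codec from the byte order mark, if any.
# UTF-32 is checked first because its little-endian BOM starts with the
# same two bytes (FF FE) as the UTF-16 little-endian BOM.
BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32"),
    (codecs.BOM_UTF32_BE, "utf-32"),
    (codecs.BOM_UTF8, "utf-8-sig"),
    (codecs.BOM_UTF16_LE, "utf-16"),
    (codecs.BOM_UTF16_BE, "utf-16"),
]


def decode_with_bom(raw, fallback="utf-8"):
    """Decode raw bytes, using the BOM to pick the codec when one is present."""
    for bom, encoding in BOMS:
        if raw.startswith(bom):
            # These codecs consume the BOM, so it does not end up in the text.
            return raw.decode(encoding)
    # No BOM found: fall back to an assumed default encoding.
    return raw.decode(fallback)


# Simulated file content: a UTF-16 LE BOM followed by UTF-16 LE text.
sample = codecs.BOM_UTF16_LE + "<p>測試</p>".encode("utf-16-le")
text = decode_with_bom(sample)
```

In real use the bytes would come from open('index.html', 'rb').read(), and the decoded string could then be passed to BeautifulSoup. Note that decoding with 'utf-16-le' leaves the BOM in the text as a leading U+FEFF character, whereas 'utf-16' consumes it.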
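After that, soup.get_text() returns the text inside the tags.
If the file mixes several encodings and still raises errors, the undecodable bytes can be ignored as follows (I have not tested this):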
# content is assumed to be the raw bytes of the file, e.g. read with open('index.html', 'rb')
soup = BeautifulSoup(content.decode('utf-8', 'ignore'))
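Since the 'ignore' approach is untested, here is a small standard-library sketch of what that error handler does to bytes that are not valid UTF-8. The sample bytes are made up for illustration. The 'replace' handler is shown next to it as an alternative that marks the bad bytes instead of silently dropping them.

```python
# Valid UTF-8 text with one stray byte (0xff is never valid in UTF-8).
raw = "abc".encode("utf-8") + b"\xff" + "測試".encode("utf-8")

# 'ignore' silently drops the undecodable byte.
ignored = raw.decode("utf-8", "ignore")

# 'replace' substitutes U+FFFD so the data loss stays visible.
replaced = raw.decode("utf-8", "replace")

print(repr(ignored))
print(repr(replaced))
```

Either string can then be handed to BeautifulSoup. Keep in mind that if the file is not actually UTF-8 (for example UTF-16, as in the case above), 'ignore' may discard or mangle most of the content rather than just a few stray bytes.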