Problem: UnicodeDecodeError: 'gbk' codec … when using BeautifulSoup with Python 3

Date: 2019-11-13

I wanted to convert an HTML file to plain text by calling BeautifulSoup from Python 3.

This very simple piece of code, which just opens a local file, kept failing:

    from bs4 import BeautifulSoup
    file = open('index.html')
    soup = BeautifulSoup(file, 'lxml')
    print(soup)

It raises the following error:

UnicodeDecodeError: 'gbk' codec can't decode byte 0xff in position 0: illegal multibyte sequence

Doesn't BeautifulSoup claim to handle all kinds of encodings? Why does parsing still fail?

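I searched through a lot of material on BeautifulSoup without finding a fix. Then I suddenly noticed what happens when the code is written like this: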

    from bs4 import BeautifulSoup
    file = open('index.html')
    str1 = file.read()  # The error is raised on this line!
    soup = BeautifulSoup(str1, 'lxml')
    print(soup)

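So that's it! The problem lies in reading the file, not in BeautifulSoup's parsing. Without an encoding argument, open() decodes the file with the platform's default encoding, which is gbk on a Chinese-locale Windows system. Byte 0xff at position 0 most likely means the file starts with a UTF-16 LE byte order mark (FF FE), which is not valid gbk. Because the file was opened in text mode, Python tries to decode it before BeautifulSoup ever sees the data.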
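The failure can be reproduced without BeautifulSoup at all. The sketch below is only an illustration under assumptions, not the original code: it assumes index.html was saved as UTF-16 LE with a byte order mark, writes such a file into a temporary directory, and passes encoding='gbk' explicitly to stand in for the default that open() would pick on a Chinese-locale Windows system.

```python
import codecs
import os
import tempfile

# Assumption: the original index.html was UTF-16 LE with a BOM (bytes FF FE).
html = "<html><body><p>Hello</p></body></html>"
raw = codecs.BOM_UTF16_LE + html.encode("utf-16-le")

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "index.html")
    with open(path, "wb") as f:
        f.write(raw)

    # 'gbk' is passed explicitly to mimic the locale default on such a system.
    try:
        with open(path, encoding="gbk") as f:
            f.read()
        error = None
    except UnicodeDecodeError as exc:
        error = exc

    # Reading with the matching encoding succeeds and consumes the BOM.
    with open(path, encoding="utf-16") as f:
        text = f.read()

print(error)
print(text)
```

The first read fails on the very first byte, while the second read returns the markup unchanged.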

OK, after looking into why reading the file failed, here is the working fix straight away, again in four lines of code:

    from bs4 import BeautifulSoup
    file = open('index.html', 'r', encoding='utf-16-le')
    soup = BeautifulSoup(file, 'lxml')
    print(soup)
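When the encoding of a file is not known in advance, one option is to look at its first bytes for a byte order mark before choosing the encoding to pass to open(). The following is a minimal sketch using only the standard library; guess_encoding_from_bom is a hypothetical helper name, and it only recognises files that actually start with a BOM, falling back to an assumed default otherwise.

```python
import codecs

def guess_encoding_from_bom(raw, default="utf-8"):
    """Return a codec name based on a leading byte order mark, if any.

    Hypothetical helper for illustration; it only inspects the BOM.
    """
    # UTF-32 must be checked first: its LE BOM starts with the UTF-16 LE BOM.
    if raw.startswith((codecs.BOM_UTF32_LE, codecs.BOM_UTF32_BE)):
        return "utf-32"
    if raw.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    if raw.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    return default

# Sample bytes standing in for the first bytes of a UTF-16 LE file with a BOM.
raw = codecs.BOM_UTF16_LE + "<p>Hello</p>".encode("utf-16-le")
encoding = guess_encoding_from_bom(raw)
text = raw.decode(encoding)

# 'utf-16' consumes the BOM; 'utf-16-le' keeps it as U+FEFF at the start.
text_le = raw.decode("utf-16-le")
print(encoding, repr(text), repr(text_le))
```

Note the difference shown in the last lines: decoding with 'utf-16' strips the byte order mark, whereas 'utf-16-le' leaves a U+FEFF character at the start of the string.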

After that, soup.get_text() returns the text inside the tags.

Other notes

If the file contains a mix of encodings and decoding raises an error, the undecodable bytes can be ignored as follows (untested):

    # content must be raw bytes, e.g. open('index.html', 'rb').read()
    soup = BeautifulSoup(content.decode('utf-8', 'ignore'), 'lxml')
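To make the effect of the 'ignore' error handler concrete, here is a small standard-library sketch. The sample bytes are made up for illustration: mostly valid UTF-8 with one stray 0xff byte. It also shows 'replace', which keeps a visible U+FFFD marker where data was lost instead of dropping it silently.

```python
# Assumption: content holds raw bytes that are mostly UTF-8 with one invalid byte.
content = b"<p>caf\xc3\xa9 \xff ok</p>"

# 'ignore' silently drops bytes that cannot be decoded.
ignored = content.decode("utf-8", "ignore")

# 'replace' substitutes U+FFFD so the loss stays visible in the output.
replaced = content.decode("utf-8", "replace")

print(repr(ignored))
print(repr(replaced))
```

Either way some data is lost, so this is best treated as a fallback for when the real encoding cannot be determined.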
