Python3.x: Fixing Garbled Chinese Text in BeautifulSoup()

Date: 2019-11-17

Problem:

When BeautifulSoup parses a fetched web page, the Chinese text comes out garbled.

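Solution:

This one was rather odd. I used chardet to detect the page's encoding and passed the result to the BeautifulSoup constructor through the from_encoding= parameter, yet the output was still garbled.

Out of options, I searched around online. I never did work out the root cause, but I did find a fix.

I'm sharing it here for anyone who runs into the same problem.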

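If the Chinese page is encoded as gb2312 or gbk, passing from_encoding="gb18030" to the BeautifulSoup constructor fixes the garbled text. gb18030 is a superset of both encodings, so it can decode characters that a strict gb2312 decoder rejects.

Even if the page being parsed is actually utf8, passing gb18030 should not produce garbled text.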
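Why gb18030 helps where gb2312 fails can be checked without any network access. The following is a minimal sketch, not the code from the original investigation. It assumes the page contains characters that gbk can represent but gb2312 cannot, which is one plausible reason a page detected as GB2312 still fails to decode. The traditional characters 亂碼 are used purely as sample data.

```python
# Sample text chosen for illustration: these traditional characters
# exist in gbk and gb18030 but not in the smaller gb2312 set.
sample_text = "亂碼"
raw_bytes = sample_text.encode("gbk")

# Decoding with the superset encoding recovers the original text.
decoded = raw_bytes.decode("gb18030")

# Decoding the same bytes strictly as gb2312 raises an error.
try:
    raw_bytes.decode("gb2312")
    gb2312_ok = True
except UnicodeDecodeError:
    gb2312_ok = False

print(decoded, gb2312_ok)
```

If a real page mixes in even one such character, a detector can still report GB2312 with high confidence, because almost all of the bytes fit that encoding.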

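import requests
from bs4 import BeautifulSoup

all_url = ""  # target page URL (left blank in the original post)
# Hostreferer was not defined in the original post; this header dict is a placeholder.
Hostreferer = {"User-Agent": "Mozilla/5.0"}
start_html = requests.get(all_url, headers=Hostreferer)
# If the page is encoded as gb2312 or gbk, pass from_encoding="gb18030" to fix the garbled text;
# a utf8 page should also parse cleanly with this setting.
soup = BeautifulSoup(start_html.content, "html.parser", from_encoding="gb18030")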
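As for why a utf8 page can survive from_encoding="gb18030": Beautiful Soup appears to treat the encoding you pass as a first guess, decoding strictly with it and moving on to other candidates (the declared charset, a detector's guess, utf-8) if that fails. The sketch below checks only the first half of that story, using the standard library: the utf8 bytes of a small, made-up HTML fragment are not valid gb18030, so a strict decode raises instead of silently producing mojibake. It does not show that every utf8 page fails this way.

```python
# A tiny made-up fragment, encoded the way a utf8 page would be served.
fragment = "<p>中</p>"
utf8_bytes = fragment.encode("utf-8")


def try_decode(data: bytes, encoding: str):
    """Return the decoded text, or None if strict decoding fails."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError:
        return None


as_gb18030 = try_decode(utf8_bytes, "gb18030")  # strict decode with the wrong codec
as_utf8 = try_decode(utf8_bytes, "utf-8")       # what a fallback candidate would yield

print(as_gb18030, as_utf8)
```

Here the gb18030 attempt returns None while the utf-8 attempt returns the original fragment, which is consistent with a try-then-fall-back strategy giving clean output.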

Here is the chardet approach as well, for reference:

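import urllib.request
import chardet
from bs4 import BeautifulSoup

all_url = ""  # target page URL (left blank in the original post)
raw_html = urllib.request.urlopen(all_url).read()  # fetch once and reuse the bytes below
charset1 = chardet.detect(raw_html)
print(charset1)
# Output: {'encoding': 'GB2312', 'confidence': 0.99, 'language': 'Chinese'}
bmfs = charset1['encoding']
print(bmfs)
# Output: GB2312

# This is the call that still produced garbled text for this page.
soup = BeautifulSoup(raw_html, "html.parser", from_encoding=bmfs)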
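One way to combine the two approaches is to keep chardet for detection but widen any gb2312 or gbk result to gb18030 before handing it to BeautifulSoup. The helper name widen_chinese_encoding below is hypothetical and not part of chardet or bs4. The sketch assumes the detector returns names such as 'GB2312' or 'GBK', or None when it cannot decide.

```python
from typing import Optional

# Encodings that gb18030 is a superset of (compared case-insensitively).
_GB_FAMILY = {"gb2312", "gbk"}


def widen_chinese_encoding(detected: Optional[str]) -> Optional[str]:
    """Map gb2312/gbk to gb18030; pass every other value through unchanged."""
    if detected is not None and detected.lower() in _GB_FAMILY:
        return "gb18030"
    return detected


# Hypothetical usage, where page_bytes holds the fetched page
# and bmfs is the encoding name reported by chardet:
#   soup = BeautifulSoup(page_bytes, "html.parser",
#                        from_encoding=widen_chinese_encoding(bmfs))
print(widen_chinese_encoding("GB2312"))  # gb18030
```

Passing non-Chinese results such as utf-8 through untouched keeps the helper safe to apply to every page, not just the ones known to be Chinese.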