Handling decode('utf-8') errors when processing garbled Chinese text in Python

Date: 2019-11-13


When writing a Python script that processes Chinese text (in this case garbled Chinese), calling decode('utf-8') keeps raising an error:

>>> txt_from = open('/home/love/ex130705.log')
>>> txt_from_iter= iter(txt_from)
>>> txt_proc = txt_from_iter.next().decode('utf-8')

 Traceback (most recent call last):
  File "/tmp/py4049kjX", line 41, in <module>
    txt_proc = txt_from_iter.next().decode('utf-8')
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode bytes in position 84-85: invalid continuation byte

Parts of the source file to be processed display as garbled text:

2013-07-05 04:20:10 192.168.1.5 GET /Portals/0/鏁欒偛淇℃伅鏂囦歡澶校園<E5><BF> 80 - 25.XXX.10.99 Mozilla/4.0+(compatible;+MSIE+8.0;+Windows+NT+5.1;+Trident/4.0;+Alexa+Toolbar) 404 0 2 234

2013-07-05 04:20:24 192.168.1.5 GET /Portals/0/鏁欒偛淇℃伅鏂囦歡澶校園<E5><BF> 80 - 25.XXX.10.99 Mozilla/4.0+(compatible;+MSIE+8.0;+Windows+NT+5.1;+Trident/4.0;+Alexa+Toolbar) 404 0 2 296

These garbled Chinese characters were produced by IIS while it was writing the log. When Python decodes such a line with decode('utf-8'), it raises UnicodeDecodeError.
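The failure can be reproduced without the log file. The following is a minimal sketch, assuming the `<E5><BF>` shown in the log excerpt stands for the raw bytes 0xE5 0xBF, which start a three-byte UTF-8 sequence whose last byte is missing. The sample byte string and the helper name `try_strict_decode` are made up for illustration.

```python
# Hypothetical sample: 0xE5 0xBF begins a three-byte UTF-8 sequence,
# but the next byte is a space (0x20), which is not a continuation byte.
SAMPLE = b'GET /Portals/0/\xe5\xbf 80 - 404'


def try_strict_decode(raw):
    """Decode with the default 'strict' handler; return (text, error)."""
    try:
        return raw.decode('utf-8'), None
    except UnicodeDecodeError as exc:
        return None, exc


text, error = try_strict_decode(SAMPLE)
if error is not None:
    # start and end mark the offending byte range, comparable to
    # "position 84-85" in the traceback above.
    print('decode failed at bytes %d-%d: %s'
          % (error.start, error.end - 1, error.reason))
```

The exception object carries the position of the bad bytes, which helps locate the damaged part of a long log line.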

Solution: use decode('utf-8', 'ignore'), which skips the bytes that cannot be decoded instead of raising an exception:

>>> txt_proc = txt_from_iter.next().decode('utf-8', 'ignore')
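When a whole file is read, the error handler can also be given when the file is opened, so that every line comes back already decoded. This is a sketch using `io.open`, which accepts `encoding` and `errors` in both Python 2.7 and Python 3. The function name `read_lines_lenient` and the temporary file are hypothetical stand-ins for the real log.

```python
import io
import os
import tempfile


def read_lines_lenient(path, errors='ignore'):
    """Return the file's lines as text, applying `errors` to invalid UTF-8."""
    with io.open(path, encoding='utf-8', errors=errors) as handle:
        return [line.rstrip(u'\n') for line in handle]


# Build a small stand-in log containing a truncated UTF-8 sequence.
fd, demo_path = tempfile.mkstemp(suffix='.log')
with os.fdopen(fd, 'wb') as out:
    out.write(b'GET /a/\xe5\xbf 80 404\n')
    out.write(b'GET /b/ 80 200\n')

lines = read_lines_lenient(demo_path)
os.remove(demo_path)
print(lines)
```

Note that 'ignore' drops the invalid bytes silently. Passing `errors='replace'` instead keeps a visible U+FFFD marker where data was lost.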

Check the help for decode:

>>> help("".decode)
decode(...)
    S.decode([encoding[,errors]]) -> object
    
    Decodes S using the codec registered for encoding. encoding defaults
    to the default encoding. errors may be given to set a different error
    handling scheme. Default is 'strict' meaning that encoding errors raise
    a UnicodeDecodeError. Other possible values are 'ignore' and 'replace'
    as well as any other name registered with codecs.register_error that is
    able to handle UnicodeDecodeErrors.
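The handlers named in the help text can be compared side by side. The sketch below also registers a custom handler through `codecs.register_error`. The sample bytes, the function `hex_placeholder`, and the handler name `'hexplaceholder'` are all invented for this illustration. The `<XX>` markers it produces imitate the `<E5><BF>` notation seen in the log excerpt.

```python
import codecs

RAW = b'ab\xe5\xbf cd'  # hypothetical sample with a truncated UTF-8 sequence


def hex_placeholder(exc):
    """Replace undecodable bytes with <XX> markers instead of dropping them."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = bytearray(exc.object[exc.start:exc.end])
    marker = u''.join(u'<%02X>' % byte for byte in bad)
    # Return the replacement text and the position at which decoding resumes.
    return marker, exc.end


# 'hexplaceholder' is a made-up handler name chosen for this sketch.
codecs.register_error('hexplaceholder', hex_placeholder)

ignored = RAW.decode('utf-8', 'ignore')         # invalid bytes are dropped
replaced = RAW.decode('utf-8', 'replace')       # invalid bytes become U+FFFD
marked = RAW.decode('utf-8', 'hexplaceholder')  # invalid bytes become <XX> markers
```

A custom handler like this keeps a record of which bytes were bad, which 'ignore' does not.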

Reference: http://blog.sina.com.cn/s/blog_8af1069601015et3.html