Handling garbled gb18030 text in Python 3

Date: 2019-12-30

Tags: python3, python, gb18030, garbled text. Category: Python


【Environment】

Windows 10 x64
Python 3.6.3

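【About the gb18030 encoding】

GB 18030 wiki: https://zh.wikipedia.org/wiki/GB_18030
Single byte: the value ranges from 0x00 to 0x7F.
Two bytes: the first byte ranges from 0x81 to 0xFE, the second byte from 0x40 to 0xFE (excluding 0x7F).
Four bytes: the first byte ranges from 0x81 to 0xFE, the second from 0x30 to 0x39, the third from 0x81 to 0xFE, and the fourth from 0x30 to 0x39.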
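
The three byte-range rules above can be written as a small structural check. This is only a sketch: the function name `gb18030_sequence_length` is made up for illustration, and it tests byte ranges only, not whether the sequence is actually mapped to a character.

```python
def gb18030_sequence_length(data: bytes, pos: int = 0) -> int:
    """Return 1, 2 or 4 if a structurally valid gb18030 sequence
    starts at data[pos]; return 0 otherwise (hypothetical helper)."""
    b1 = data[pos]
    if b1 <= 0x7F:
        return 1                      # single byte
    if not 0x81 <= b1 <= 0xFE:
        return 0                      # 0x80 and 0xFF are never lead bytes
    rest = data[pos + 1:pos + 4]      # up to three following bytes
    if len(rest) >= 1 and 0x40 <= rest[0] <= 0xFE and rest[0] != 0x7F:
        return 2                      # two-byte sequence
    if (len(rest) == 3
            and 0x30 <= rest[0] <= 0x39
            and 0x81 <= rest[1] <= 0xFE
            and 0x30 <= rest[2] <= 0x39):
        return 4                      # four-byte sequence
    return 0                          # truncated or out of range


two_byte = '中'.encode('gb18030')      # a common Han character
print(len(two_byte), gb18030_sequence_length(two_byte))
print(gb18030_sequence_length(b'\xff'))
```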

【Handling decode errors】

Error:

UnicodeDecodeError: 'gb18030' codec can't decode byte 0xff in position 129535: illegal multibyte sequence
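
To see what this exception carries, here is a minimal reproduction on a short in-memory byte string (the sample bytes are made up, not taken from the original file). It also shows two of Python's built-in error handlers, `replace` and `backslashreplace`, which may be enough when a custom handler is not needed.

```python
data = b'ab\xffcd'   # 0xFF is never a valid gb18030 byte

try:
    data.decode('gb18030')
except UnicodeDecodeError as exc:
    # exc.object holds the input bytes; start/end delimit the bad slice
    print(exc.encoding, exc.start, exc.end, exc.reason)
    bad = exc.object[exc.start:exc.end]

# Built-in handlers: substitute U+FFFD, or keep the byte as a \xNN escape
lossy = data.decode('gb18030', errors='replace')
escaped = data.decode('gb18030', errors='backslashreplace')
print(repr(lossy), repr(escaped))
```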

* bytes.decode
* codecs.register_error example
* Exception object: UnicodeDecodeError
* Option 1: a custom handler modeled on replace_errors:

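import codecs

# Handler for garbled gb18030 bytes
def WalkerGB18030ReplaceHandler(exc):
    # Debug output: where decoding failed and why
    print('exc.start: %d' % exc.start)
    print('exc.end: %d' % exc.end)
    print('exc.encoding: %s' % exc.encoding)
    print('exc.reason: %s' % exc.reason)
    text = ''
    # Iterating over bytes yields ints, so each bad byte becomes e.g. 0xFF
    for ch in exc.object[exc.start:exc.end]:
        print('ch:')
        print(ch)
        text += ('0x%02X' % ch)

    # Return the replacement text and the position where decoding resumes
    return (text, exc.end)

# Register the custom handler
codecs.register_error("myreplace", WalkerGB18030ReplaceHandler)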
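
A registered handler only takes effect when its name is passed as the `errors` argument. The sketch below shows that step with a quieter, hypothetical variant (`hex_replace_handler`, registered as `"hexreplace"`; neither name is from the original post) so that it runs on its own.

```python
import codecs


def hex_replace_handler(exc):
    """Hypothetical quiet variant: turn each undecodable byte into 0xNN text."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc                     # only decoding errors are handled here
    bad = exc.object[exc.start:exc.end]
    return ''.join('0x%02X' % b for b in bad), exc.end


codecs.register_error('hexreplace', hex_replace_handler)

# Pass the registered name via errors=...
text = b'ab\xffcd'.decode('gb18030', errors='hexreplace')
print(text)
```

The handler registered above as "myreplace" is used the same way, and `open(path, encoding='gb18030', errors='myreplace')` accepts the same name when reading a file in text mode.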

* Option 2: custom byte-level cleanup

# Repair a gb18030 file
# Garbled bytes become a hex string, e.g. b'\xff' becomes the string 0xFF
# Non-printable single bytes become a hex string, e.g. b'\x7f' becomes the string 0x7F
# srcFile is the original gb18030 file
# dstFile is the repaired gb18030 file
# explicit controls whether non-printable characters are converted:
# False (the default) leaves them alone, anything else converts them
def RepairGB18030File(srcFile, dstFile, explicit=False):
    with open(srcFile, mode='rb') as fin:
        byteText = fin.read()
    byteLength = len(byteText)
    print('byteLength: %d' % byteLength)

    pos = 0     # current position
    byteList = list()
    # Append two \r\n pairs so that reading pos+1 .. pos+3 never overflows
    byteText += b'\x0d\x0a\x0d\x0a'
    while pos < byteLength:
        byte1 = bytes([byteText[pos]])
        byte2 = bytes([byteText[pos+1]])
        byte3 = bytes([byteText[pos+2]])
        byte4 = bytes([byteText[pos+3]])

        # Single-byte character (valid)
        if b'\x00' <= byte1 <= b'\x7f':
            pos += 1
            if byte1.decode('gb18030').isprintable():  # printable character
                byteList.append(byte1)
                continue

            if byte1 in (b'\x0d', b'\x0a'):  # line break
                byteList.append(byte1)
                continue

            if explicit:    # convert non-printable characters
                byteNew = ("0x%02X" % ord(byte1)).encode('gb18030')
                byteList.append(byteNew)
            else:           # leave non-printable characters as they are
                byteList.append(byte1)

        # Multi-byte character (two or four bytes)
        elif b'\x81' <= byte1 <= b'\xfe':
            # Two bytes (valid)
            if (b'\x40' <= byte2 <= b'\x7e') or (b'\x80' <= byte2 <= b'\xfe'):
                pos += 2
                byteList.extend([byte1, byte2])
                continue

            # Four bytes
            if b'\x30' <= byte2 <= b'\x39':
                # Four bytes (valid): byte3 and byte4 must both be in range
                if (b'\x81' <= byte3 <= b'\xfe') and (b'\x30' <= byte4 <= b'\x39'):
                    pos += 4
                    byteList.extend([byte1, byte2, byte3, byte4])
                    continue

                # Garbled four-byte sequence
                pos += 1    # on error, advance by one byte only
                byteNew = ("0x%02X" % ord(byte1)).encode('gb18030')
                byteList.append(byteNew)
                continue

            # Garbled two-byte sequence
            # (second byte is 0x00-0x2f, 0x7f or 0xff)
            pos += 1    # on error, advance by one byte only
            byteNew = ("0x%02X" % ord(byte1)).encode('gb18030')
            byteList.append(byteNew)
        else:
            # Garbled single byte
            # Only 0x80 and 0xff should reach this branch
            byteNew = ("0x%02X" % ord(byte1)).encode('gb18030')  # 4 bytes
            pos += 1    # on error, advance by one byte only
            byteList.append(byteNew)

    # Decoding the cleaned bytes doubles as a validity check
    repairedText = b''.join(byteList).decode('gb18030')

    # newline='' writes the existing \r\n as-is (no extra \r on Windows)
    with open(dstFile, mode='w', encoding='gb18030', newline='') as fout:
        fout.write(repairedText)
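
After a repair pass it can be worth confirming that the output really decodes cleanly. The helper below is a hypothetical addition (`decodes_as_gb18030` is not part of the original post); it simply attempts a strict decode and reports the result, demonstrated here on two throwaway temp files.

```python
import os
import tempfile


def decodes_as_gb18030(path):
    """Hypothetical check: True if the whole file is valid gb18030."""
    with open(path, 'rb') as f:
        data = f.read()
    try:
        data.decode('gb18030')        # strict decoding raises on bad bytes
    except UnicodeDecodeError:
        return False
    return True


with tempfile.TemporaryDirectory() as tmp:
    good = os.path.join(tmp, 'good.txt')
    bad = os.path.join(tmp, 'bad.txt')
    with open(good, 'wb') as f:
        f.write('中文 ok\r\n'.encode('gb18030'))
    with open(bad, 'wb') as f:
        f.write(b'ab\xffcd\r\n')      # stray 0xFF makes this file invalid
    good_ok = decodes_as_gb18030(good)
    bad_ok = decodes_as_gb18030(bad)

print(good_ok, bad_ok)
```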

【Related reading】

1. About encodings in Python 3

2. Looking up Chinese character set encodings

*** walker's running log ***