At first I noticed the following behavior:
mysql> show variables like 'character%';
+--------------------------+---------------------------------------+
| Variable_name            | Value                                 |
+--------------------------+---------------------------------------+
| character_set_client     | latin1                                |
| character_set_connection | latin1                                |
| character_set_database   | latin1                                |
| character_set_filesystem | binary                                |
| character_set_results    | latin1                                |
| character_set_server     | utf8mb4                               |
| character_set_system     | utf8                                  |
| character_sets_dir       | /opt/mysql/server-5.6/share/charsets/ |
+--------------------------+---------------------------------------+

mysql> show create table t4\G
*************************** 1. row ***************************
       Table: t4
Create Table: CREATE TABLE `t4` (
  `data` varchar(100) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1

mysql> insert into t4 select '\U+1F600';
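This struck me as odd: how can latin1 support emoji characters? Isn't utf8mb4 the only character set that supports them? So I asked on Stack Overflow, and one user's answer seemed reasonable to me: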
I think you saved into and retrieved from the database a string of bytes that is interpreted by the terminal as an Unicode character. Check the output of SELECT LENGTH(data), CHAR_LENGTH(data) FROM t4 to see what's happening. They should return different values for multi-byte characters and the same value for latin1. – axiac 19 hours ago
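The distinction the answer draws between LENGTH and CHAR_LENGTH can be sketched in Java without a database. This is only an illustration, not the server's implementation: it assumes the emoji is U+1F600 (as in the insert statement above) and uses Java's ISO-8859-1 as a stand-in for MySQL's latin1. The class and method names are made up for this sketch.

```java
import java.nio.charset.StandardCharsets;

class LengthVsCharLength {
    // UTF-8 bytes of U+1F600, the emoji used in the insert statement.
    static byte[] emojiUtf8Bytes() {
        return new String(Character.toChars(0x1F600)).getBytes(StandardCharsets.UTF_8);
    }

    // Character count when every byte is read as one latin1 character
    // (ISO-8859-1 is used here as an approximation of MySQL's latin1).
    static int charLengthAsLatin1(byte[] bytes) {
        return new String(bytes, StandardCharsets.ISO_8859_1).length();
    }

    // Character (code point) count when the same bytes are read as UTF-8.
    static int charLengthAsUtf8(byte[] bytes) {
        String s = new String(bytes, StandardCharsets.UTF_8);
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        byte[] bytes = emojiUtf8Bytes();
        System.out.println("byte length:       " + bytes.length);               // 4
        System.out.println("chars as latin1:   " + charLengthAsLatin1(bytes));  // 4
        System.out.println("chars as utf-8:    " + charLengthAsUtf8(bytes));    // 1
    }
}
```

Read as latin1, the byte count and the character count are the same, which is what the answer predicts for the latin1 column; only a multi-byte interpretation makes them differ.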
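On top of that, I happened to come across a blog post that said:
Here is a question: with a latin1 table, will users run into problems writing and reading Chinese characters? The answer is no, as long as things are configured sensibly. Suppose SecureCRT is set to UTF8, and character_set_client and the table character set are both latin1. Following the analysis in section 3, no character set conversion is involved when the user reads or writes data: the UTF8 Chinese characters are written to the database as a binary stream, and when they are read back, SecureCRT decodes those bytes into the same Chinese characters, so the user is not affected.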
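So I now think the behavior above is perfectly normal.
The operating system's default character set is utf8 (LANG=en_US.UTF-8), while client, connection and database are all latin1. Along the whole path (from running the insert in the terminal to storing the data in the table) no encoding conversion takes place; what is transmitted is the raw utf8-encoded byte stream.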
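The pass-through described here can be sketched in Java. This is a simplified model, not what the server literally runs: it assumes the emoji is U+1F600 and again uses ISO-8859-1 in place of MySQL's latin1; `storeAndFetch` is a hypothetical helper name.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Latin1PassThrough {
    // Models writing bytes into a latin1 column and reading them back when
    // client, connection and column are all latin1: each byte becomes one
    // latin1 character and then the same byte again.
    static byte[] storeAndFetch(byte[] sent) {
        String stored = new String(sent, StandardCharsets.ISO_8859_1);
        return stored.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        byte[] sent = new String(Character.toChars(0x1F600)).getBytes(StandardCharsets.UTF_8);
        byte[] fetched = storeAndFetch(sent);

        // The bytes survive unchanged ...
        System.out.println(Arrays.equals(sent, fetched)); // true
        // ... so a UTF-8 terminal decodes them back to the original code point.
        String shown = new String(fetched, StandardCharsets.UTF_8);
        System.out.println(Integer.toHexString(shown.codePointAt(0))); // 1f600
    }
}
```

Because every byte value maps to exactly one character in this charset, the column never needs to know that the bytes were UTF-8 in the first place.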
How can this conclusion be verified? I decided to change the character set of an intermediate step and see what happens.
mysql> set names gbk;

mysql> show variables like 'character%';
+--------------------------+---------------------------------------+
| Variable_name            | Value                                 |
+--------------------------+---------------------------------------+
| character_set_client     | gbk                                   |
| character_set_connection | gbk                                   |
| character_set_database   | latin1                                |
| character_set_filesystem | binary                                |
| character_set_results    | gbk                                   |
| character_set_server     | utf8mb4                               |
| character_set_system     | utf8                                  |
| character_sets_dir       | /opt/mysql/server-5.6/share/charsets/ |
+--------------------------+---------------------------------------+

mysql> insert into t4 select '\U+1F600';
ERROR 1366 (HY000): Incorrect string value: '\xF0\x9F\x98\x80' for column 'data' at row 1
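Analysis:
The operating system now uses utf8, client and connection are gbk, and the column is latin1. The data starts out as a utf8 byte stream, and since client and connection are both gbk, no transcoding is needed between them. A conversion is needed only at the final step, when the value is stored in the table column and has to be converted to latin1. latin1 cannot represent this byte stream, which causes the error above.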
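One way to check this reading is to ask Java whether the characters obtained by reading the bytes as gbk can be encoded in latin1 at all. This is a hedged sketch: it assumes the server treats the incoming bytes as gbk text (per character_set_client), that the JDK in use ships the GBK charset, and that ISO-8859-1 is a fair stand-in for MySQL's latin1. The helper names are hypothetical.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class Latin1EncodeCheck {
    // Reads raw bytes as GBK text, the way a gbk client setting would label them.
    static String readAsGbk(byte[] bytes) {
        return new String(bytes, Charset.forName("GBK"));
    }

    // True if every character of the text can be encoded in the target charset.
    static boolean fitsIn(String text, Charset target) {
        return target.newEncoder().canEncode(text);
    }

    public static void main(String[] args) {
        byte[] sent = new String(Character.toChars(0x1F600)).getBytes(StandardCharsets.UTF_8);
        String asGbk = readAsGbk(sent);

        // The gbk reading yields CJK characters, which latin1 has no room for.
        System.out.println(fitsIn(asGbk, StandardCharsets.ISO_8859_1)); // false
        // A utf8 column, by contrast, could hold the same characters.
        System.out.println(fitsIn(asGbk, StandardCharsets.UTF_8));      // true
    }
}
```

The check only says whether a conversion would be lossy; whether the server then reports an error or silently substitutes '?' is a separate question that this sketch does not model.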
What happens if the character set mismatch is moved one step earlier? As shown below:
mysql> set character_set_connection = latin1;

mysql> show variables like 'character%';
+--------------------------+---------------------------------------+
| Variable_name            | Value                                 |
+--------------------------+---------------------------------------+
| character_set_client     | gbk                                   |
| character_set_connection | latin1                                |
| character_set_database   | latin1                                |
| character_set_filesystem | binary                                |
| character_set_results    | gbk                                   |
| character_set_server     | utf8mb4                               |
| character_set_system     | utf8                                  |
| character_sets_dir       | /opt/mysql/server-5.6/share/charsets/ |
+--------------------------+---------------------------------------+
Now client and connection differ, which means the data has to go through utf8 --> gbk --> latin1. Can the emoji still be inserted?
mysql> insert into t4 select '\U+1F600';
It can. The query result is as follows:
mysql> select data,hex(data) from t4;
+------+-----------+
| data | hex(data) |
+------+-----------+
| ??   | 3F3F      |
+------+-----------+
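It seems that along the utf8 --> gbk --> latin1 path, the utf8-encoded byte stream (f0 9f 98 80) ended up as '??', and '??' is something latin1 can store. But how can this conversion be simulated in a Java program?
After some experimenting, I found that the following code reproduces the database behavior: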
/**
 * os                       utf-8
 * character_set_client     gbk
 * character_set_connection latin1
 * field                    latin1
 *
 * @throws UnsupportedEncodingException
 */
@Test
public void test_os_utf8_to_client_gbk_to_connection_latin1() throws UnsupportedEncodingException {
    String emoji = ...; // the blog system does not support emoji characters, so an ellipsis stands in for it

    // os(utf-8) --> client(gbk)
    String receivedStr = new String(emoji.getBytes("utf-8"), "gbk");
    System.out.println(receivedStr); // 馃榾

    // When client and connection differ, the connection character set is used for the conversion.
    // client(gbk) --> connection(latin1)
    String convertedStr = new String(receivedStr.getBytes("latin1"), "latin1");
    System.out.println(convertedStr); // ??
    printHexString(convertedStr.getBytes("latin1")); // 3f 3f
}
What if client and connection from the previous example swap places, as shown below:
mysql> show variables like 'character%';
+--------------------------+---------------------------------------+
| Variable_name            | Value                                 |
+--------------------------+---------------------------------------+
| character_set_client     | latin1                                |
| character_set_connection | gbk                                   |
| character_set_database   | latin1                                |
| character_set_filesystem | binary                                |
| character_set_results    | gbk                                   |
| character_set_server     | utf8mb4                               |
| character_set_system     | utf8                                  |
| character_sets_dir       | /opt/mysql/server-5.6/share/charsets/ |
+--------------------------+---------------------------------------+
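The conversion chain now becomes utf8 --> latin1 --> gbk --> latin1. Going by the earlier experience, one would predict an error at the very first conversion step (Incorrect string value: '\xF0\x9F\x98\x80' for column 'data' at row 1), but what actually happens is: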
mysql> insert into t4 select '\U+1F600';
Query OK, 1 row affected (0.01 sec)

mysql> select data,hex(data) from t4;
+------+-----------+
| data | hex(data) |
+------+-----------+
| ??   | 3F3F      |
| ???? | 3F3F3F3F  |
+------+-----------+
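No error was reported and the insert still succeeded. It seems that an error is raised only at the final step of writing the row into the table and not at the earlier conversions. This time, however, the result is four question marks.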
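The difference between two and four question marks can be explored with a small conversion-chain model. This is a sketch under stated assumptions, not the server's actual code path: it assumes the incoming bytes are read as character_set_client, converted to character_set_connection, and then converted to the column character set; it assumes unmappable characters are replaced by '?' (which is what Java's `String.getBytes` does) and does not model error 1366; it uses ISO-8859-1 for latin1 and relies on the JDK providing GBK. `convert` and `hex` are hypothetical helpers.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ConversionPipeline {
    // sent bytes --(read as client)--> text --(to connection)--> text --(to column)--> stored bytes
    static byte[] convert(byte[] sent, Charset client, Charset connection, Charset column) {
        String asClient = new String(sent, client);             // bytes are labelled with the client charset
        byte[] onConnection = asClient.getBytes(connection);    // unmappable characters become '?'
        String asConnection = new String(onConnection, connection);
        return asConnection.getBytes(column);                   // final conversion into the column charset
    }

    // Formats bytes as space-separated lowercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        Charset latin1 = StandardCharsets.ISO_8859_1;
        byte[] emoji = new String(Character.toChars(0x1F600)).getBytes(StandardCharsets.UTF_8);

        // client=gbk: the 4 bytes form 2 double-byte characters, so 2 replacements.
        System.out.println(hex(convert(emoji, gbk, latin1, latin1))); // 3f 3f
        // client=latin1: the 4 bytes form 4 single-byte characters, so 4 bytes come out.
        System.out.println(hex(convert(emoji, latin1, gbk, latin1)));
    }
}
```

Under this model the number of question marks simply reflects how many characters the client character set sees in the four bytes: two when they are read as gbk, four when they are read as latin1.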
The corresponding Java simulation for this case is as follows:
/**
 * os                       utf-8
 * character_set_client     latin1
 * character_set_connection gbk
 * field                    latin1
 *
 * @throws UnsupportedEncodingException
 */
@Test
public void test_os_utf8_to_client_latin1_to_connection_gbk_to_field_latin1() throws UnsupportedEncodingException {
    String emoji = ...;

    // os(utf-8) --> client(latin1)
    String receivedStr = new String(emoji.getBytes("utf-8"), "latin1");
    System.out.println(receivedStr);

    // When client and connection differ, the connection character set is used throughout.
    // client(latin1) --> connection(gbk)
    String convertedStr = new String(receivedStr.getBytes("gbk"), "gbk");
    System.out.println(convertedStr); // ????

    // connection(gbk) --> field(latin1)
    String savedStr = new String(convertedStr.getBytes("gbk"), "latin1");
    System.out.println(savedStr); // ????
    printHexString(savedStr.getBytes("latin1")); // 3f 3f 3f 3f
}
Consider one more case: what if the column's character set is utf8? As shown below:
mysql> show create table t6\G
*************************** 1. row ***************************
       Table: t6
Create Table: CREATE TABLE `t6` (
  `data` varchar(100) CHARACTER SET utf8 DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1

mysql> show variables like 'character%';
+--------------------------+---------------------------------------+
| Variable_name            | Value                                 |
+--------------------------+---------------------------------------+
| character_set_client     | gbk                                   |
| character_set_connection | gbk                                   |
| character_set_database   | latin1                                |
| character_set_filesystem | binary                                |
| character_set_results    | gbk                                   |
| character_set_server     | utf8mb4                               |
| character_set_system     | utf8                                  |
| character_sets_dir       | /opt/mysql/server-5.6/share/charsets/ |
+--------------------------+---------------------------------------+
Will the insert succeed? Or will it also fail with the Incorrect string value error seen in the earlier case?
mysql> insert into t6 select '\U+1F600';
Query OK, 1 row affected (0.00 sec)
This time the insert succeeds, but the stored bytes are no longer f09f9880; they are e9a683e6a6be. The corresponding Java simulation is:
/**
 * os                       utf-8
 * character_set_client     gbk
 * character_set_connection gbk
 * field                    utf8
 *
 * @throws UnsupportedEncodingException
 */
@Test
public void test_os_utf8_to_gbk_to_field_utf8() throws UnsupportedEncodingException {
    String emoji = ...;

    // os(utf-8) --> client(gbk)
    String receivedStr = new String(emoji.getBytes("utf-8"), "gbk");
    System.out.println(receivedStr); // 馃榾

    // When client and connection are the same, the default character set is used
    // (getBytes() without an argument uses the platform default, utf-8 on this machine).
    // connection(gbk) --> field(utf8)
    String savedStr = new String(receivedStr.getBytes(), "utf-8");
    System.out.println(savedStr); // 馃榾
    printHexString(savedStr.getBytes("utf-8")); // e9 a6 83 e6 a6 be
}
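The stored bytes e9a683e6a6be can also be derived directly, and the misreading can be undone, which suggests that no information is lost in this case. The sketch below is an illustration under assumptions: the emoji is taken to be U+1F600, the JDK is assumed to provide the GBK charset, and `storedBytes`, `recover` and `hex` are hypothetical helper names.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class GbkMisreadRoundTrip {
    static final Charset GBK = Charset.forName("GBK");

    // Bytes that end up in a utf8 column when UTF-8 input is labelled as gbk:
    // read the bytes as GBK text, then encode that text as UTF-8.
    static byte[] storedBytes(byte[] sent) {
        return new String(sent, GBK).getBytes(StandardCharsets.UTF_8);
    }

    // Reverses the misreading: decode the column bytes as UTF-8, then
    // re-encode the text as GBK to get the originally sent bytes back.
    static byte[] recover(byte[] stored) {
        return new String(stored, StandardCharsets.UTF_8).getBytes(GBK);
    }

    // Formats bytes as space-separated lowercase hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) sb.append(' ');
            sb.append(String.format("%02x", b & 0xff));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] sent = new String(Character.toChars(0x1F600)).getBytes(StandardCharsets.UTF_8);
        byte[] stored = storedBytes(sent);
        System.out.println(hex(stored));                         // e9 a6 83 e6 a6 be
        System.out.println(Arrays.equals(recover(stored), sent)); // true
    }
}
```

This is the opposite of the latin1 cases above, where the replacement by '?' discards the original bytes for good.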
Now use the same simulation for the failing case, where the field is latin1:
/**
 * os                       utf-8
 * character_set_client     gbk
 * character_set_connection gbk
 * field                    latin1
 *
 * @throws UnsupportedEncodingException
 */
@Test
public void test_os_utf8_to_gbk_to_field_latin1() throws UnsupportedEncodingException {
    String emoji = ...;

    // os(utf-8) --> client(gbk)
    String receivedStr = new String(emoji.getBytes("utf-8"), "gbk");
    System.out.println(receivedStr); // 馃榾

    // When client and connection are the same, the default character set is used
    // (getBytes() without an argument uses the platform default, utf-8 on this machine).
    // connection(gbk) --> field(latin1)
    String savedStr = new String(receivedStr.getBytes(), "latin1");
    System.out.println(savedStr);
    printHexString(savedStr.getBytes("latin1")); // e9 a6 83 e6 a6 be
}
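In the simulation, the bytes finally saved to the field are the same in both cases, namely e9a683e6a6be. So why does the error occur only when the field's character set is latin1? And why is the error message
ERROR 1366 (HY000): Incorrect string value: '\xF0\x9F\x98\x80' for column 'data' at row 1
rather than
Incorrect string value: '\xE9\xA6\x83\xE6\xA6\xBE'
?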
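Addendum:
Summary of the rules the Java programs had to follow in order to reproduce the actual database behavior:
If the client, connection and field character sets are all the same, what gets stored is simply the byte stream produced by the operating system's default character set.
If client and connection are the same but field differs, the operating system's character set is used when converting from connection to the field's character set.
If client and connection differ, the connection character set is used.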
Supplementary unit test code:
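// Prints each byte as a two-digit hex value followed by a space, then a blank line.
private void printHexString(byte[] bytes) throws UnsupportedEncodingException {
    for (byte b : bytes) {
        System.out.print(byteToHexStr(b) + " ");
    }
    System.out.println("\n");
}

// Converts one signed byte to a two-digit lowercase hex string.
private String byteToHexStr(byte b) {
    int i = b;
    if (i < 0) {
        i = 256 + i; // map a negative signed byte to its unsigned value
    }
    String hex = Integer.toHexString(i);
    if (hex.length() == 1) {
        return "0" + hex;
    } else {
        return hex;
    }
}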
References:
https://dev.mysql.com/doc/refman/5.0/en/charset-connection.html