The illusion that MySQL latin1 also supports emoji characters: an analysis

It started when I noticed the following behavior:

mysql> show variables like 'character%';
+--------------------------+---------------------------------------+
| Variable_name            | Value                                 |
+--------------------------+---------------------------------------+
| character_set_client     | latin1                                |
| character_set_connection | latin1                                |
| character_set_database   | latin1                                |
| character_set_filesystem | binary                                |
| character_set_results    | latin1                                |
| character_set_server     | utf8mb4                               |
| character_set_system     | utf8                                  |
| character_sets_dir       | /opt/mysql/server-5.6/share/charsets/ |
+--------------------------+---------------------------------------+
mysql> show create table t4\G
*************************** 1. row ***************************
   Table: t4
Create Table: CREATE TABLE `t4` (
  `data` varchar(100) DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1
mysql> insert into t4 select '\U+1F600';

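This struck me as odd: how can latin1 support emoji characters? Isn't utf8mb4 the only character set that does? So I asked on Stack Overflow, and one user's answer seemed reasonable to me: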

I think you saved into and retrieved from the database a string of bytes that is interpreted by the terminal as an Unicode character. Check the output of SELECT LENGTH(data), CHAR_LENGTH(data) FROM t4 to see what's happening. They should return different values for multi-byte characters and the same value for latin1. – axiac
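The LENGTH versus CHAR_LENGTH check suggested in that comment can be imitated outside the database. The sketch below is only an illustration in Java, not something taken from the Stack Overflow thread: the class and method names are made up, Java's ISO-8859-1 stands in for MySQL's latin1, and the emoji is assumed to be U+1F600 (the code point used in the insert statements of this post).

```java
import java.nio.charset.StandardCharsets;

class LengthVsCharLength {
    // Character count when every stored byte is treated as one latin1 character.
    static int charLengthAsLatin1(byte[] stored) {
        return new String(stored, StandardCharsets.ISO_8859_1).length();
    }

    // Character count (in code points) when the same bytes are treated as UTF-8.
    static int charLengthAsUtf8(byte[] stored) {
        String s = new String(stored, StandardCharsets.UTF_8);
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // Assumption: the terminal sent U+1F600 encoded as UTF-8.
        byte[] stored = "\uD83D\uDE00".getBytes(StandardCharsets.UTF_8);
        System.out.println(stored.length);              // 4 bytes
        System.out.println(charLengthAsLatin1(stored)); // 4 characters
        System.out.println(charLengthAsUtf8(stored));   // 1 character
    }
}
```

Seen as latin1, the byte count and the character count are both 4, which is what the comment predicts for a latin1 column; only a UTF-8 reading of the same bytes gives one character.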

On top of that, I happened across a blog post that says:

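Here is a question: with a latin1 table, is there any problem when users write and read Chinese characters? The answer is that there is no problem as long as things are configured sensibly. Suppose SecureCRT is set to UTF8, and character_set_client and the table character set are both set to latin1. Following the analysis in section 3, no character-set conversion is involved when the user reads or writes data: the UTF8 Chinese characters are written to the database as a binary stream, and after they are fetched back, SecureCRT decodes those bytes into the corresponding Chinese characters, so the user is not affected.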

So the behavior above now looks perfectly normal to me.

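The operating system's default character set is utf8 (LANG=en_US.UTF-8), while client, connection and database are all latin1. As a result there is no encoding conversion anywhere along the path (from running the insert in the terminal to saving the data in the table); what travels through is simply the UTF-8-encoded byte stream.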
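The pass-through described above can be sketched in Java. This is a minimal illustration under stated assumptions, not MySQL's actual code path: Java's ISO-8859-1 stands in for latin1, the class and method names are hypothetical, and the emoji is assumed to be U+1F600.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class Latin1PassThrough {
    // Treat the incoming bytes as latin1 at every hop. Each byte maps to
    // exactly one character and back, so the bytes come out unchanged.
    static byte[] storeAsLatin1(byte[] clientBytes) {
        String asLatin1 = new String(clientBytes, StandardCharsets.ISO_8859_1);
        return asLatin1.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String emoji = "\uD83D\uDE00";                          // assumed input, U+1F600
        byte[] sent = emoji.getBytes(StandardCharsets.UTF_8);   // what a UTF-8 terminal sends
        byte[] stored = storeAsLatin1(sent);                    // what ends up in the column
        String shown = new String(stored, StandardCharsets.UTF_8); // the terminal decodes as UTF-8 again

        System.out.println(Arrays.equals(sent, stored)); // true
        System.out.println(emoji.equals(shown));         // true
    }
}
```

The emoji only appears to survive because the terminal reinterprets the stored bytes as UTF-8; the database itself holds four latin1 characters.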

How can this conclusion be verified? I decided to change the character set of the intermediate steps and see what happens.

mysql> set names gbk;
mysql> show variables like 'character%';
+--------------------------+---------------------------------------+
| Variable_name            | Value                                 |
+--------------------------+---------------------------------------+
| character_set_client     | gbk                                   |
| character_set_connection | gbk                                   |
| character_set_database   | latin1                                |
| character_set_filesystem | binary                                |
| character_set_results    | gbk                                   |
| character_set_server     | utf8mb4                               |
| character_set_system     | utf8                                  |
| character_sets_dir       | /opt/mysql/server-5.6/share/charsets/ |
+--------------------------+---------------------------------------+
mysql> insert into t4 select '\U+1F600';
ERROR 1366 (HY000): Incorrect string value: '\xF0\x9F\x98\x80' for column 'data' at row 1

Analysis:

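The operating system is now utf8, client and connection are gbk, and the column is latin1. The input starts out as a UTF-8 byte stream, and since client and connection are both gbk, no conversion happens between them. The only conversion is at the very end, when the value is saved into the latin1 column. latin1 cannot represent what those bytes decode to, which produces the error above.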
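One way to picture this failure is to check whether the bytes, once decoded with the connection character set, can be encoded as latin1 at all. The sketch below is an assumption-laden illustration: it uses Java's GBK and ISO-8859-1 charsets as stand-ins for MySQL's gbk and latin1, and the class and method names are invented for this example.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class ColumnWriteCheck {
    // Decode the bytes with the connection charset, then ask whether every
    // resulting character has a latin1 encoding.
    static boolean fitsInLatin1(byte[] connectionBytes, Charset connectionCharset) {
        String decoded = new String(connectionBytes, connectionCharset);
        return StandardCharsets.ISO_8859_1.newEncoder().canEncode(decoded);
    }

    public static void main(String[] args) {
        // The UTF-8 bytes of U+1F600, as quoted in the error message.
        byte[] sent = {(byte) 0xF0, (byte) 0x9F, (byte) 0x98, (byte) 0x80};

        // Read as gbk, the bytes form two Chinese characters with no latin1 equivalent.
        System.out.println(fitsInLatin1(sent, Charset.forName("GBK")));        // false
        // Read as latin1, every byte is already a latin1 character.
        System.out.println(fitsInLatin1(sent, StandardCharsets.ISO_8859_1));   // true
    }
}
```

Under this reading, the same four bytes are accepted or rejected depending only on the character set the connection claims for them.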

What happens if the character-set mismatch is moved one step earlier? Like this:

mysql> set character_set_connection = latin1;
mysql> show variables like 'character%';
+--------------------------+---------------------------------------+
| Variable_name            | Value                                 |
+--------------------------+---------------------------------------+
| character_set_client     | gbk                                   |
| character_set_connection | latin1                                |
| character_set_database   | latin1                                |
| character_set_filesystem | binary                                |
| character_set_results    | gbk                                   |
| character_set_server     | utf8mb4                               |
| character_set_system     | utf8                                  |
| character_sets_dir       | /opt/mysql/server-5.6/share/charsets/ |
+--------------------------+---------------------------------------+

Now client and connection differ, so the conversion path is utf8-->gbk-->latin1. Can the emoji character still be inserted?

mysql> insert into t4 select '\U+1F600';

It can. The query result is:

mysql> select data,hex(data) from t4;
+------+-----------+
| data | hex(data) |
+------+-----------+
| ??   | 3F3F      |
+------+-----------+

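It seems that somewhere along the utf8-->gbk step the UTF-8 byte stream (f0 9f 98 80) was turned into '??', and '??' is something latin1 can store. But how can this conversion be simulated with a Java program?

After some experimenting, I found that the code below reproduces the database behavior: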

/**
 * os: utf-8
 * character_set_client: gbk
 * character_set_connection: latin1
 * field: latin1
 *
 * @throws UnsupportedEncodingException
 */
@Test
public void test_os_utf8_to_client_gbk_to_connection_latin1() throws UnsupportedEncodingException {
	String emoji = ...; // this blog platform does not support emoji characters, so an ellipsis is used instead
	String receivedStr = new String(emoji.getBytes("utf-8"), "gbk"); // os(utf-8) --> client(gbk)
	System.out.println(receivedStr); // 馃榾
	/**
	 * If client and connection differ, the conversion uses the connection charset.
	 */
	String convertedStr = new String(receivedStr.getBytes("latin1"), "latin1"); // client(gbk) --> connection(latin1)
	System.out.println(convertedStr); // ??
	printHexString(convertedStr.getBytes("latin1")); // 3f 3f
}

What if client and connection from the previous example are swapped? Like this:

mysql> show variables like 'character%';
+--------------------------+---------------------------------------+
| Variable_name            | Value                                 |
+--------------------------+---------------------------------------+
| character_set_client     | latin1                                |
| character_set_connection | gbk                                   |
| character_set_database   | latin1                                |
| character_set_filesystem | binary                                |
| character_set_results    | gbk                                   |
| character_set_server     | utf8mb4                               |
| character_set_system     | utf8                                  |
| character_sets_dir       | /opt/mysql/server-5.6/share/charsets/ |
+--------------------------+---------------------------------------+

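The conversion path now becomes utf8-->latin1-->gbk-->latin1. Based on the earlier experience, one might predict an error at the very first conversion (Incorrect string value: '\xF0\x9F\x98\x80' for column 'data' at row 1), but what actually happens is: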

mysql> insert into t4 select '\U+1F600';
Query OK, 1 row affected (0.01 sec)
mysql> select data,hex(data) from t4;
+------+-----------+
| data | hex(data) |
+------+-----------+
| ??   | 3F3F      |
| ???? | 3F3F3F3F  |
+------+-----------+

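No error is raised and the insert still succeeds. It seems that an error only occurs when the failing step is the final write into the table. This time, though, the result is four question marks.

The corresponding Java simulation is: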

/**
 * os: utf-8
 * character_set_client: latin1
 * character_set_connection: gbk
 * field: latin1
 *
 * @throws UnsupportedEncodingException
 */
@Test
public void test_os_utf8_to_client_latin1_to_connection_gbk_to_field_latin1() throws UnsupportedEncodingException {
	String emoji = ...;
	String receivedStr = new String(emoji.getBytes("utf-8"), "latin1"); // os(utf-8) --> client(latin1)
	System.out.println(receivedStr); // "ð" followed by three C1 control characters (U+009F, U+0098, U+0080)
	// If client and connection differ, the conversion uses the connection charset.
	String convertedStr = new String(receivedStr.getBytes("gbk"), "gbk"); // client(latin1) --> connection(gbk)
	System.out.println(convertedStr); // ????
	String savedStr = new String(convertedStr.getBytes("gbk"), "latin1"); // connection(gbk) --> field(latin1)
	System.out.println(savedStr); // ????
	printHexString(savedStr.getBytes("latin1")); // 3f 3f 3f 3f
}


One more case: what if the column's character set is utf8? Like this:

mysql> show create table t6\G
*************************** 1. row ***************************
       Table: t6
Create Table: CREATE TABLE `t6` (
  `data` varchar(100) CHARACTER SET utf8 DEFAULT NULL
) ENGINE=InnoDB DEFAULT CHARSET=latin1
mysql> show variables like 'character%';
+--------------------------+---------------------------------------+
| Variable_name            | Value                                 |
+--------------------------+---------------------------------------+
| character_set_client     | gbk                                   |
| character_set_connection | gbk                                   |
| character_set_database   | latin1                                |
| character_set_filesystem | binary                                |
| character_set_results    | gbk                                   |
| character_set_server     | utf8mb4                               |
| character_set_system     | utf8                                  |
| character_sets_dir       | /opt/mysql/server-5.6/share/charsets/ |
+--------------------------+---------------------------------------+

Can the insert succeed? Or will it also fail with the Incorrect string value error seen above?

mysql> insert into t6 select '\U+1F600';
Query OK, 1 row affected (0.00 sec)

This time the insert succeeds, but the stored bytes are no longer f09f9880; they are e9a683e6a6be. The corresponding Java simulation is:

/**
 * os: utf-8
 * character_set_client: gbk
 * character_set_connection: gbk
 * field: utf8
 *
 * @throws UnsupportedEncodingException
 */
@Test
public void test_os_utf8_to_gbk_to_field_utf8() throws UnsupportedEncodingException {
	String emoji = ...;
	String receivedStr = new String(emoji.getBytes("utf-8"), "gbk"); // os(utf-8) --> client(gbk)
	System.out.println(receivedStr); // 馃榾
	/**
	 * If client and connection match, use the default (platform) charset.
	 */
	String savedStr = new String(receivedStr.getBytes(), "utf-8"); // connection(gbk) --> field(utf8)
	System.out.println(savedStr); // 馃榾
	printHexString(savedStr.getBytes("utf-8")); // e9 a6 83 e6 a6 be
}

Now use the same kind of simulation for the failing case, where the field is latin1:

/**
 * os: utf-8
 * character_set_client: gbk
 * character_set_connection: gbk
 * field: latin1
 *
 * @throws UnsupportedEncodingException
 */
@Test
public void test_os_utf8_to_gbk_to_field_latin1() throws UnsupportedEncodingException {
	String emoji = ...;
	String receivedStr = new String(emoji.getBytes("utf-8"), "gbk"); // os(utf-8) --> client(gbk)
	System.out.println(receivedStr); // 馃榾
	/**
	 * If client and connection match, use the default (platform) charset.
	 */
	String savedStr = new String(receivedStr.getBytes(), "latin1"); // connection(gbk) --> field(latin1)
	System.out.println(savedStr); // the bytes e9 a6 83 e6 a6 be shown as six latin1 characters
	printHexString(savedStr.getBytes("latin1")); // e9 a6 83 e6 a6 be
}

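In the simulation, the byte stream that ends up in the field is the same in both cases: e9a683e6a6be. So why is an error raised only when the column's character set is latin1? And why is the error message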

ERROR 1366 (HY000): Incorrect string value: '\xF0\x9F\x98\x80' for column 'data' at row 1

rather than

Incorrect string value: '\xE9\xA6\x83\xE6\xA6\xBE'

?

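Addendum:

Summary of how the Java programs reproduce the actual database behavior:

  1. If the client, connection and field character sets all match, what gets stored is simply the byte stream encoded with the operating system's default character set.

  2. If client and connection match but the field differs, the conversion from connection to the field's character set uses the operating system's character set.

  3. If client and connection differ, the connection character set is used.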
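The experiments can also be replayed with one small helper that models every hop as "decode with the source charset, re-encode with the target charset". This is a simplified sketch, not the rule set above and not MySQL's implementation: the class and method names are hypothetical, Java's GBK and ISO-8859-1 stand in for gbk and latin1, the emoji is assumed to be U+1F600, and Java silently substitutes '?' for unmappable characters where the server, in the gbk-to-latin1 column write shown earlier, raised error 1366 instead.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CharsetPipeline {
    // One conversion hop: interpret the bytes as `from`, re-encode as `to`.
    // Characters that `to` cannot represent become '?' (0x3f).
    static byte[] convert(byte[] input, Charset from, Charset to) {
        return new String(input, from).getBytes(to);
    }

    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02x", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Charset gbk = Charset.forName("GBK");
        Charset latin1 = StandardCharsets.ISO_8859_1;
        Charset utf8 = StandardCharsets.UTF_8;
        byte[] emoji = "\uD83D\uDE00".getBytes(utf8); // f0 9f 98 80

        // client = connection = column = latin1: bytes pass through untouched
        System.out.println(hex(convert(emoji, latin1, latin1)));                    // f09f9880
        // client gbk --> connection latin1
        System.out.println(hex(convert(emoji, gbk, latin1)));                       // 3f3f
        // client latin1 --> connection gbk --> column latin1
        System.out.println(hex(convert(convert(emoji, latin1, gbk), gbk, latin1))); // 3f3f3f3f
        // client = connection = gbk --> column utf8
        System.out.println(hex(convert(emoji, gbk, utf8)));                         // e9a683e6a6be
    }
}
```

The four printed values match the hex(data) results and byte streams reported in the experiments, which suggests the "decode, then re-encode" picture is at least consistent with what the server stored.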

Supporting unit-test code:

private void printHexString(byte[] bytes) {
	// Print each byte as two lowercase hex digits followed by a space.
	for (byte b : bytes)
		System.out.print(byteToHexStr(b) + " ");
	System.out.println("\n");
}

private String byteToHexStr(byte b) {
	// Mask to 0..255 so negative bytes print as their unsigned value.
	return String.format("%02x", b & 0xFF);
}

References:

https://dev.mysql.com/doc/refman/5.0/en/charset-connection.html

http://www.cnblogs.com/cchust/p/4327019.html

http://mysql.rjweb.org/doc.php/charcoll
