Java Unicode Encoding, the Differences Between MySQL utf8, utf8mb3, and utf8mb4, and Filtering utf8mb4 Characters

Date: 2019-11-17

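Overview

This article covers the basic concepts of UTF-8 and briefly explains how utf8, utf8mb3, and utf8mb4 differ in MySQL. To explain Java's support for Unicode, it then introduces some encoding fundamentals, including code points and code units, along with the commonly used Java methods that support Unicode. It closes with a way to filter out the characters that require utf8mb4.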

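Introduction to UTF-8

UTF-8 (8-bit Unicode Transformation Format) is a variable-length character encoding for Unicode, and also a prefix code. It can represent any character in the Unicode standard, and its single-byte encodings are identical to ASCII, so software written to handle ASCII text keeps working with little or no modification. As a result, it has gradually become the preferred encoding for email, web pages, and other applications that store or transmit text.

UTF-8 encodes each character in one to four bytes (in November 2003, RFC 3629 restricted UTF-8 to the range defined by Unicode, U+0000 to U+10FFFF, which means at most four bytes):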

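The 128 US-ASCII characters need only one byte (Unicode range U+0000 to U+007F).
Latin letters with diacritics, Greek, Cyrillic, Armenian, Hebrew, Arabic, Syriac, and Thaana need two bytes (Unicode range U+0080 to U+07FF).
The remaining characters of the Basic Multilingual Plane (BMP), which include most characters in common use such as the common CJKV (Chinese, Japanese, Korean, Vietnamese) characters, need three bytes (Unicode range U+0800 to U+FFFF).
The rarely used characters in the Unicode supplementary planes need four bytes (Unicode range U+10000 to U+10FFFF; these are mainly uncommon CJK characters, mathematical symbols, emoji, and so on).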
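
The four ranges above map directly onto a small lookup. The following is a minimal sketch, not part of the original article: the class name Utf8LengthDemo and the method utf8Length are made up for illustration, and the main method compares the prediction with what the JDK's UTF-8 encoder produces for a few sample code points.

```java
import java.nio.charset.StandardCharsets;

class Utf8LengthDemo {
    /** Number of bytes UTF-8 needs for one code point, following the four ranges listed above. */
    static int utf8Length(int codePoint) {
        if (codePoint < 0 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a Unicode code point: " + codePoint);
        }
        if (codePoint <= 0x7F) return 1;   // US-ASCII
        if (codePoint <= 0x7FF) return 2;  // Latin with diacritics, Greek, Cyrillic, ...
        if (codePoint <= 0xFFFF) return 3; // rest of the BMP
        return 4;                          // supplementary planes
    }

    public static void main(String[] args) {
        // Sample code points: U+0041, U+00E9, U+4E2D, U+1F511 (one per range).
        int[] samples = {0x41, 0xE9, 0x4E2D, 0x1F511};
        for (int cp : samples) {
            String s = new String(Character.toChars(cp));
            int actual = s.getBytes(StandardCharsets.UTF_8).length;
            System.out.printf("U+%04X: predicted %d, actual %d%n", cp, utf8Length(cp), actual);
        }
    }
}
```

Surrogate code points (U+D800 to U+DFFF) are left out of the samples on purpose, since they are not characters and cannot be encoded in UTF-8 on their own.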

[Figure: UTF-8 encoding scheme]

[Figure: Unicode code point table]
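
To make the bit layout of the UTF-8 encoding scheme concrete, here is a hand-written encoder for a single code point. It is a sketch for illustration only (the class ManualUtf8Encoder and its encode method are hypothetical names, not part of the JDK); real code should simply use String.getBytes(StandardCharsets.UTF_8), which the main method uses as a cross-check.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ManualUtf8Encoder {
    /** Encodes one code point (surrogates excluded) into UTF-8 bytes by hand. */
    static byte[] encode(int cp) {
        if (cp < 0 || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
            throw new IllegalArgumentException("cannot encode: " + cp);
        }
        if (cp <= 0x7F) {
            return new byte[] {(byte) cp};               // 0xxxxxxx
        } else if (cp <= 0x7FF) {
            return new byte[] {
                (byte) (0xC0 | (cp >> 6)),               // 110xxxxx
                (byte) (0x80 | (cp & 0x3F))};            // 10xxxxxx
        } else if (cp <= 0xFFFF) {
            return new byte[] {
                (byte) (0xE0 | (cp >> 12)),              // 1110xxxx
                (byte) (0x80 | ((cp >> 6) & 0x3F)),      // 10xxxxxx
                (byte) (0x80 | (cp & 0x3F))};            // 10xxxxxx
        } else {
            return new byte[] {
                (byte) (0xF0 | (cp >> 18)),              // 11110xxx
                (byte) (0x80 | ((cp >> 12) & 0x3F)),     // 10xxxxxx
                (byte) (0x80 | ((cp >> 6) & 0x3F)),      // 10xxxxxx
                (byte) (0x80 | (cp & 0x3F))};            // 10xxxxxx
        }
    }

    public static void main(String[] args) {
        int cp = 0x1F511; // the key emoji used as the running example in this article
        byte[] manual = encode(cp);
        byte[] jdk = new String(Character.toChars(cp)).getBytes(StandardCharsets.UTF_8);
        System.out.println("matches JDK encoder: " + Arrays.equals(manual, jdk));
    }
}
```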

References and further reading:
Wikipedia, UTF-8 https://en.wikipedia.org/wiki/UTF-8, Chinese edition https://zh.wikipedia.org/wiki/UTF-8
Wikipedia, Plane (Unicode) https://en.wikipedia.org/wiki/Plane_%28Unicode%29
Wikipedia, CJK characters https://en.wikipedia.org/wiki/CJK_characters
Wikipedia, Emoji https://en.wikipedia.org/wiki/Emoji

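How UTF-8 Relates to Unicode

UTF-8 is one implementation of Unicode. Put simply, Unicode assigns a number to every character we use, and UTF-8 defines how those numbers are stored as bytes. It is worth stressing that UTF-8 is a variable-length encoding.
The Unicode range is U+0000 to U+10FFFF.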

References and further reading
Wikipedia, Unicode https://en.wikipedia.org/wiki/Unicode

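utf8, utf8mb3, and utf8mb4 in MySQL

utf8mb4: MySQL added the utf8mb4 character set in version 5.5.3. The mb4 suffix indicates a maximum of four bytes per character, and the character set exists specifically to support four-byte Unicode characters.
utf8 in MySQL stores Unicode characters of at most three bytes; it is the same thing as MySQL's utf8mb3.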

References
mysql-charset-unicode-utf8mb3 https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb3.html and https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8.html
mysql-charset-unicode-utf8mb4 https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html

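Representable ranges:

Item	MySQL utf8 / utf8mb3	MySQL utf8mb4
Max bytes per character	3	4
Coverage	Basic Multilingual Plane (which includes US-ASCII)	Supplementary planes + Basic Multilingual Plane (which includes US-ASCII)
Unicode range	U+0000 - U+FFFF	U+0000 - U+10FFFF
Typical characters	English letters, most common CJK characters, and so on	Additionally: uncommon CJK characters, mathematical symbols, emoji, and so on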

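This leads to a problem: if a MySQL database uses utf8mb3, inserting a character that takes four bytes fails with an error such as "java.sql.SQLException: Incorrect string value: '\xF0\x9F\x94\x91\xE6\x9D...' for column 'core_data' at row 1". A later section shows how to filter such characters out in Java.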
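
Before filtering anything, it can be useful to know whether a value would trigger this error at all. The sketch below is an assumption-laden illustration rather than anything from the original article: the class SupplementaryCheck and the method containsSupplementary are hypothetical names, and the check only tells you that the string holds a code point above U+FFFF, which is what a utf8mb3 column cannot store.

```java
class SupplementaryCheck {
    /** True if the string contains at least one code point above U+FFFF (four bytes in UTF-8). */
    static boolean containsSupplementary(String s) {
        return s.codePoints().anyMatch(Character::isSupplementaryCodePoint);
    }

    public static void main(String[] args) {
        String plain = "a\u4E2D";               // 'a' followed by U+4E2D, both in the BMP
        String withKey = "a\uD83D\uDD11";       // 'a' followed by U+1F511 as a surrogate pair
        System.out.println(containsSupplementary(plain));
        System.out.println(containsSupplementary(withKey));
    }
}
```

String.codePoints() requires Java 8 or later.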

To filter utf8mb4-only characters in Java, we first need to understand how Java supports Unicode. The following sections build up to that step by step.

Encoding Basics

First, a few concepts: character, character set, coded character set, code point, code space, character encoding scheme, and code unit, followed by the three commonly used Unicode encoding forms.

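character: a character; 'a', '€', and '中' are each one character
character set: a collection of characters
coded character set: assigns each character a unique number that represents it; the set of these numbers is the coded character set. Unicode is a coded character set
code point: a number that represents one character of a character set, that is, one value in the coded character set. For example, in Unicode the code point of 'A' is 65 (usually written U+0041)
code space: the range of code points in a coded character set. For example, the Unicode code space is 0x0000 - 0x10FFFF
character encoding scheme: defines how a character is represented as one or more fixed-length code units. The "UTF-8 encoding scheme" mentioned earlier is one such scheme; others include UTF-16, UTF-32, and GBK
code unit: the smallest fixed-length unit of an encoding scheme. The UTF-8 code unit is 1 byte (8 bits), the UTF-16 code unit is 2 bytes (16 bits), and the UTF-32 code unit is 4 bytes (32 bits)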
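
The difference between a code point and a code unit is easiest to see by counting. The following sketch (the class CodeUnitCount and its helpers are hypothetical names) counts how many code units one character occupies in each encoding by dividing the encoded byte length by the code unit size. It assumes the runtime provides a UTF-32BE charset, which OpenJDK does but which the Java specification does not guarantee.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class CodeUnitCount {
    static int utf8Units(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;          // 1 byte per code unit
    }

    static int utf16Units(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length / 2;   // 2 bytes per code unit, no BOM
    }

    static int utf32Units(String s) {
        // Assumption: a UTF-32BE charset is available on this runtime.
        return s.getBytes(Charset.forName("UTF-32BE")).length / 4; // 4 bytes per code unit, no BOM
    }

    public static void main(String[] args) {
        // One character each from the ASCII range, the BMP, and the supplementary planes.
        String[] samples = {"A", "\u4E2D", "\uD83D\uDD11"};
        for (String s : samples) {
            System.out.printf("U+%04X: UTF-8=%d, UTF-16=%d, UTF-32=%d code unit(s)%n",
                    s.codePointAt(0), utf8Units(s), utf16Units(s), utf32Units(s));
        }
    }
}
```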

The three common Unicode encoding forms are UTF-8, UTF-16, and UTF-32. The brief walkthrough below uses the supplementary-plane character '🔑' as an example; its code point is 128273 (0x1F511):

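UTF-8: the code unit is 8 bits, and one to four code units represent a Unicode character. A supplementary-plane character takes four bytes in UTF-8. Following the four-byte layout of the UTF-8 encoding scheme shown earlier, from high to low 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx, the encoding is '11110000 10011111 10010100 10010001'. Note that this is not the plain binary representation of 0x1F511; do not confuse the two.
UTF-16: the code unit is 16 bits, and one or two code units represent a Unicode character. U+0000-U+FFFF (the BMP) takes one code unit, while U+10000-U+10FFFF (the supplementary planes) takes two code units, a high surrogate and a low surrogate. High surrogates range over U+D800-U+DBFF and low surrogates over U+DC00-U+DFFF. The encoding method is shown in the figure below, and the result is '11011000 00111101 11011101 00010001'. Unicode reserves U+D800-U+DFFF exclusively for UTF-16 and assigns no characters there, so there is no risk of a code point having two meanings.
UTF-32: the code unit is 32 bits, so a single code unit can represent every Unicode character, and the encoding is simply the value of the code point: '00000000 00000001 11110101 00010001'.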
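
The UTF-16 surrogate pair for a supplementary code point is obtained by subtracting 0x10000 and splitting the remaining 20 bits into two 10-bit halves. The sketch below spells out that arithmetic; the class SurrogatePairDemo and the method toSurrogates are hypothetical names, and in real code Character.toChars does the same job.

```java
class SurrogatePairDemo {
    /** Splits a supplementary code point into its UTF-16 high and low surrogates. */
    static char[] toSurrogates(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("not a supplementary code point: " + codePoint);
        }
        int v = codePoint - 0x10000;                 // 20-bit value
        char high = (char) (0xD800 + (v >> 10));     // top 10 bits
        char low = (char) (0xDC00 + (v & 0x3FF));    // bottom 10 bits
        return new char[] {high, low};
    }

    public static void main(String[] args) {
        char[] pair = toSurrogates(0x1F511);
        // For U+1F511 this prints the pair D83D DD11.
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]);
    }
}
```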

[Figure: UTF-8 encoding scheme]

[Figure: UTF-16 encoding scheme]

Printing the encoded bits:

    // Assumes a JUnit test class with an SLF4J-style logger named log, plus imports for
    // java.nio.charset.Charset and java.nio.charset.StandardCharsets.
    @Test
    public void printCharacterCode() {
        String s = "\uD83D\uDD11"; // the character '🔑'
        log.info("UTF8: {}", bytesToBits(s.getBytes(StandardCharsets.UTF_8)));
        log.info("UTF16: {}", bytesToBits(s.getBytes(StandardCharsets.UTF_16)));
        log.info("UTF32: {}", bytesToBits(s.getBytes(Charset.forName("UTF-32"))));
    }

    // Renders one byte as eight '0'/'1' characters, most significant bit first.
    public static String byteToBit(byte b) {
        StringBuilder sb = new StringBuilder(8);
        for (int shift = 7; shift >= 0; shift--) {
            sb.append((b >> shift) & 0x1);
        }
        return sb.toString();
    }

    // Renders a byte array as groups of eight bits, each followed by a space.
    public static String bytesToBits(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(byteToBit(b)).append(' ');
        }
        return sb.toString();
    }

The code above prints:

UTF8: 11110000 10011111 10010100 10010001 
UTF16: 11111110 11111111 11011000 00111101 11011101 00010001 
UTF32: 00000000 00000001 11110101 00010001

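The UTF-16 result is not the '11011000 00111101 11011101 00010001' we expected: it is preceded by an extra code unit, 'FEFF'. This is the Unicode BOM (byte order mark), which indicates the order of the bytes (not the bits). The BOM is optional, but when used it must appear at the very start of the string (other encodings do not start a string with these bytes, so the BOM can also help identify whether a string is Unicode-encoded).

Why use a BOM? To mark the byte order of the code units. For example, the Unicode code point of 「奎」 is 594E and that of 「乙」 is 4E59. If we receive the UTF-16 byte stream "594E", is it 「奎」 or 「乙」? If the bytes are 'FEFF 4E59', the stream is big-endian (most significant byte first), and the character is 「乙」.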

[Figure: the BOMs defined by Unicode]

The BOM may be omitted; when it is absent, big-endian is assumed by default.
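
The byte-order ambiguity can be reproduced with the JDK's own decoders. The sketch below (the class ByteOrderDemo and its helper names are made up for illustration) decodes the same two bytes, 59 4E, as big-endian and as little-endian, and then lets the "UTF-16" charset pick the order from a BOM. The results are printed as code points to avoid depending on the console encoding.

```java
import java.nio.charset.StandardCharsets;

class ByteOrderDemo {
    static String decodeBE(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16BE); // always big-endian
    }

    static String decodeLE(byte[] bytes) {
        return new String(bytes, StandardCharsets.UTF_16LE); // always little-endian
    }

    static String decodeAuto(byte[] bytes) {
        // "UTF-16" honours a leading BOM and falls back to big-endian without one.
        return new String(bytes, StandardCharsets.UTF_16);
    }

    public static void main(String[] args) {
        byte[] raw = {0x59, 0x4E};
        byte[] withBigEndianBom = {(byte) 0xFE, (byte) 0xFF, 0x4E, 0x59};
        byte[] withLittleEndianBom = {(byte) 0xFF, (byte) 0xFE, 0x59, 0x4E};

        System.out.printf("59 4E as BE: U+%04X%n", decodeBE(raw).codePointAt(0));
        System.out.printf("59 4E as LE: U+%04X%n", decodeLE(raw).codePointAt(0));
        System.out.printf("FE FF 4E 59: U+%04X%n", decodeAuto(withBigEndianBom).codePointAt(0));
        System.out.printf("FF FE 59 4E: U+%04X%n", decodeAuto(withLittleEndianBom).codePointAt(0));
    }
}
```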

[Figure: summary of the properties of the UTFs]

References and further reading
Supplementary Characters in the Java Platform http://www.oracle.com/us/technologies/java/supplementary-142654.html
Unicode surrogate programming with the Java language https://www.ibm.com/developerworks/library/j-unicode/
Wikipedia, UTF-16 https://zh.wikipedia.org/wiki/UTF-16
Wikipedia, Code point https://en.wikipedia.org/wiki/Code_point
D000-DFFF code chart http://jicheng.tw/hanzi/unicode.html?s=D000&e=DFFF
UTF and BOM FAQ http://unicode.org/faq/utf_bom.html

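Java and Unicode

Originally, Unicode code points did not exceed 65,535 (0xFFFF), and early Java versions used the 16-bit char to represent every Unicode character of the time. Unicode was later extended up to 1,114,111 (0x10FFFF) (Unicode 2.0 introduced the supplementary planes, and Unicode 3.1 was the first to assign characters in them). A Java char can no longer represent every Unicode code point (a code point needs up to 21 bits, so in practice a 32-bit type). The JSR-204 expert group discussed many ways to solve this problem, including: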

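Designing a new character type, char32, to replace the existing char
Using int to represent code points while keeping char, and adding APIs to String, StringBuffer, and others that accept both char and int
...
In the end, for reasons of memory footprint and compatibility, the following approach was adopted:
Use int to represent code points in low-level APIs, such as the Character class
Represent all strings as char sequences in UTF-16, and encourage this form in high-level APIs
Provide methods that convert easily between int (code point) and char for the cases where conversion is needed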

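As noted earlier, UTF-16 uses two code units for each of the 1,048,576 (1024*1024) code points above U+FFFF. The corresponding concept in Java is the "surrogate pair".

Below are some commonly used Java methods for converting between code points (int) and char: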

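Character.toCodePoint(char high, char low), returns int: converts two UTF-16 chars (two UTF-16 code units) into a code point
Character.toChars(int codePoint), returns char[]: converts a code point into one or two UTF-16 code units
Character.isSupplementaryCodePoint(int codePoint): tests whether a code point is a supplementary character (above U+FFFF in Unicode)
Character.isSurrogate(char ch): tests whether a char is one of the two code units UTF-16 uses for a character above U+FFFF
Character.isHighSurrogate(char ch): tests whether a char is the high (leading) unit of such a pair
Character.isLowSurrogate(char ch): tests whether a char is the low (trailing) unit of such a pair
String.length(): a very commonly used method, but what it actually returns is the number of UTF-16 code units, so if the string contains characters that take two code units, length() is larger than the real number of characters
String.codePointCount(int beginIndex, int endIndex): returns the number of code points. For a string with no two-unit characters it equals length(); otherwise it is the number of characters, which is less than length()
StringBuilder and StringBuffer mostly offer append methods for String and char, but they also provide appendCodePoint(int codePoint) for appending a character by code point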
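
A short round trip through several of the methods listed above is sketched here. Only standard JDK calls are used; the class CodePointApiDemo and its describe helper are hypothetical names introduced for the example.

```java
class CodePointApiDemo {
    /** Reports both ways of measuring a string: UTF-16 code units and code points. */
    static String describe(String s) {
        return "length=" + s.length() + ", codePoints=" + s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // Code point -> UTF-16 code units.
        char[] units = Character.toChars(0x1F511);
        System.out.println("code units: " + units.length);

        // UTF-16 code units -> code point.
        int codePoint = Character.toCodePoint(units[0], units[1]);
        System.out.printf("code point: U+%X%n", codePoint);

        // Build a string from a char and a code point, then measure it both ways.
        String s = new StringBuilder().append('a').appendCodePoint(codePoint).toString();
        System.out.println(describe(s)); // length=3, codePoints=2
    }
}
```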

Here is a simple example:

@Test
    public void testConverterOfCodePointAndChar() {
        String s = "a中\uD83D\uDD11a中";
        // Deliberately naive: i counts code points here, but codePointAt(i) expects a char index.
        for (int i = 0; i < s.codePointCount(0, s.length()); i++) {
            int codePoint = s.codePointAt(i);
            log.info("code point at {}: {},\t isSupplementaryCodePoint:{}", i, codePoint, Character.isSupplementaryCodePoint(codePoint));
        }

        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            log.info("char at {}: {},\t isSurrogate:{},\t isHighSurrogate:{},\t isLowSurrogate:{}, ", i, c, Character.isSurrogate(c), Character.isHighSurrogate(c), Character.isLowSurrogate(c));
        }
    }

Output:

code point at 0: 97,     isSupplementaryCodePoint:false
code point at 1: 20013,  isSupplementaryCodePoint:false
code point at 2: 128273,     isSupplementaryCodePoint:true
code point at 3: 56593,  isSupplementaryCodePoint:false
code point at 4: 97,     isSupplementaryCodePoint:false
char at 0: a,    isSurrogate:false,  isHighSurrogate:false,  isLowSurrogate:false
char at 1: 中,    isSurrogate:false,  isHighSurrogate:false,  isLowSurrogate:false
char at 2: ?,    isSurrogate:true,   isHighSurrogate:true,   isLowSurrogate:false
char at 3: ?,    isSurrogate:true,   isHighSurrogate:false,  isLowSurrogate:true
char at 4: a,    isSurrogate:false,  isHighSurrogate:false,  isLowSurrogate:false
char at 5: 中,    isSurrogate:false,  isHighSurrogate:false,  isLowSurrogate:false

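The example above shows something odd: the number of characters reported by codePointCount is correct, but when we fetch them with codePointAt, a supplementary character is not automatically counted as two code units, because the index passed to codePointAt is a char (code unit) index, not a code point index. The source code (see the appendix) shows that:

codePointCount takes the length and subtracts the number of surrogate pairs (two-unit characters)
codePointAt checks whether the code unit at the given index is a UTF-16 high surrogate; if so, it also reads the following low surrogate and returns the complete code point. If the index lands on a low surrogate, however, it does nothing special and returns the low surrogate itself
So to iterate correctly over a string that contains two-unit characters, you have to handle this yourself: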

@Test
    public void testIterateCodePoint() {
        String s = "a中\uD83D\uDD11a中";
        for (int i = 0; i < s.length(); i++) {
            int codePoint = s.codePointAt(i);
            log.info("code point at {}: {},\t isSupplementaryCodePoint:{}", i, codePoint, Character.isSupplementaryCodePoint(codePoint));
            // A supplementary code point occupies two chars, so skip its low surrogate.
            if (Character.isSupplementaryCodePoint(codePoint)) i++;
        }
    }

Output:

code point at 0: 97,     isSupplementaryCodePoint:false
code point at 1: 20013,  isSupplementaryCodePoint:false
code point at 2: 128273,     isSupplementaryCodePoint:true
code point at 4: 97,     isSupplementaryCodePoint:false
code point at 5: 20013,  isSupplementaryCodePoint:false
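
The same traversal can be written without a special case by advancing the index by Character.charCount(codePoint), which is 2 for supplementary code points and 1 otherwise. This is a sketch of that alternative, not the article's own code; the class CodePointIteration and the method codePointsOf are hypothetical names. On Java 8 and later, String.codePoints() gives the same sequence as a stream.

```java
import java.util.ArrayList;
import java.util.List;

class CodePointIteration {
    /** Collects the code points of a string, stepping over surrogate pairs as a unit. */
    static List<Integer> codePointsOf(String s) {
        List<Integer> result = new ArrayList<>();
        for (int i = 0; i < s.length(); ) {
            int codePoint = s.codePointAt(i);
            result.add(codePoint);
            i += Character.charCount(codePoint); // 1 for BMP, 2 for supplementary
        }
        return result;
    }

    public static void main(String[] args) {
        // Same content as the article's sample string, written with escapes.
        String s = "a\u4E2D\uD83D\uDD11a\u4E2D";
        System.out.println(codePointsOf(s)); // [97, 20013, 128273, 97, 20013]
    }
}
```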

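Filtering Four-Byte UTF-8 Characters in Java

With the concepts above in hand, filtering out four-byte UTF-8 characters is no longer difficult.
A four-byte UTF-8 character is a character in the Unicode supplementary planes, that is, one whose code point is greater than U+FFFF. So we only need to obtain the code point of each character in the string and drop it when the code point is greater than 0xFFFF (or simply test it with Character.isSupplementaryCodePoint). Sample code: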

@Test
    public void filterUtf8mb4Test() {
        String s = "a中\uD83D\uDD11a中";
        log.info(filterUtf8mb4(s));
    }

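    // Removes every supplementary code point (above U+FFFF), that is, every
    // character that needs four bytes in UTF-8.
    public static String filterUtf8mb4(String str) {
        final int LAST_BMP = 0xFFFF;
        StringBuilder sb = new StringBuilder(str.length());
        for (int i = 0; i < str.length(); i++) {
            int codePoint = str.codePointAt(i);
            if (codePoint <= LAST_BMP) { // <= so that U+FFFF itself is kept
                sb.appendCodePoint(codePoint);
            } else {
                i++; // skip the low surrogate of the pair just read
            }
        }
        return sb.toString();
    }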

Output:

a中a中
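
Silently dropping characters is not always what an application wants, so a variant that substitutes a placeholder may be useful. The sketch below is one possible alternative built on String.codePoints() (Java 8+) and Character.isBmpCodePoint; the class Utf8mb3Filter and both method names are hypothetical. Unpaired surrogates count as BMP values here and pass through unchanged.

```java
class Utf8mb3Filter {
    /** Drops every code point above U+FFFF and keeps everything in the BMP. */
    static String stripSupplementary(String s) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints()
         .filter(Character::isBmpCodePoint)
         .forEach(sb::appendCodePoint);
        return sb.toString();
    }

    /** Replaces every code point above U+FFFF with the given placeholder text. */
    static String replaceSupplementary(String s, String placeholder) {
        StringBuilder sb = new StringBuilder(s.length());
        s.codePoints().forEach(cp -> {
            if (Character.isBmpCodePoint(cp)) {
                sb.appendCodePoint(cp);
            } else {
                sb.append(placeholder);
            }
        });
        return sb.toString();
    }

    public static void main(String[] args) {
        // Same content as the article's sample string, written with escapes.
        String s = "a\u4E2D\uD83D\uDD11a\u4E2D";
        System.out.println(stripSupplementary(s).length());        // 4 chars remain
        System.out.println(replaceSupplementary(s, "?").length()); // 5 chars: the pair became "?"
    }
}
```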

Appendix

Source of String's codePointCount and codePointAt:

public int codePointCount(int beginIndex, int endIndex) {
        if (beginIndex < 0 || endIndex > value.length || beginIndex > endIndex) {
            throw new IndexOutOfBoundsException();
        }
        return Character.codePointCountImpl(value, beginIndex, endIndex - beginIndex);
    }
    
    public int codePointAt(int index) {
        if ((index < 0) || (index >= value.length)) {
            throw new StringIndexOutOfBoundsException(index);
        }
        return Character.codePointAtImpl(value, index, value.length);
    }

Source of the Character methods they call, codePointCountImpl and codePointAtImpl:

static int codePointCountImpl(char[] a, int offset, int count) {
        int endIndex = offset + count;
        int n = count;
        for (int i = offset; i < endIndex; ) {
            if (isHighSurrogate(a[i++]) && i < endIndex &&
                isLowSurrogate(a[i])) {
                n--;
                i++;
            }
        }
        return n;
    }
    
    static int codePointAtImpl(char[] a, int index, int limit) {
        char c1 = a[index];
        if (isHighSurrogate(c1) && ++index < limit) {
            char c2 = a[index];
            if (isLowSurrogate(c2)) {
                return toCodePoint(c1, c2);
            }
        }
        return c1;
    }