The relationship between Unicode, UTF-8, and UTF-16

Date: 2019-11-07

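1. Why Unicode is needed

In the early days of computing there was only ASCII. More control characters, punctuation marks, and symbols were added later, and today a single document can mix many languages, for example: English, العربية, 漢語, עִבְרִית, ελληνικά, and ភាសាខ្មែរ. Characters from still more languages may appear in the future, and computers need to display all of them. A single character set that covers every language is therefore necessary, and that is the purpose of Unicode.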

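2. Introduction to Unicode

Unicode is a character set that aims to cover the characters of all the world's languages. It assigns each character a unique number, officially called a code point. One major advantage of Unicode is that its first 256 code points are identical to ISO-8859-1, and the first 128 of those are identical to ASCII. Most commonly used characters fall in the Basic Multilingual Plane (BMP, U+0000 to U+FFFF), so their code points fit in two bytes.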
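As a small illustration of code points, here is a sketch in Python. The sample characters and the variable names are chosen for this example only; `ord()` and the `latin-1` / `ascii` codecs are standard Python.

```python
# Code points are plain integers; ord() maps a character to its code point.
samples = ["A", "À", "漢", "𤭢"]
code_points = {ch: ord(ch) for ch in samples}

for ch, cp in code_points.items():
    # Code points are conventionally written as U+ followed by hex digits.
    print(f"{ch!r} -> U+{cp:04X}")

# The first 256 code points coincide with ISO-8859-1 (Latin-1): decoding a
# single byte as Latin-1 yields the character whose code point is that byte value.
latin1_matches = all(bytes([b]).decode("latin-1") == chr(b) for b in range(256))

# The first 128 code points coincide with ASCII.
ascii_matches = all(bytes([b]).decode("ascii") == chr(b) for b in range(128))
```

Note that the last sample lies outside the BMP, so its code point needs more than 16 bits.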

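3. Why encodings such as UTF-8 or UTF-16 are needed

Unicode defines which number belongs to each character, but a code point is only an abstract number. To store text in memory, write it to a file, or send it over a network, every code point has to be turned into a concrete sequence of bytes, and those bytes have to be turned back into the characters we can read. This mapping is called encoding. UTF-8 and UTF-16 are different encodings of the Unicode character set, each with its own rules for converting code points to bytes. GBK is also a commonly used encoding, but it encodes its own character set rather than Unicode.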
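To make the idea concrete, the following Python sketch encodes one character with three different encodings. The character "中" is picked only for illustration; `utf-16-be` is used so that no byte order mark is prepended.

```python
ch = "中"  # code point U+4E2D

# The same character becomes a different byte sequence under each encoding.
encodings = ["utf-8", "utf-16-be", "gbk"]
encoded = {name: ch.encode(name) for name in encodings}

for name, data in encoded.items():
    print(f"{name:10} {data.hex(' ')}")

# Decoding with the matching encoding recovers the original character.
round_trip_ok = all(data.decode(name) == ch for name, data in encoded.items())
```

Decoding bytes with the wrong encoding is what produces unreadable text, which is why the encoding has to be known or agreed upon.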

4. Differences between UTF-8 and UTF-16

(1) Comparison by storage size:

UTF-8:
1 byte: standard ASCII
2 bytes: Arabic, Hebrew, most European scripts (most notably excluding Georgian)
3 bytes: the rest of the BMP
4 bytes: all remaining Unicode characters (the supplementary planes)

UTF-16:
2 bytes: BMP
4 bytes: all remaining Unicode characters (the supplementary planes)
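The size classes above can be checked with a short Python sketch. The sample characters and the helper name `encoded_sizes` are assumptions made for this example; `utf-16-le` is used so that the byte order mark is not counted.

```python
def encoded_sizes(ch: str) -> tuple[int, int]:
    """Return (UTF-8 byte count, UTF-16 byte count) for one character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))

# One sample per size class: ASCII, Latin-1 symbol, Hebrew, Georgian,
# another BMP symbol, and a supplementary-plane character.
samples = ["$", "¢", "ע", "ა", "€", "𤭢"]
sizes = {ch: encoded_sizes(ch) for ch in samples}

for ch, (utf8_len, utf16_len) in sizes.items():
    print(f"U+{ord(ch):05X}  UTF-8: {utf8_len} bytes  UTF-16: {utf16_len} bytes")
```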

Examples:

UTF-8 encoding:
00100100 for "$" (one 8-bit unit)
11000010 10100010 for "¢" (two 8-bit units)
11100010 10000010 10101100 for "€" (three 8-bit units)
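The UTF-8 bit patterns can be reproduced by hand. The following is a minimal sketch of the UTF-8 bit layout, not a production encoder: the function names `utf8_encode` and `bits` are made up for this example, and the code does not reject surrogates or out-of-range values.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point using the UTF-8 bit layout (no validation)."""
    if cp < 0x80:        # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:       # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:     # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([0xF0 | (cp >> 18),
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])

def bits(data: bytes) -> str:
    """Format bytes as space-separated 8-bit binary groups."""
    return " ".join(f"{b:08b}" for b in data)

for ch in "$¢€":
    print(bits(utf8_encode(ord(ch))), "for", repr(ch))
```

The leading bits of the first byte tell a decoder how many bytes the character occupies, and every following byte starts with `10`.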

UTF-16 encoding:
00000000 00100100 for "$" (one 16-bit unit)
11011000 01010010 11011111 01100010 for "𤭢" (two 16-bit units)
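The two 16-bit units for a supplementary character form a surrogate pair. A minimal sketch of that calculation in Python follows; the function names `utf16_code_units` and `unit_bits` are hypothetical, and the input is assumed to be a valid code point.

```python
def utf16_code_units(cp: int) -> list[int]:
    """Return the UTF-16 code units for one code point (no validation)."""
    if cp < 0x10000:
        # BMP characters are stored as a single 16-bit unit.
        return [cp]
    # Supplementary characters: subtract 0x10000, then split the remaining
    # 20 bits into a high surrogate and a low surrogate of 10 bits each.
    offset = cp - 0x10000
    return [0xD800 | (offset >> 10), 0xDC00 | (offset & 0x3FF)]

def unit_bits(units: list[int]) -> str:
    """Format 16-bit units as space-separated 8-bit binary groups."""
    return " ".join(f"{u >> 8:08b} {u & 0xFF:08b}" for u in units)

dollar_units = utf16_code_units(ord("$"))
han_units = utf16_code_units(ord("𤭢"))
print(unit_bits(dollar_units))
print(unit_bits(han_units))
```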

5. Pros and cons of UTF-8 and UTF-16

Both UTF-8 and UTF-16 are variable-length encodings. The smallest unit in UTF-8 is 8 bits, and the smallest unit in UTF-16 is 16 bits.

UTF-8 advantages:
1. Backward compatible with ASCII (US-ASCII): ASCII text is already valid UTF-8.
2. No null bytes (apart from the encoding of U+0000 itself), which allows null-terminated strings and brings a great deal of backward compatibility.
3. UTF-8 is independent of byte order, so there is no big-endian / little-endian issue to worry about.
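These three points can be observed directly in Python. The sketch below uses sample strings chosen for illustration only.

```python
# 1. ASCII text produces the same bytes in ASCII and in UTF-8.
ascii_text = "hello"
ascii_compatible = ascii_text.encode("utf-8") == ascii_text.encode("ascii")

# 2. Text without U+0000 never produces a zero byte in UTF-8.
mixed_text = "héllo 漢語 𤭢"
contains_null = 0 in mixed_text.encode("utf-8")

# 3. UTF-8 has a single byte order, while UTF-16 comes in two variants
#    whose bytes are swapped relative to each other.
utf8 = "A".encode("utf-8")
utf16_le = "A".encode("utf-16-le")
utf16_be = "A".encode("utf-16-be")

print(ascii_compatible, contains_null, utf8, utf16_le, utf16_be)
```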

UTF-8 disadvantages:
1. Many common characters have different lengths, which makes indexing by code point and counting code points much slower.
2. Even though byte order doesn't matter, UTF-8 text sometimes still carries a BOM (byte order mark) to signal that it is encoded in UTF-8. The BOM breaks compatibility with ASCII software even if the text contains only ASCII characters. Microsoft software (like Notepad) is especially fond of adding a BOM to UTF-8.
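Both drawbacks can be sketched in Python. The helper `count_code_points` is a hypothetical name and assumes its input is valid UTF-8; `utf-8-sig` is Python's codec name for UTF-8 with a BOM.

```python
import codecs

def count_code_points(data: bytes) -> int:
    """Count code points in valid UTF-8 by scanning every byte.

    Continuation bytes look like 10xxxxxx; every other byte starts a new
    character, so the whole buffer has to be walked to get the count.
    """
    return sum(1 for b in data if (b & 0xC0) != 0x80)

data = "a¢€𤭢".encode("utf-8")
byte_length = len(data)                    # 1 + 2 + 3 + 4 bytes
code_point_count = count_code_points(data)

# A BOM adds three non-ASCII bytes in front of otherwise pure ASCII text.
with_bom = "abc".encode("utf-8-sig")
has_bom = with_bom.startswith(codecs.BOM_UTF8)

try:
    with_bom.decode("ascii")
    ascii_ok = True
except UnicodeDecodeError:
    # A strict ASCII consumer rejects the BOM bytes.
    ascii_ok = False

print(byte_length, code_point_count, has_bom, ascii_ok)
```

Decoding with `utf-8-sig` strips the BOM again, while plain `utf-8` keeps it as a leading U+FEFF character.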

UTF-16 advantages:
1. BMP (Basic Multilingual Plane) characters, including Latin, Cyrillic, most Chinese (the PRC made support for some code points outside the BMP mandatory), and most Japanese, can be represented with 2 bytes. This speeds up indexing and counting code points when the text contains no supplementary characters.
2. Even if the text has supplementary characters, they are still represented by pairs of 16-bit values, so the total length is still divisible by two and a 16-bit char can serve as the primitive component of the string.

UTF-16 disadvantages:
1. Lots of null bytes in US-ASCII strings, which means no null-terminated byte strings and a lot of wasted memory.
2. Treating it as a fixed-length encoding "mostly works" in many common scenarios (especially in the US, the EU, countries with Cyrillic alphabets, Israel, Arab countries, Iran, and many others), which often leads to broken support where it doesn't. Programmers have to be aware of surrogate pairs and handle them properly wherever it matters.
3. It is variable length, so counting or indexing code points is costly, though less so than in UTF-8.
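A short Python sketch of these pitfalls follows. The helper name `code_units` and the sample strings are assumptions for illustration; Python's own `str` indexes by code point, so the UTF-16 view is obtained by encoding to `utf-16-le`.

```python
def code_units(text: str) -> int:
    """Number of 16-bit units the text occupies in UTF-16."""
    return len(text.encode("utf-16-le")) // 2

# 1. ASCII text in UTF-16 has a zero byte for every character.
ascii_bytes = "Hi".encode("utf-16-le")

# 2. Unit count and code point count differ once a supplementary
#    character appears, so "one unit = one character" stops working.
text = "a𤭢b"
unit_count = code_units(text)
code_point_count = len(text)

# Cutting after two 16-bit units splits the surrogate pair in half.
raw = text.encode("utf-16-le")
first_two_units = raw[:4]
try:
    first_two_units.decode("utf-16-le")
    split_ok = True
except UnicodeDecodeError:
    # The lone high surrogate is not valid UTF-16 on its own.
    split_ok = False

print(ascii_bytes, unit_count, code_point_count, split_ok)
```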

In general, UTF-16 is usually better for in-memory representation while UTF-8 is extremely good for text files and network protocols.

Reference examples:

"A" in ASCII is hex 0x41; in UTF-8 it is also 0x41; in UTF-16 it is 0x0041.
"À" in Latin-1 is 0xC0; in UTF-8 it is 0xC3 0x80; in UTF-16 it is 0x00C0.
The Tibetan letter ཨ in UTF-8 is 0xE0 0xBD 0xA8; in UTF-16 it is 0x0F68.
This character*: http://www.fileformat.info/info/... in UTF-8 is 0xF0 0xA0 0x80 0x8B; in UTF-16 it is 0xD840 0xDC0B.