How Does UTF-8 Encoding Work?

Date: 2019-11-21

As everyone knows, a computer stores binary 0s and 1s. So how is a String turned into binary 0s and 1s?

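Each character is mapped to a number (its code point), usually written in hexadecimal. Hexadecimal is only a compact notation for a run of 0s and 1s, so it corresponds directly to the binary data stored on the computer.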

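Different character sets use different encodings and store a different number of bytes per character. The rest of this article uses UTF-8 as the example of how characters are encoded into binary: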

String a = "A"

a.getBytes().length is 1

byte array is [65]


String a = "ë"

a.getBytes().length is 2

byte array is [-61, -85]

As shown above, the character A takes one byte and the character ë takes two bytes.

This assumes that the default charset used by getBytes() is UTF-8.
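Because the no-argument getBytes() depends on the platform default charset, the result above may differ between machines. The following is a minimal sketch that passes the charset explicitly instead; the class name ByteLengthDemo and the helper utf8 are hypothetical names chosen for illustration.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

class ByteLengthDemo {
    // Encode the text as UTF-8 regardless of the platform default charset.
    public static byte[] utf8(String text) {
        return text.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // "\u00EB" is the character ë, written as an escape to avoid
        // depending on the encoding of the source file.
        for (String s : new String[] {"A", "\u00EB"}) {
            byte[] bytes = utf8(s);
            System.out.println(s + " -> " + bytes.length + " byte(s): " + Arrays.toString(bytes));
        }
    }
}
```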

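Some characters take one byte, some take two, and some take more. So how does a decoder know where each character ends?

How does UTF-8 encode characters? Wikipedia gives the rules: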

if the first byte starts with 0 then it is a single-byte char

if the first byte starts with 110 then it is 2 bytes

if the first byte starts with 1110 then it is 3 bytes

if the first byte starts with 11110 then it is 4 bytes

if the first byte starts with 111110 then it is 5 bytes

if the first byte starts with 1111110 then it is 6 bytes

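In other words, the number of leading 1 bits in the first byte gives the length of the sequence, and every byte after the first starts with 10. Note that the 5-byte and 6-byte forms come from the original UTF-8 design: RFC 3629 limits UTF-8 to at most 4 bytes per character, which covers code points up to U+10FFFF.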

So decoding simply applies the same rules in reverse:

if the first byte starts with 0 then it is a single-byte char, so it decodes only that byte

if the first byte starts with 110 then it is 2 bytes, so it decodes 2 consecutive bytes

if the first byte starts with 1110 then it is 3 bytes, so it decodes 3 consecutive bytes

if the first byte starts with 11110 then it is 4 bytes, so it decodes 4 consecutive bytes

if the first byte starts with 111110 then it is 5 bytes, so it decodes 5 consecutive bytes

if the first byte starts with 1111110 then it is 6 bytes, so it decodes 6 consecutive bytes
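The rules above can be sketched as a few bit-mask checks on the first byte. This is an illustrative sketch, not a full validator: the class LeadByte and the method sequenceLength are hypothetical names, the code handles only the 1 to 4 byte forms used by current UTF-8, and it does not reject every lead byte that the standard forbids.

```java
class LeadByte {
    // Returns the sequence length announced by a lead byte, or -1 for a
    // continuation byte (10xxxxxx) or any other byte this sketch does not
    // treat as the start of a sequence.
    public static int sequenceLength(byte lead) {
        int b = lead & 0xFF;              // view the signed byte as 0..255
        if ((b & 0x80) == 0x00) return 1; // 0xxxxxxx
        if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx
        if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx
        if ((b & 0xF8) == 0xF0) return 4; // 11110xxx
        return -1;
    }

    public static void main(String[] args) {
        byte[] samples = {(byte) 0x41, (byte) 0xC3, (byte) 0xE2, (byte) 0xF0, (byte) 0xAB};
        for (byte b : samples) {
            System.out.println(Integer.toHexString(b & 0xFF) + " -> " + sequenceLength(b));
        }
    }
}
```

A decoder would read one byte, call a function like this, and then consume that many bytes in total before moving on to the next character.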

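The table below lists the relationship between Unicode code points (in hexadecimal) and the number of bytes used:

Code point range        Bytes   Byte layout
U+0000 to U+007F        1       0xxxxxxx
U+0080 to U+07FF        2       110xxxxx 10xxxxxx
U+0800 to U+FFFF        3       1110xxxx 10xxxxxx 10xxxxxx
U+10000 to U+10FFFF     4       11110xxx 10xxxxxx 10xxxxxx 10xxxxxx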

Worked examples

ë 110 xxxxx 10 xxxxxx

110 00011 10 101011

00011 101011 → 000 1110 1011 in binary, which is hex EB, the code point of ë

ɟ 110 xxxxx 10 xxxxxx

110 01001 10 011111

01001 011111 → 010 0101 1111 in binary, which is hex 25F, the code point of ɟ
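The two examples above go from code point to bytes by splitting the value into a 5-bit part and a 6-bit part. A minimal sketch of that step, assuming a code point in the two-byte range U+0080 to U+07FF (the class TwoByteEncode and its method encode are hypothetical names):

```java
class TwoByteEncode {
    // Encodes a code point in the range U+0080..U+07FF as two UTF-8 bytes.
    public static byte[] encode(int codePoint) {
        if (codePoint < 0x80 || codePoint > 0x7FF) {
            throw new IllegalArgumentException("not a two-byte code point");
        }
        int first = 0xC0 | (codePoint >> 6);    // 110 + top 5 payload bits
        int second = 0x80 | (codePoint & 0x3F); // 10 + low 6 payload bits
        return new byte[] {(byte) first, (byte) second};
    }

    public static void main(String[] args) {
        for (int cp : new int[] {0xEB, 0x25F}) { // ë and ɟ
            byte[] bytes = encode(cp);
            System.out.println(Integer.toHexString(cp) + " -> "
                    + Integer.toBinaryString(bytes[0] & 0xFF) + " "
                    + Integer.toBinaryString(bytes[1] & 0xFF));
        }
    }
}
```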

How is 11100000 10101101 10011111 decoded? The leading 1110 means that three bytes form one character: 1110xxxx 10xxxxxx 10xxxxxx

11100000 10101101 10011111

So 0000 101101 011111 is the binary to be decoded.

Regrouped four bits at a time, this is 0000 1011 0101 1111, which is B5F in hexadecimal.

In the Unicode character table, B5F (U+0B5F) maps to ୟ.
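The three-byte decoding above amounts to masking off the marker bits and concatenating the remaining 4 + 6 + 6 payload bits. A minimal sketch under the assumption that the input really is a well-formed three-byte sequence (the class ThreeByteDecode and the method decode are hypothetical names, and no validation is performed):

```java
class ThreeByteDecode {
    // Combines the payload bits of a 1110xxxx 10xxxxxx 10xxxxxx sequence
    // into a single code point. The marker bits are not checked.
    public static int decode(byte b0, byte b1, byte b2) {
        int high = b0 & 0x0F; // 4 payload bits after the 1110 marker
        int mid = b1 & 0x3F;  // 6 payload bits after the 10 marker
        int low = b2 & 0x3F;  // 6 payload bits after the 10 marker
        return (high << 12) | (mid << 6) | low;
    }

    public static void main(String[] args) {
        int cp = decode((byte) 0b11100000, (byte) 0b10101101, (byte) 0b10011111);
        System.out.println("U+" + String.format("%04X", cp) + " "
                + new String(Character.toChars(cp)));
    }
}
```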

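Exercise: decode 01000010 01000001 11000011 10110000 11100010 10001011 10110011

1. The first character, 01000010, is a single byte: 0100 0010 is hex 42, which maps to the character B.

2. The second character, 01000001, is a single byte: 0100 0001 is hex 41, which maps to the character A.

3. The third character, 11000011 10110000, is a two-byte sequence: the payload bits 00011 110000 regroup to 000 1111 0000, which is hex F0 and maps to the character ð.

4. The fourth character, 11100010 10001011 10110011, is a three-byte sequence: the payload bits 0010 001011 110011 regroup to 0010 0010 1111 0011, which is hex 22F3 and maps to the character ⋳.

Conclusion: 01000010 01000001 11000011 10110000 11100010 10001011 10110011 decodes from UTF-8 as BAð⋳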
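One way to check an exercise like this is to let the standard library do the decoding. The sketch below parses space-separated 8-bit groups into bytes and decodes them as UTF-8; the class ExerciseCheck and the method decodeBits are hypothetical names, and the input is assumed to contain only well-formed 8-bit groups.

```java
import java.nio.charset.StandardCharsets;

class ExerciseCheck {
    // Parses space-separated 8-bit groups and decodes the bytes as UTF-8.
    public static String decodeBits(String bits) {
        String[] groups = bits.trim().split("\\s+");
        byte[] bytes = new byte[groups.length];
        for (int i = 0; i < groups.length; i++) {
            // parseInt yields 0..255; the cast wraps values above 127
            // into Java's signed byte range.
            bytes[i] = (byte) Integer.parseInt(groups[i], 2);
        }
        return new String(bytes, StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String input = "01000010 01000001 11000011 10110000 11100010 10001011 10110011";
        System.out.println(decodeBits(input));
    }
}
```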

What does String's getBytes("UTF-8") do?

String s = "ABCDEF⋳";

getBytes("UTF-8") encodes ABCDEF⋳ as UTF-8. How is the result stored?

A - 01000001

B - 01000010

C - 01000011

D - 01000100

E - 01000101

F - 01000110

⋳ - 11100010 10001011 10110011

Note: the above is how the string is stored in memory, as a sequence of bytes.

So getBytes("UTF-8") returns each of those bytes.
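To see these bit patterns directly, each byte returned by getBytes can be printed as eight binary digits. This is a small sketch with hypothetical names (class BitDump, method toBits); it pads each byte to 8 digits because Integer.toBinaryString drops leading zeros.

```java
import java.nio.charset.StandardCharsets;

class BitDump {
    // Returns the UTF-8 bytes of the text as space-separated 8-bit groups.
    public static String toBits(String text) {
        StringBuilder out = new StringBuilder();
        for (byte b : text.getBytes(StandardCharsets.UTF_8)) {
            if (out.length() > 0) {
                out.append(' ');
            }
            String bits = Integer.toBinaryString(b & 0xFF);
            // Left-pad with zeros so every byte shows exactly 8 digits.
            out.append("00000000".substring(bits.length())).append(bits);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toBits("ABCDEF\u22F3")); // "\u22F3" is ⋳
    }
}
```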

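How are they stored in memory?

A Java byte is a signed 8-bit value in two's complement, so 01000001 represents the positive number 65, but 11100010 represents the negative number -30.

So the values held in memory are:

01000001 - 65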

01000010 - 66

01000011 - 67

01000100 - 68

01000101 - 69

01000110 - 70

11100010 - -30

10001011 - -117

10110011 - -77
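The negative values come from Java's byte type being signed: any pattern whose top bit is 1 reads as a negative number. A minimal sketch of converting between the two views (the class SignedByteDemo and the method unsigned are hypothetical names; the standard library's Byte.toUnsignedInt does the same job):

```java
class SignedByteDemo {
    // Reinterprets a signed byte (-128..127) as an unsigned value (0..255).
    public static int unsigned(byte b) {
        return b & 0xFF;
    }

    public static void main(String[] args) {
        byte first = (byte) 0b11100010; // lead byte of ⋳ in UTF-8
        System.out.println(first);           // signed view
        System.out.println(unsigned(first)); // unsigned view
    }
}
```

Masking with 0xFF is the usual step before doing bit tests on a byte, since it avoids sign extension when the byte is widened to an int.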

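The code confirms this:

String s = "ABCDEF⋳";
// getBytes("UTF-8") declares UnsupportedEncodingException, so the enclosing method must handle or declare it
byte[] bs = s.getBytes("UTF-8");
// Print each byte as a signed decimal value
for (byte b : bs)
    System.out.print(b + ",");

Output:

65,66,67,68,69,70,-30,-117,-77,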
