Code point, code unit

Date: 2019-11-10


Let's start from a piece of API documentation. The Javadoc for String.length() reads:

length

public int length()
Returns the length of this string. The length is equal to the number of 16-bit Unicode characters in the string.

Specified by:
length in interface CharSequence

Returns:
the length of the sequence of characters represented by this object.

One sentence in it stands out:

The length is equal to the number of 16-bit Unicode characters in the string.

Taken literally: the value of length() equals the number of 16-bit Unicode characters in the string.

1. Why 16 bits?

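Unicode is a character set that covers nearly every language in use today, and it assigns each character a unique number. The most commonly used part of Unicode is the BMP (Basic Multilingual Plane), which spans U+0000 to U+FFFF and holds a large share of everyday characters. At the time of writing, Unicode was at version 8.0. Unicode was later extended well beyond the BMP, with supplementary planes running up to U+10FFFF.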

For BMP characters, U+0000 to U+FFFF, each Unicode number is written as four hexadecimal digits. Each hexadecimal digit takes four bits, so one BMP number takes 16 bits.

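So every character in the BMP is represented by 16 bits, and for a string of BMP characters the number of 16-bit units equals the number of characters. Characters outside the BMP do not fit in 16 bits: UTF-16, the encoding Java strings use, stores each of them as two 16-bit units (a surrogate pair), so each one counts as 2 in length().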

Reference: [Unicode BMP](https://en.wikipedia.org/wiki/Plane_(Unicode))

[Figure: correspondence between Unicode and UTF-8]
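To make the 16-bit argument concrete, here is a minimal sketch comparing length() with codePointCount(). The class name LengthVsCodePoints and its two helper methods are made up for illustration; the test strings are the article's example string (written with Unicode escapes) and U+1F600, chosen here as an assumption because it lies outside the BMP.

```java
public class LengthVsCodePoints {
    // Number of UTF-16 code units, which is what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points in the whole string.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // The article's example string: pi (U+03C0), wang (U+738B), A, 2, 3. All are in the BMP.
        String bmp = "\u03C0\u738B" + "A23";
        // U+1F600 is outside the BMP, so UTF-16 stores it as a surrogate pair.
        String mixed = "A" + new String(Character.toChars(0x1F600));

        System.out.println(codeUnits(bmp) + " " + codePoints(bmp));     // 5 5
        System.out.println(codeUnits(mixed) + " " + codePoints(mixed)); // 3 2
    }
}
```

For BMP-only text the two counts agree, which is why the Javadoc sentence reads naturally for most strings; they drift apart as soon as a supplementary character appears.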

2. What does codePoint mean in the String API?

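A code point is the number Unicode assigns to a character. For a BMP character that number fits in one 16-bit unit, so a single Java char holds the whole code point; code points above U+FFFF need two chars.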

For how code points and code units relate, see the Wikipedia article on code point.
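As a rough illustration of that relationship, the sketch below walks a string code point by code point and reports how many UTF-16 code units each one occupies. The class CodePointWalk, the method describe, and the sample string are all hypothetical; only String.codePointAt and Character.charCount come from the standard library.

```java
public class CodePointWalk {
    // Lists each code point of s as "U+XXXX:n", where n is the number of
    // UTF-16 code units (Java chars) that code point occupies.
    static String describe(String s) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < s.length(); ) {
            int cp = s.codePointAt(i);           // reads one or two chars starting at i
            int units = Character.charCount(cp); // 1 inside the BMP, 2 outside it
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(String.format("U+%04X:%d", cp, units));
            i += units;                          // step over the whole code point
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // A (U+0041), then U+1F600 (outside the BMP), then wang (U+738B).
        String s = "A" + new String(Character.toChars(0x1F600)) + "\u738B";
        System.out.println(describe(s)); // U+0041:1 U+1F600:2 U+738B:1
    }
}
```

Advancing the index by Character.charCount(cp) rather than by 1 is what keeps the loop from landing in the middle of a surrogate pair.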

3. What is a code unit?

The code unit size is equivalent to the bit measurement for the particular encoding:

A code unit in US-ASCII consists of 7 bits;
A code unit in UTF-8, EBCDIC and GB18030 consists of 8 bits;
A code unit in UTF-16 consists of 16 bits;
A code unit in UTF-32 consists of 32 bits.
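The code unit sizes quoted above can be observed from Java by encoding the same text and dividing the byte count by the unit size. This is only a sketch: the class CodeUnitsPerEncoding and its helpers are invented names, and it covers just UTF-8 and UTF-16 because those charsets are guaranteed by java.nio.charset.StandardCharsets.

```java
import java.nio.charset.StandardCharsets;

public class CodeUnitsPerEncoding {
    // UTF-8 code units are 8 bits wide, so one byte is one code unit.
    static int utf8Units(String s) {
        return s.getBytes(StandardCharsets.UTF_8).length;
    }

    // UTF-16 code units are 16 bits wide, so two bytes make one code unit.
    // UTF_16BE is used because it does not prepend a byte order mark.
    static int utf16Units(String s) {
        return s.getBytes(StandardCharsets.UTF_16BE).length / 2;
    }

    public static void main(String[] args) {
        String s = "\u03C0\u738B" + "A"; // pi (U+03C0), wang (U+738B), A: three code points
        System.out.println(utf8Units(s));  // 6 (2 + 3 + 1)
        System.out.println(utf16Units(s)); // 3

        String emoji = new String(Character.toChars(0x1F600)); // one code point outside the BMP
        System.out.println(utf8Units(emoji));  // 4
        System.out.println(utf16Units(emoji)); // 2
    }
}
```

The same three code points cost six code units in UTF-8 but three in UTF-16, which shows that a code unit count depends on the encoding while the code point count does not.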

Summary:

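Code point is a concept defined by Unicode: the number assigned to a character, for example 65 (U+0041) for A. Counting code points gives the number of characters. Code unit is a concept of the encoding: the fixed-size unit it stores, which is 16 bits in UTF-16, and a BMP character such as A takes exactly one of them.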

For example:

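public class CodePointDemo {
    public static void main(String[] args) {
        String s = "π王A23";
        // π is one code point (U+03C0), stored in one 16-bit code unit
        // 王 is one code point (U+738B), stored in one 16-bit code unit
        // A is one code point (U+0041), stored in one 16-bit code unit
        // 2 is one code point (U+0032), stored in one 16-bit code unit
        // 3 is one code point (U+0033), stored in one 16-bit code unit
        System.out.println("Length of string s: " + s.length());
        // codePointAt takes a char index; index 2 is the third character, A
        System.out.println("Third code point: " + s.codePointAt(2));
    }
}

Output:

Length of string s: 5
Third code point: 65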

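Here 5 means the string holds 5 Unicode characters, each stored as one 16-bit code unit. 65 is the code point of the letter A (U+0041), which is the third character.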
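One caveat about the example: codePointAt(2) takes a char (code unit) index, which matches the character position only because every character in the string is in the BMP. A minimal sketch of looking up the n-th code point regardless of surrogate pairs could use String.offsetByCodePoints; the class NthCodePoint, the method nthCodePoint, and the second test string are assumptions made for illustration.

```java
public class NthCodePoint {
    // Returns the n-th code point of s (0-based), counting code points, not chars.
    static int nthCodePoint(String s, int n) {
        int charIndex = s.offsetByCodePoints(0, n); // char index where the n-th code point starts
        return s.codePointAt(charIndex);
    }

    public static void main(String[] args) {
        String bmp = "\u03C0\u738B" + "A23"; // the article's string: pi, wang, A, 2, 3
        System.out.println(nthCodePoint(bmp, 2)); // 65, same as bmp.codePointAt(2)

        // U+1F600 takes two chars, so char indexes and character positions diverge.
        String withEmoji = new String(Character.toChars(0x1F600)) + "\u738B" + "A";
        System.out.println(withEmoji.codePointAt(2));   // 29579 (U+738B), the second character
        System.out.println(nthCodePoint(withEmoji, 2)); // 65, the third character
    }
}
```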

The best way to learn more about Unicode is to read the Wikipedia articles on it.
