On emoji and UTF-16 encoding

Date: 2019-12-01

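Yesterday a colleague on the iOS team ran into a tricky problem: when a text field contains emoji, how do you get the number of characters in it (counting each emoji as one character)?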

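Let me start with Java, which I have been working with recently. In Java, String's length method works fine for ordinary Chinese and English characters. But if a character's Unicode code point is greater than 0xFFFF, length no longer returns the correct number of characters: it reports UTF-16 code units, so each such character is counted as 2. Of course, Java already has a ready-made method for this: codePointCount.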
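A minimal sketch of the difference, assuming a test string that holds one emoji (U+1F600, written here as its surrogate pair) between two ordinary characters. The class name `LengthDemo` and its helper methods are made up for illustration:

```java
public class LengthDemo {
    // Number of UTF-16 code units, which is what String.length() reports.
    static int codeUnits(String s) {
        return s.length();
    }

    // Number of Unicode code points, so a surrogate pair is counted once.
    static int codePoints(String s) {
        return s.codePointCount(0, s.length());
    }

    public static void main(String[] args) {
        // 'a' + U+1F600 (grinning face) + U+4E2D (a Chinese character)
        String s = "a\uD83D\uDE00\u4E2D";
        System.out.println(codeUnits(s));  // 4
        System.out.println(codePoints(s)); // 3
    }
}
```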

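Unfortunately, after a long search I could not find an equivalent in Objective-C. (It seems that after a SubString call the length of the resulting array is the exact character count, but this still needs to be verified.)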

I am not an iOS programmer, so for now I cannot offer a solution in Objective-C. Still, yesterday's digging turned up a few things worth sharing.

1. Most emoji have Unicode code points above 0xFFFF, so they take 4 bytes in UTF-16. Only a small number of emoji have code points below 0xFFFF, and those take 2 bytes in UTF-16.

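2. On both Android and iOS, a string read from a text field is, by default, exposed to the program as a sequence of UTF-16 code units. (Byte order, big-endian or little-endian, only comes into play when those code units are serialized to bytes.)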

3. For reference, here are the UTF-16 encoding rules (once you understand them, solving the code point count problem yourself on iOS becomes straightforward):

   1) If U < 0x10000, encode U as a 16-bit unsigned integer and
      terminate.

   2) Let U' = U - 0x10000. Because U is less than or equal to 0x10FFFF,
      U' must be less than or equal to 0xFFFFF. That is, U' can be
      represented in 20 bits.

   3) Initialize two 16-bit unsigned integers, W1 and W2, to 0xD800 and
      0xDC00, respectively. These integers each have 10 bits free to
      encode the character value, for a total of 20 bits.

   4) Assign the 10 high-order bits of the 20-bit U' to the 10 low-order
      bits of W1 and the 10 low-order bits of U' to the 10 low-order
      bits of W2. Terminate.

   Graphically, steps 2 through 4 look like:
   U' = yyyyyyyyyyxxxxxxxxxx
   W1 = 110110yyyyyyyyyy
   W2 = 110111xxxxxxxxxx
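The rules above can be sketched in Java roughly as follows. This is an illustration under stated assumptions, not a reference implementation: the class `Utf16Sketch` and its methods are hypothetical names, `encode` follows steps 1 through 4 literally (it does not reject code points in the reserved range 0xD800 to 0xDFFF), and `countCodePoints` shows the hand-rolled counting idea, which is to count a W1 that is followed by a W2 as a single character.

```java
public class Utf16Sketch {
    // Encode one code point U into one or two 16-bit code units.
    static char[] encode(int u) {
        if (u < 0 || u > 0x10FFFF) {
            throw new IllegalArgumentException("not a Unicode code point: " + u);
        }
        if (u < 0x10000) {
            return new char[] { (char) u };           // step 1: a single unit
        }
        int uPrime = u - 0x10000;                     // step 2: fits in 20 bits
        char w1 = (char) (0xD800 | (uPrime >> 10));   // steps 3-4: high 10 bits
        char w2 = (char) (0xDC00 | (uPrime & 0x3FF)); // steps 3-4: low 10 bits
        return new char[] { w1, w2 };
    }

    // Count code points in a run of UTF-16 code units:
    // a W1 (0xD800..0xDBFF) followed by a W2 (0xDC00..0xDFFF) counts once.
    static int countCodePoints(char[] units) {
        int count = 0;
        for (int i = 0; i < units.length; i++) {
            char c = units[i];
            boolean isW1 = c >= 0xD800 && c <= 0xDBFF;
            if (isW1 && i + 1 < units.length) {
                char next = units[i + 1];
                if (next >= 0xDC00 && next <= 0xDFFF) {
                    i++; // skip the W2 that completes this pair
                }
            }
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        char[] face = encode(0x1F600);  // above 0xFFFF: two units, 4 bytes
        char[] heart = encode(0x2764);  // below 0x10000: one unit, 2 bytes
        System.out.println(Integer.toHexString(face[0]));  // d83d
        System.out.println(Integer.toHexString(face[1]));  // de00
        System.out.println(heart.length);                  // 1
        System.out.println(countCodePoints("a\uD83D\uDE00\u2764".toCharArray())); // 3
    }
}
```

The same loop translates directly to any platform that exposes a string as 16-bit code units. Note that it counts code points, so an emoji built from several code points (for example one joined with a modifier) would still be counted more than once.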
