Do You Really Know How to Reverse a String or Compute Its Length?

Date: 2019-11-10


The problem

A common question: how do you reverse a string?

A common answer:

'abcd'.split('').reverse().join('') // dcba

Another one: how do you get the length of a string?

Answer:

'abcd'.length // 4

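Neither answer is entirely correct; more precisely, neither works for every character. For example (here "ã" is written as "a" followed by the combining tilde U+0303):

'a💩bc'.split('').reverse().join('') // cb��a
'a💩bc'.length // 5

'aãbc'.split('').reverse().join('') // cb̃aa
'aãbc'.length // 5

The reason lies in how JavaScript encodes strings.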

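Unicode and encodings

Unicode is a standard covering the characters of all human writing systems, their encoding, and their presentation.

Unicode assigns every character a unique number called a "code point". In other words, Unicode uses an abstract number, the code point, to stand for a character. Unicode defines 1,114,112 code points, from 0 to 10FFFF in hexadecimal. A code point is usually written as "U+" followed by its hexadecimal value; for example, the code point of "A" is U+0041.¹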

When Unicode text is actually stored or transmitted, code points are usually encoded, for reasons such as reducing data size. The common encodings are UCS-2, UTF-16, and UTF-8.

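UCS-2: uses 16 bits to represent a code point. The range of code points now exceeds what 16 bits can represent.
UTF-16: for code points that fit in 16 bits it is identical to UCS-2; otherwise:
- subtract 0x010000 from the code point
- add 0xD800 to the top 10 bits of the result, and 0xDC00 to the low 10 bits
This yields two 16-bit values, in the ranges 0xD800 - 0xDBFF and 0xDC00 - 0xDFFF respectively. Together they represent the code point and are usually called a "surrogate pair".

The Unicode standard guarantees that every code point can be represented in UTF-16.
UTF-8:
- A code point up to 0x7F is encoded as itself, in one byte.
- A code point from 0x80 to 0x7FF takes two bytes: 110 + its top five bits, then 10 + the remaining six bits.
- A code point from 0x800 to 0xFFFF takes three bytes: 1110 + its top four bits, then two bytes of 10 + six bits each.
- The remaining code points take four bytes: 11110 + the top three bits, then three bytes of 10 + six bits each.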
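
The UTF-16 and UTF-8 rules above can be sketched in a few lines of JavaScript. The helper names `toUtf16CodeUnits` and `toUtf8Bytes` are made up for this illustration, and the sketch assumes a valid code point as input (it does not reject values above 0x10FFFF or inside the surrogate range).

```javascript
// Hypothetical helper: split a code point into UTF-16 code units.
function toUtf16CodeUnits(codePoint) {
  if (codePoint <= 0xFFFF) {
    return [codePoint]; // fits in one 16-bit unit, same as UCS-2
  }
  const offset = codePoint - 0x10000;
  const high = 0xD800 + (offset >> 10);  // top 10 bits
  const low = 0xDC00 + (offset & 0x3FF); // low 10 bits
  return [high, low];
}

// Hypothetical helper: encode a code point as UTF-8 bytes.
function toUtf8Bytes(codePoint) {
  if (codePoint <= 0x7F) {
    return [codePoint];
  }
  if (codePoint <= 0x7FF) {
    return [0xC0 | (codePoint >> 6), 0x80 | (codePoint & 0x3F)];
  }
  if (codePoint <= 0xFFFF) {
    return [
      0xE0 | (codePoint >> 12),
      0x80 | ((codePoint >> 6) & 0x3F),
      0x80 | (codePoint & 0x3F),
    ];
  }
  return [
    0xF0 | (codePoint >> 18),
    0x80 | ((codePoint >> 12) & 0x3F),
    0x80 | ((codePoint >> 6) & 0x3F),
    0x80 | (codePoint & 0x3F),
  ];
}

const toHex = (values) => values.map((v) => v.toString(16));

console.log(toHex(toUtf16CodeUnits(0x1F4A9))); // [ 'd83d', 'dca9' ]
console.log(toHex(toUtf8Bytes(0x1F4A9)));      // [ 'f0', '9f', '92', 'a9' ]
console.log(toHex(toUtf8Bytes(0x41)));         // [ '41' ]
```

U+1F4A9 is the emoji used in the examples of this article; its surrogate pair 0xD83D 0xDCA9 is the one that appears later.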

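Terminology

Unicode has many concepts that are worth clarifying. They are not central to this article, but they help with understanding encodings and with further study.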

character:

The smallest component of written language that has semantic value; refers to the abstract meaning and/or shape, rather than a specific shape (see also glyph), though in code tables some form of visual representation is essential for the reader’s understanding.

grapheme:

A minimally distinctive unit of writing in the context of a particular writing system

For example, in English <b> and <d> are two different graphemes, while <a> and <ɑ> are the same grapheme, just different renderings of the letter a.

A grapheme can be represented by one or more code points; for example, "ç" can be written as the code points U+0063 U+0327:

String.fromCodePoint(0x0063, 0x0327); // ç

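Several graphemes may also be represented by a single code point; for example, "ﷺ" has the code point U+FDFA, yet it is made up of multiple graphemes.

String.fromCodePoint(0xFDFA); // ﷺ

glyph: the visual representation of a grapheme.

So what we usually mean by a "character" is a grapheme, while things like typeface and font size belong to the glyph.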
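
As a small illustration of one grapheme having more than one code point representation, the sketch below builds "ç" from U+0063 U+0327 and then composes it with `String.prototype.normalize`. Normalization is not discussed in the original text; it is used here only to show that both forms exist.

```javascript
// 'c' followed by the combining cedilla: two code points, one grapheme.
const decomposed = String.fromCodePoint(0x0063, 0x0327);

// NFC normalization composes the pair into the single code point U+00E7.
const composed = decomposed.normalize('NFC');

console.log(decomposed.length);                    // 2
console.log(composed.length);                      // 1
console.log(composed.codePointAt(0).toString(16)); // e7
console.log(decomposed === composed);              // false
```

Both strings render as the same grapheme, yet they differ in length and do not compare equal.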

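The cause

ECMAScript does not strictly prescribe how strings are encoded, although most engines use UTF-16. JavaScript's own definition of a character, however, is the following (note the difference from the Unicode "character"):

the word "character" will be used to refer to a 16-bit unsigned value used to represent a single 16-bit unit of text ²

Loosely speaking, a string is therefore a sequence of 16-bit characters (in this respect it resembles UCS-2). These 16-bit values are also called code units.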

'a💩bc'[0] // a
'a💩bc'[1] // � (lone surrogate 0xD83D)
'a💩bc'[2] // � (lone surrogate 0xDCA9)
'a💩bc'[3] // b
'a💩bc'[4] // c

'aãbc'[0] // a
'aãbc'[1] // a
'aãbc'[2] //  ̃
'aãbc'[3] // b
'aãbc'[4] // c
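
To see the code units directly instead of how a console renders them, one option is to print each unit in hexadecimal. The helper name `codeUnits` is made up for this sketch, and the strings use escape sequences so that the example does not depend on the file encoding.

```javascript
// Hypothetical helper: list every 16-bit code unit of a string in hex.
function codeUnits(str) {
  const units = [];
  for (let i = 0; i < str.length; i++) {
    units.push(str.charCodeAt(i).toString(16).padStart(4, '0'));
  }
  return units;
}

console.log(codeUnits('a\u{1F4A9}bc')); // [ '0061', 'd83d', 'dca9', '0062', '0063' ]
console.log(codeUnits('aa\u0303bc'));   // [ '0061', '0061', '0303', '0062', '0063' ]
```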

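The code point of "💩" does not fit in 16 bits, so it is represented by a UTF-16 surrogate pair, that is, two 16-bit values. Many internal operations, however, still work on 16-bit characters. For example:

'a💩bc'.length === 5

This is what causes the string reversal problem shown above:

String.fromCodePoint(0xD83D, 0xDCA9) // 💩

Reversing 0xD83D 0xDCA9 into 0xDCA9 0xD83D produces an invalid string.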
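
One way to confirm that the naive reversal leaves the pair in the wrong order is to look for surrogates that are not part of a well-formed pair. The helper `hasLoneSurrogate` below is a hypothetical name for this sketch; it relies on regular expression lookbehind, which is assumed to be available in the runtime.

```javascript
// Matches a high surrogate not followed by a low surrogate,
// or a low surrogate not preceded by a high surrogate.
const loneSurrogate =
  /[\uD800-\uDBFF](?![\uDC00-\uDFFF])|(?<![\uD800-\uDBFF])[\uDC00-\uDFFF]/;

// Hypothetical helper: true if the string contains an unpaired surrogate.
function hasLoneSurrogate(str) {
  return loneSurrogate.test(str);
}

const original = 'a\u{1F4A9}bc';
const naive = original.split('').reverse().join('');

console.log(hasLoneSurrogate(original)); // false
console.log(hasLoneSurrogate(naive));    // true
console.log(naive === 'cb\uDCA9\uD83Da'); // true
```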

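"ã", on the other hand, is a single character composed of the character "a" and the combining mark " ̃":

String.fromCodePoint(0x0061, 0x0303) // ã

Similarly, reversing it in 16-bit units breaks it, because the combining mark ends up attached to a different character.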

The solution

Given how UTF-16 defines surrogate pairs and where the combining marks sit in the code point space, we can handle string reversal ourselves.

Taking surrogate pairs as an example:

const regexSurrogatePair = /([\uD800-\uDBFF])([\uDC00-\uDFFF])/g

const reverse = (string) => {
  return string.replace(regexSurrogatePair, ($0, $1, $2) => {
    return $2 + $1 // swap each surrogate pair first, so the final reverse restores its order
  }).split('').reverse().join('')
}

reverse('a💩bc') // cb💩a

For a more complete implementation, see the esrever library.
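
The surrogate pair example does not cover combining marks. A minimal sketch that handles both cases is shown below; it is not how esrever is implemented, and the name `reverseClusters` is made up here. It assumes an engine with Unicode property escapes (`\p{M}`) and treats a base code point plus any following marks as one unit, which is a simplification of full grapheme cluster rules.

```javascript
// A base code point followed by any number of combining marks,
// or a run of marks with no base in front of it.
const clusterPattern = /\P{M}\p{M}*|\p{M}+/gu;

// Hypothetical helper: reverse a string cluster by cluster.
function reverseClusters(str) {
  const clusters = str.match(clusterPattern) || [];
  return clusters.reverse().join('');
}

console.log(reverseClusters('a\u{1F4A9}bc') === 'cb\u{1F4A9}a'); // true
console.log(reverseClusters('aa\u0303bc') === 'cba\u0303a');     // true
```

Because the pattern uses the `u` flag, a surrogate pair is matched as one code point and is never split.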

As for the length problem:

[...'a💩bc'].length // 4

or

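let count = 0

for (let codePoint of 'a💩bc') {
  count++
}

count // 4

This works because String.prototype[@@iterator]() iterates over code points. Note that it counts code points, not graphemes, so "ã" written as "a" plus U+0303 still counts as two.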
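
Counting what a reader sees as characters needs grapheme segmentation rather than code point iteration. The sketch below uses `Intl.Segmenter` when the runtime provides it and otherwise falls back to a rough base-plus-marks pattern; `countGraphemes` is a hypothetical name, and the fallback is an approximation that assumes Unicode property escapes are supported.

```javascript
// Hypothetical helper: count user-perceived characters in a string.
function countGraphemes(str) {
  if (typeof Intl !== 'undefined' && typeof Intl.Segmenter === 'function') {
    const segmenter = new Intl.Segmenter(undefined, { granularity: 'grapheme' });
    let count = 0;
    for (const _segment of segmenter.segment(str)) {
      count++;
    }
    return count;
  }
  // Fallback: a base code point plus its combining marks counts as one.
  const clusters = str.match(/\P{M}\p{M}*|\p{M}+/gu);
  return clusters ? clusters.length : 0;
}

console.log([...'aa\u0303bc'].length);       // 5 (code points)
console.log(countGraphemes('aa\u0303bc'));   // 4
console.log(countGraphemes('a\u{1F4A9}bc')); // 4
```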

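Summary

JavaScript string indexing and length work on 16-bit units (as in UCS-2) rather than on code points. As a result, characters whose code points need more than 16 bits (non-BMP) cause problems in some operations, such as reversal and taking the length, and need special handling.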
