Converting NCR (Numeric Character Reference) Characters to Actual Characters

Date: 2019-12-04


During development I ran into a strange-looking encoding format:

&#27599;&#26085;&#19968;&#33394;|&#34013;&#30333;~

Decoding it with decode/unescape/decodeURI had no effect. After some digging, here is a summary of what I found.

In fact, the odd-looking format above is not an encoding at all, but a markup construct called an NCR (Numeric Character Reference).

Numeric Character Reference

Here is Wikipedia's explanation:

A numeric character reference (NCR) is a common markup construct used in SGML and other SGML-related markup languages such as HTML and XML. It consists of a short sequence of characters that, in turn, represents a single character from the Universal Character Set (UCS) of Unicode.

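In other words, an NCR is a short sequence of characters, used in SGML and SGML-like markup languages such as HTML and XML, that stands for one character from the universal character set.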

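An NCR consists of an ampersand (&) followed by a number sign (#), then the Unicode code value of the character, and finally a semicolon. The related forms look like this: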

&#dddd;
&#xhhhh;
&name;

Here dddd is the decimal representation of the character's code value, and hhhh is its hexadecimal representation.
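To make the structure concrete, here is a minimal sketch that builds both numeric forms for a single character. The helper names toDecimalNcr and toHexNcr are hypothetical and do not come from the original post.

```javascript
// Hypothetical helpers (not from the original post): build the NCR for one character.
function toDecimalNcr(ch) {
  // codePointAt(0) returns the Unicode code point of the first character.
  return '&#' + ch.codePointAt(0) + ';';
}

function toHexNcr(ch) {
  // toString(16) gives the hexadecimal digits that follow "&#x".
  return '&#x' + ch.codePointAt(0).toString(16) + ';';
}

console.log(toDecimalNcr('每')); // &#27599;
console.log(toHexNcr('每'));     // &#x6bcf;
```

Both outputs refer to the same character; only the base of the number differs.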

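Taking HTML as an example, all three of these escape sequences are called character references:
The first two are numeric character references (NCRs), whose number is the Unicode code point of the target character; a reference that starts with "&#" is followed by decimal digits, and one that starts with "&#x" is followed by hexadecimal digits.
The last one is a character entity reference, which is followed by a predefined entity name; the entity declares the character it stands for.
Since HTML 4, NCRs are based on Unicode and are independent of the document's encoding.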

The two characters of "中国" are the Unicode characters U+4E2D and U+56FD. The hexadecimal code point values "4E2D" and "56FD" are "20013" and "22269" in decimal. So:

&#x4e2d;&#x56fd;
&#20013;&#22269;

Both of these NCR spellings are rendered as the two characters "中国".
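The equivalence of the two spellings can be checked with a few lines of JavaScript. This is only an illustrative sketch; the variable names are made up for the example.

```javascript
// The hexadecimal digits and the decimal digits name the same code points.
const hexValues = ['4e2d', '56fd'].map((h) => parseInt(h, 16));
console.log(hexValues); // [ 20013, 22269 ]

// Both spellings therefore produce the same string.
const fromHex = String.fromCodePoint(0x4e2d, 0x56fd);
const fromDecimal = String.fromCodePoint(20013, 22269);
console.log(fromHex === fromDecimal); // true
console.log(fromHex); // 中国
```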

How to Convert NCR Characters to Actual Characters

The approach is as follows:

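// Matches decimal NCRs such as &#27599; and captures the digits.
var regex_num_set = /&#(\d+);/g;
var str = "Here is some text: &#27599;&#26085;&#19968;&#33394;|&#34013;&#30333;~";

// Replace each NCR with the character for the captured code value.
str = str.replace(regex_num_set, function(_, $1) {
  return String.fromCharCode(parseInt($1, 10));
});

// Show the decoded string on the page.
document.write('<pre>' + JSON.stringify(str, null, 3) + '</pre>');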

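The example above uses String.prototype.replace() and String.fromCharCode(). The idea is to take each NCR in the string, extract the Unicode code value between "&#" and ";", and then use String.fromCharCode() to turn that value into the actual character. Because the pattern matches only decimal digits, hexadecimal references of the form &#xhhhh; are left unchanged, and String.fromCharCode() only handles values up to 0xFFFF.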
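As a possible extension of the example above, here is a minimal sketch that also accepts the hexadecimal form and code points above U+FFFF. The function name decodeNcr is hypothetical, and the sketch swaps String.fromCharCode() for String.fromCodePoint(), which is not what the original example uses.

```javascript
// Hypothetical helper (the name is an assumption): decode decimal and hexadecimal NCRs.
function decodeNcr(input) {
  // Group 1 captures hex digits after "&#x"; group 2 captures decimal digits after "&#".
  const pattern = /&#(?:[xX]([0-9a-fA-F]+)|(\d+));/g;
  return input.replace(pattern, (match, hex, dec) => {
    const codePoint = hex !== undefined ? parseInt(hex, 16) : parseInt(dec, 10);
    // Leave out-of-range references untouched instead of throwing a RangeError.
    if (codePoint > 0x10ffff) return match;
    return String.fromCodePoint(codePoint);
  });
}

console.log(decodeNcr('&#27599;&#26085;&#19968;&#33394;|&#34013;&#30333;~')); // 每日一色|蓝白~
console.log(decodeNcr('&#x4e2d;&#x56fd;')); // 中国
console.log(decodeNcr('&#x1F600;')); // 😀
```

Named references such as &amp; are deliberately not handled here, since resolving them requires a table of entity names.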

Blog post: http://joebon.cc/convert-numeric-chracter-reference-to-actual-character