How does a server determine which character encoding applies for a document it serves? Some servers examine the first few bytes of the document, or check against a database of known files and encodings. Many modern servers give Web masters more control over charset configuration than old servers do. Web masters should use these mechanisms to send out a "charset" parameter whenever possible, but should take care not to identify a document with the wrong "charset" parameter value.
How does a user agent know which character encoding has been used? The server should provide this information. The most straightforward way for a server to inform the user agent about the character encoding of the document is to use the "charset" parameter of the "Content-Type" header field of the HTTP protocol ([RFC2616], sections 3.4 and 14.17). For example, the following HTTP header announces that the character encoding is EUC-JP:
Content-Type: text/html; charset=EUC-JP
Please consult the section on conformance for the definition of text/html.
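As an illustration of how a client might read the "charset" parameter from a "Content-Type" value, here is a minimal Python sketch. The function name charset_from_content_type is hypothetical and not part of the specification; the parsing is delegated to the standard library's email.message.Message, which understands MIME-style header parameters.

```python
from email.message import Message


def charset_from_content_type(header_value):
    """Return the charset parameter of a Content-Type value, or None.

    Hypothetical helper for illustration only.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    # get_param() returns None when the parameter is absent.
    return msg.get_param("charset")


print(charset_from_content_type("text/html; charset=EUC-JP"))
print(charset_from_content_type("text/html"))
```

Returning None when the parameter is missing, rather than substituting a default, leaves the fallback decision to the caller.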
The HTTP protocol ([RFC2616], section 3.7.1) mentions ISO-8859-1 as a default character encoding when the "charset" parameter is absent from the "Content-Type" header field. In practice, this recommendation has proved useless because some servers don't allow a "charset" parameter to be sent, and others may not be configured to send the parameter. Therefore, user agents must not assume any default value for the "charset" parameter.
To address server or configuration limitations, HTML documents may include explicit information about the document's character encoding; the META element can be used to provide user agents with this information.
For example, to specify that the character encoding of the current document is "EUC-JP", a document should include the following META declaration:
<META http-equiv="Content-Type" content="text/html; charset=EUC-JP">
The META declaration must only be used when the character encoding is organized such that ASCII-valued bytes stand for ASCII characters (at least until the META element is parsed). META declarations should appear as early as possible in the HEAD element.
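Because ASCII-valued bytes must stand for ASCII characters up to the META element, a user agent can look for the declaration in the raw bytes before it knows the encoding. The following Python sketch shows one way to do that. The function name charset_from_meta, the 1024-byte limit, and the assumption that http-equiv precedes content are all illustrative choices, not requirements of the specification.

```python
import re

# Matches <META http-equiv="Content-Type" content="...charset=NAME">.
# Assumes http-equiv comes before content; a fuller parser would not.
META_CHARSET = re.compile(
    rb"""<meta\s+http-equiv\s*=\s*["']?content-type["']?\s+"""
    rb"""content\s*=\s*["'][^"']*charset\s*=\s*([A-Za-z0-9._:-]+)""",
    re.IGNORECASE,
)


def charset_from_meta(raw, limit=1024):
    """Scan the first `limit` bytes for a META charset declaration."""
    match = META_CHARSET.search(raw[:limit])
    return match.group(1).decode("ascii") if match else None


page = (b'<HTML><HEAD><META http-equiv="Content-Type" '
        b'content="text/html; charset=EUC-JP"></HEAD></HTML>')
print(charset_from_meta(page))
```

Scanning only a bounded prefix reflects the advice that META declarations appear as early as possible in the HEAD element.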
For cases where neither the HTTP protocol nor the META element provides information about the character encoding of a document, HTML also provides the charset attribute on several elements. By combining these mechanisms, an author can greatly improve the chances that, when the user retrieves a resource, the user agent will recognize the character encoding.
對於既不採用HTTP協議方式,也不採用META元素方式提供字符編碼信息的狀況,HTML對於不少元素提供了charset屬性來指定字符編碼。這些的組合使用,在用戶檢索資源時,做者能夠極大地提升用戶代理識別文檔字符編碼的機會。
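To sum up, conforming user agents must observe the following priorities when determining a document's character encoding (from highest priority to lowest):
1. An HTTP "charset" parameter in a "Content-Type" field.
2. A META declaration with "http-equiv" set to "Content-Type" and a value set for "charset".
3. The charset attribute set on an element that designates an external resource.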
In addition to this list of priorities, the user agent may use heuristics and user settings. For example, many user agents use a heuristic to distinguish the various encodings used for Japanese text. Also, user agents typically have a user-definable, local default character encoding which they apply in the absence of other indicators.
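The precedence among the mechanisms described in this section (the HTTP "charset" parameter, then a META declaration, then a charset attribute on the referring element, and finally the user's local default) can be sketched in Python as follows. The function and parameter names are hypothetical, and heuristics such as Japanese encoding detection are deliberately left out.

```python
def pick_encoding(http_charset=None, meta_charset=None,
                  attribute_charset=None, local_default=None):
    """Return the first available encoding label, highest priority first.

    Hypothetical helper: each argument is an encoding label or None.
    """
    for candidate in (http_charset, meta_charset, attribute_charset):
        if candidate:
            return candidate
    # No declared encoding: fall back to the user-definable default.
    return local_default


print(pick_encoding(meta_charset="EUC-JP", local_default="ISO-8859-1"))
print(pick_encoding(http_charset="UTF-8", meta_charset="EUC-JP"))
```

The local default is passed in explicitly rather than hard-coded, since user agents must not assume a default value for the "charset" parameter.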
User agents may provide a mechanism that allows users to override incorrect "charset" information. However, if a user agent offers such a mechanism, it should only offer it for browsing and not for editing, to avoid the creation of Web pages marked with an incorrect "charset" parameter.
Note. If, for a specific application, it becomes necessary to refer to characters outside [ISO10646], characters should be assigned to a private zone to avoid conflicts with present or future versions of the standard. This is highly discouraged, however, for reasons of portability.
A given character encoding may not be able to express all characters of the document character set. For such encodings, or when hardware or software configurations do not allow users to input some document characters directly, authors may use SGML character references. Character references are a character encoding-independent mechanism for entering any character from the document character set.
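To illustrate how character references let a limited encoding carry any character, the following Python sketch encodes text as ASCII and substitutes a decimal character reference for every character ASCII cannot express. It relies on the standard "xmlcharrefreplace" error handler; the function name is hypothetical.

```python
def to_ascii_with_references(text):
    """Encode text as ASCII, using &#D; references for other characters."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


# U+00E5 (a with ring) and U+6C34 (Chinese character for water)
print(to_ascii_with_references("\u00e5 \u6c34"))  # prints: &#229; &#27700;
```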
Character references in HTML may appear in two forms:
- Numeric character references (either decimal or hexadecimal).
- Character entity references.
Character references within comments have no special meaning; they are comment data only.
Note. HTML provides other ways to present character data, in particular inline images.
Note. In SGML, it is possible to eliminate the final ";" after a character reference in some cases (e.g., at a line break or immediately before a tag). In other circumstances it may not be eliminated (e.g., in the middle of a word). We strongly suggest using the ";" in all cases to avoid problems with user agents that require this character to be present.
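Numeric character references specify the code position of a character in the document character set. Numeric character references may take two forms:
- The syntax "&#D;", where D is a decimal number, refers to the ISO 10646 decimal character number D.
- The syntax "&#xH;" or "&#XH;", where H is a hexadecimal number, refers to the ISO 10646 hexadecimal character number H. Hexadecimal numbers in numeric character references are case-insensitive.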
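Here are some examples of numeric character references:
- &#229; (in decimal) represents the letter "a" with a small circle above it (used, for example, in Norwegian).
- &#xE5; (in hexadecimal) represents the same character.
- &#Xe5; (in hexadecimal) represents the same character as well.
- &#1048; (in decimal) represents the Cyrillic capital letter "I".
- &#x6C34; (in hexadecimal) represents the Chinese character for water.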
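The decimal and hexadecimal forms of numeric character references can be decoded with a short routine. The following Python sketch is an illustration only: the function name is hypothetical, it requires the final ";", and it does not validate that the number is a legal code position.

```python
import re

# &#D; (decimal) or &#xH; / &#XH; (hexadecimal, case-insensitive digits)
NUMERIC_REF = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")


def decode_numeric_references(text):
    """Replace each numeric character reference with its character."""
    def replace(match):
        hex_digits, dec_digits = match.groups()
        code = int(hex_digits, 16) if hex_digits else int(dec_digits)
        return chr(code)
    return NUMERIC_REF.sub(replace, text)


# All three spellings denote the same character, U+00E5.
print(decode_numeric_references("&#229; &#xE5; &#Xe5;"))
```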
Note. Although the hexadecimal representation is not defined in [ISO8879], it is expected to be in the revision, as described in [WEBSGML]. This convention is particularly useful since character standards generally use hexadecimal representations.
In order to give authors a more intuitive way of referring to characters in the document character set, HTML offers a set of character entity references. Character entity references use symbolic names so that authors need not remember code positions. For example, the character entity reference &aring; refers to the lowercase "a" character topped with a ring; "&aring;" is easier to remember than &#229;.
HTML 4 does not define a character entity reference for every character in the document character set. For instance, there is no character entity reference for the Cyrillic capital letter "I". Please consult the full list of character references defined in HTML 4.
Character entity references are case-sensitive. Thus, &Aring; refers to a different character (uppercase A, ring) than &aring; (lowercase a, ring).
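The case sensitivity of entity names can be checked against Python's standard table of HTML 4 entity names, html.entities.name2codepoint. This is only a lookup sketch; the helper name is hypothetical.

```python
from html.entities import name2codepoint


def entity_to_char(name):
    """Return the character for an HTML 4 entity name, or None if undefined."""
    code = name2codepoint.get(name)
    return chr(code) if code is not None else None


print(name2codepoint["aring"])  # 229, lowercase a with ring
print(name2codepoint["Aring"])  # 197, uppercase A with ring
```

A name that differs only in case is a different key, so "aring" and "Aring" resolve to different characters.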
Four character entity references deserve special mention since they are frequently used to escape special characters:
- "&lt;" represents the < sign.
- "&gt;" represents the > sign.
- "&amp;" represents the & sign.
- "&quot;" represents the " mark.
Authors wishing to put the "<" character in text should use "&lt;" (ASCII decimal 60) to avoid possible confusion with the beginning of a tag (start tag open delimiter). Similarly, authors should use "&gt;" (ASCII decimal 62) in text instead of ">" to avoid problems with older user agents that incorrectly perceive this as the end of a tag (tag close delimiter) when it appears in quoted attribute values.
Authors should use "&amp;" (ASCII decimal 38) instead of "&" to avoid confusion with the beginning of a character reference (entity reference open delimiter). Authors should also use "&amp;" in attribute values since character references are allowed within CDATA attribute values.
Some authors use the character entity reference "&quot;" to encode instances of the double quote mark (") since that character may be used to delimit attribute values.
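For authors who generate markup programmatically, the four escapes discussed above can be applied with Python's standard html.escape function, as in this minimal sketch. The wrapper name is hypothetical. Note that with quote=True the function also escapes the single quote as &#x27;, which goes beyond the four references described here.

```python
import html


def escape_for_html(text):
    """Escape &, <, > and the double quote mark for use in text or attributes."""
    return html.escape(text, quote=True)


print(escape_for_html('a < b & "c"'))  # prints: a &lt; b &amp; &quot;c&quot;
```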
A user agent may not be able to render all characters in a document meaningfully, for instance, because the user agent lacks a suitable font, a character has a value that may not be expressed in the user agent's internal character encoding, etc.
Because there are many different things that may be done in such cases, this document does not prescribe any specific behavior. Depending on the implementation, undisplayable characters may also be handled by the underlying display system and not the application itself. In the absence of more sophisticated behavior, for example tailored to the needs of a particular script or language, we recommend the following behavior for user agents: