Python: Converting Between Unicode and Plain Strings

Date: 2019-11-10

1.1. Problem

You need to deal with data that doesn't fit in the ASCII character set.

1.2. Solution

Unicode strings can be encoded as plain strings in a variety of ways, according to whichever encoding you choose:

# Python 2: unicode and str are separate types.
# Convert Unicode to a plain Python string: "encode"
unicodestring = u"Hello world"
utf8string = unicodestring.encode("utf-8")
asciistring = unicodestring.encode("ascii")
isostring = unicodestring.encode("ISO-8859-1")
utf16string = unicodestring.encode("utf-16")

# Convert a plain Python string to Unicode: "decode"
plainstring1 = unicode(utf8string, "utf-8")
plainstring2 = unicode(asciistring, "ascii")
plainstring3 = unicode(isostring, "ISO-8859-1")
plainstring4 = unicode(utf16string, "utf-16")

# Despite their names, plainstring1..4 are Unicode objects, and all four are equal.
assert plainstring1 == plainstring2 == plainstring3 == plainstring4
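
The recipe above targets Python 2, where unicode and str are separate types. As a rough sketch of the same round trip on Python 3 (an assumption about the reader's interpreter, not part of the original recipe), str holds Unicode text and bytes plays the role of the plain string. The variable names here are illustrative only:

```python
# Python 3 sketch: str is Unicode text, bytes is the "plain" byte string.
unicode_text = "Hello world"

# Encode: text -> bytes, once per encoding name used in the recipe.
encoded = {
    name: unicode_text.encode(name)
    for name in ("utf-8", "ascii", "ISO-8859-1", "utf-16")
}

# Decode: bytes -> text, using the same encoding name that produced the bytes.
decoded = {name: data.decode(name) for name, data in encoded.items()}

# Every round trip recovers the original text.
assert all(text == unicode_text for text in decoded.values())
```

The point of the sketch is that encoding and decoding are inverse operations only when both sides agree on the encoding name.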

1.3. Discussion

If you find yourself dealing with text that contains non-ASCII characters, you have to learn about Unicode: what it is, how it works, and how Python uses it.

Unicode is a big topic. Luckily, you don't need to know everything about Unicode to be able to solve real-world problems with it: a few basic bits of knowledge are enough. First, you must understand the difference between bytes and characters. In older, ASCII-centric languages and environments, bytes and characters are treated as the same thing. Since a byte can hold only 256 distinct values, these environments are limited to 256 characters. Unicode, on the other hand, defines far more: well over a hundred thousand characters. That means a Unicode character cannot, in general, fit in a single byte, so you need to make the distinction between characters and bytes.
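
To make the difference between characters and bytes concrete, here is a small sketch. It assumes Python 3 (where str is Unicode text) and UTF-8 as the encoding; the sample text is made up for illustration:

```python
# Python 3 sketch: one character can occupy several bytes once encoded.
# "na\u00efve" is "naive" with a diaeresis on the i; "\u4e2d\u6587" is two CJK characters.
text = "na\u00efve \u4e2d\u6587"

utf8_bytes = text.encode("utf-8")

print(len(text))        # number of characters
print(len(utf8_bytes))  # number of bytes in the UTF-8 encoding
```

The two counts differ because the ASCII letters take one byte each in UTF-8, while the accented letter and the CJK characters take more.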

In Python 2, standard strings are really byte strings, and a Python character is really a byte. Other terms for the standard Python type are "8-bit string" and "plain string." In this recipe we will call them byte strings, to remind you of their byte-orientedness.

Conversely, a Python Unicode character is an abstract object big enough to hold the character, analogous to Python's long integers. You don't have to worry about the internal representation; the representation of Unicode characters becomes an issue only when you are trying to send them to some byte-oriented function, such as the write method for files or the send method for network sockets. At that point, you must choose how to represent the characters as bytes. Converting from Unicode to a byte string is called encoding the string. Similarly, when you load Unicode strings from a file, socket, or other byte-oriented object, you need to decode the strings from bytes to characters.
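
The encode-on-write, decode-on-read pattern described above might look like the following sketch. It assumes Python 3, UTF-8 as the chosen encoding, and a scratch file created only for this illustration:

```python
import os
import tempfile

text = "caf\u00e9"  # "cafe" with an acute accent: four characters

# Hypothetical scratch file, used only for this illustration.
path = os.path.join(tempfile.mkdtemp(), "example.txt")

# A file opened in binary mode accepts only bytes, so encode explicitly.
with open(path, "wb") as f:
    f.write(text.encode("utf-8"))

# Reading in binary mode returns bytes; decode them back into text.
with open(path, "rb") as f:
    raw = f.read()
restored = raw.decode("utf-8")

assert restored == text
```

The same choice has to be made at a socket boundary: the send method takes bytes, so the text must be encoded first and decoded again on the receiving side.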

There are many ways of converting Unicode objects to byte strings, each of which is called an encoding. For a variety of historical, political, and technical reasons, there is no one "right" encoding. Every encoding has a case-insensitive name, and that name is passed to the encode and decode methods as a parameter. Here are a few you should know about:
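
As a quick check of the claim that encoding names are case-insensitive, the sketch below uses the standard codecs module on Python 3 (an assumption; the recipe itself is written for Python 2). The name "no-such-encoding" is made up to show what happens with an unknown name:

```python
import codecs

# Spellings that differ only in case should resolve to the same codec.
utf8_names = {codecs.lookup(s).name for s in ("utf-8", "UTF-8", "Utf-8")}
latin1_names = {codecs.lookup(s).name for s in ("iso-8859-1", "ISO-8859-1")}
assert len(utf8_names) == 1
assert len(latin1_names) == 1

# The same holds when the name is passed to encode() and decode().
data = "Hello world".encode("UTF-8")
assert data.decode("utf-8") == "Hello world"

# An unknown name raises LookupError. "no-such-encoding" is a made-up name.
try:
    codecs.lookup("no-such-encoding")
except LookupError:
    unknown_rejected = True
else:
    unknown_rejected = False
```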

The UTF-8 encoding can handle any Unicode character. It is also backward compatible with ASCII, so a pure ASCII file can also be considered a UTF-8 file, and a UTF-8 file that happens to use only ASCII characters is identical to an ASCII file with the same characters. This property makes UTF-8 very backward-compatible, especially with older Unix tools. UTF-8 is far and away the dominant encoding on Unix. Its primary weakness is that it is fairly inefficient for Eastern texts: most Chinese, Japanese, and Korean characters take three bytes each in UTF-8.

The UTF-16 encoding is favored by Microsoft operating systems and the Java environment. It is less efficient for Western languages but more efficient for Eastern ones. A restricted, fixed-width form of UTF-16 is sometimes known as UCS-2.

The ISO-8859 series of encodings are 256-character ASCII supersets. They cannot support all of the Unicode characters; each can support only some particular language or family of languages. ISO-8859-1, also known as Latin-1, covers most Western European and African languages, but not Arabic. ISO-8859-2, also known as Latin-2, covers many Eastern European languages such as Hungarian and Polish.
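
The trade-offs between these encodings can be measured directly. The following is a minimal sketch, assuming Python 3; the helper encoded_size and the two sample strings are invented for illustration. It uses "utf-16-le" rather than "utf-16" so that the two-byte byte-order mark does not inflate the counts:

```python
# Python 3 sketch comparing how many bytes each encoding needs.
western = "Hello world"
eastern = "\u4f60\u597d\u4e16\u754c"  # four CJK characters ("hello world" in Chinese)


def encoded_size(text, encoding):
    """Return the number of bytes text occupies in the given encoding."""
    return len(text.encode(encoding))


for label, sample in (("western", western), ("eastern", eastern)):
    print(label, encoded_size(sample, "utf-8"), encoded_size(sample, "utf-16-le"))

# ISO-8859-1 has no code for CJK characters, so encoding fails outright.
try:
    eastern.encode("ISO-8859-1")
except UnicodeEncodeError:
    latin1_ok = False
else:
    latin1_ok = True
```

For the ASCII sample, UTF-8 is half the size of UTF-16; for the CJK sample, UTF-16 is the smaller of the two, and ISO-8859-1 cannot represent the text at all.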

If you want to be able to encode all Unicode characters, you probably want to use UTF-8. You will probably need to deal with the other encodings only when you are handed data in those encodings created by some other application.
