    use strict;
    use Encode;

    my $str = "中國";
    Encode::_utf8_on($str);
    print length($str) . "\n";
    Encode::_utf8_off($str);
    print length($str) . "\n";

The output is:

    2
    6
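Here we use the Encode module's _utf8_on and _utf8_off functions to switch the utf8 flag of the string "中國" on and off. With the utf8 flag on, "中國" is treated as a utf8 character string, so its length is 2. With the utf8 flag off, "中國" is treated as octets (an array of bytes), so its length is 6. (My editor saves files as utf8; if yours saves them as gb2312, the length should be 4.) Note that the Encode documentation marks both functions as internal: they only flip the flag and do not convert any data.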
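The same two lengths can be obtained without touching the flag directly, by converting between octets and characters with decode and encode. The following is only a minimal sketch: the octets are written as escapes (an assumption made here so the result does not depend on how the source file is saved), and the variable names are illustrative.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# The six UTF-8 octets of "中國", written as escapes so that the
# result does not depend on the encoding of this source file.
my $octets = "\xe4\xb8\xad\xe5\x9c\x8b";

# octets -> Perl characters
my $chars = decode("UTF-8", $octets);

my $char_len  = length($chars);                    # counts characters
my $octet_len = length(encode("UTF-8", $chars));   # counts octets

print "$char_len\n";    # 2
print "$octet_len\n";   # 6
```

Unlike _utf8_on, decode validates the input, so malformed octets are not silently accepted as characters.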
    # this source file is saved as gb2312
    use Encode;

    my $a = "china----中國";
    my $stra = decode("gb2312", $a);
    $stra =~ s/\W+//g;    # the Chinese characters count as \w, so only the dashes go
    print encode("gb2312", $stra), "\n";

Output:

    china中國
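As its name suggests, the encode function encodes a Perl string. It encodes the characters of the Perl string in the encoding you specify and returns the result as a stream of bytes, so you often need it whenever you deal with anything outside Perl's own processing environment. Its signature is simple:

    $octets = encode(ENCODING, $string [, CHECK])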
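The decode function does the reverse: it decodes a stream of bytes. It interprets the given bytes according to the encoding you specify and converts them into a Perl string (held internally as utf8). As a rule, text data read from a terminal or a file should be converted into a Perl string with decode.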
    use Encode;
    use Encode::CN;    # optional

    my $dat = "測試文本";
    my $str = decode("gb2312", $dat);
    my @chars = split //, $str;
    foreach my $char (@chars) {
        print encode("gb2312", $char), "\n";
    }
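Both encode and decode accept an optional CHECK argument that controls what happens when the data does not fit the encoding. The sketch below contrasts the default behaviour with Encode::FB_CROAK; the sample input and variable names are assumptions chosen for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

my $bad = "abc\xff";    # \xff never appears in well-formed UTF-8

# Default CHECK: the malformed octet is replaced with U+FFFD.
my $lenient = decode("UTF-8", $bad);
printf "%d %04X\n", length($lenient), ord(substr($lenient, -1));   # 4 FFFD

# FB_CROAK: decode dies instead of substituting.
# LEAVE_SRC keeps $bad unmodified when a CHECK value is given.
my $strict = eval {
    decode("UTF-8", $bad, Encode::FB_CROAK() | Encode::LEAVE_SRC());
};
my $failed = defined $strict ? 0 : 1;
print "strict decode failed\n" if $failed;
```

Dying early is usually the safer choice when the input is expected to be clean, since substituted characters are hard to trace later.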
1. Listing the available encodings
    use Encode;

    # Returns a list of canonical names of available encodings
    # that have already been loaded
    @list = Encode->encodings();

    # Get a list of all available encodings, including those
    # that have not yet been loaded
    @all_encodings = Encode->encodings(":all");

    # Give the name of a specific module
    @with_jp = Encode->encodings("Encode::JP");
    @ebcdic  = Encode->encodings("EBCDIC");

    print "@list\n";
    print "@all_encodings\n";
2. Terminology
character: A character in the range 0 .. 2**32-1 (or more); what Perl's strings are made of.
byte: A character in the range 0..255; a special case of a Perl character.
octet: 8 bits of data, with ordinal values 0..255; term for bytes passed to or from a non-Perl context, such as a disk file, standard I/O stream, database, command-line argument, environment variable, socket etc.
3. Perl Encoding API: encode, decode

    # convert a string from Perl's internal format into ISO-8859-1
    $octets = encode("iso-8859-1", $string);

    # convert ISO-8859-1 data into a string in Perl's internal format
    $string = decode("iso-8859-1", $octets);

4. Perl Encoding API: find_encoding
Returns the encoding object corresponding to ENCODING. Returns undef if no matching ENCODING is found. The returned object is what does the actual encoding or decoding.
    my $enc = find_encoding("iso-8859-1");
    while (<>) {
        my $utf8 = $enc->decode($_);
        ...    # now do something with $utf8
    }
find_encoding("latin1")->name; # iso-8859-1
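Because find_encoding returns undef for an unknown name, it is worth checking the result before calling methods on it. A minimal sketch follows; the sample data and the deliberately bogus encoding name are assumptions for illustration only.

```perl
use strict;
use warnings;
use Encode qw(find_encoding);

my $name = "latin1";
my $enc  = find_encoding($name)
    or die "unknown encoding: $name\n";

print $enc->name, "\n";    # iso-8859-1

# Reuse the same object for both directions.
my $in     = "caf\xe9";            # 4 ISO-8859-1 octets
my $string = $enc->decode($in);    # 4 characters, the last is U+00E9
my $octets = $enc->encode($string);

# A name that matches no encoding yields undef rather than an exception.
my $missing = find_encoding("no-such-encoding-xyz");
print "not found\n" unless defined $missing;
```

Looking the object up once and reusing it avoids repeating the name lookup for every line of input.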
5. Perl Encoding API: from_to
[$length =] from_to($octets, FROM_ENC, TO_ENC [, CHECK])
from_to($octets, "iso-8859-1", "cp1250");
    from_to($data, "iso-8859-1", "utf8");
    # equivalent to:
    # $data = encode("utf8", decode("iso-8859-1", $data));
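Note that from_to converts the octets in place and returns the new length rather than the converted data. A small sketch, with sample data assumed for illustration (it uses the strict "UTF-8" name instead of "utf8"):

```perl
use strict;
use warnings;
use Encode qw(from_to);

my $data = "caf\xe9";    # "café" as 4 ISO-8859-1 octets

# $data itself is overwritten; the return value is the length
# of the converted data in octets.
my $len = from_to($data, "iso-8859-1", "UTF-8");

print "$len\n";                                  # 5
print join(" ", unpack("(H2)*", $data)), "\n";   # 63 61 66 c3 a9
```

Since the first argument is modified, pass a copy if the original octets are still needed afterwards.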