A First Look at the JDK Source: The Default Charset

Date: 2019-11-13


This article was first published on my WeChat public account, <andyqian>. I look forward to your follow!

Preface

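Today we take the getBytes() method of the String class as an example and look at how the JDK source handles the default charset. The documentation describes getBytes() like this:

Encodes this String into a sequence of bytes using the platform's default charset, storing the result into a new byte array.

In other words, the bytes you get back depend on the platform's default charset rather than on a fixed encoding.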
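To make the effect of the charset concrete, here is a minimal sketch that encodes the same one-character string with explicitly named charsets. The class name GetBytesDemo and the helper encodedLength are made up for illustration; the character is written as a Unicode escape so the result does not depend on the encoding of the source file.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class GetBytesDemo {

    // Hypothetical helper: number of bytes produced when s is encoded with cs.
    static int encodedLength(String s, Charset cs) {
        return s.getBytes(cs).length;
    }

    public static void main(String[] args) {
        String s = "\u4e2d"; // the single character U+4E2D

        // No-argument form: the result depends on the platform default charset.
        System.out.println("default (" + Charset.defaultCharset() + "): "
                + s.getBytes().length + " byte(s)");

        // Explicit charsets: the result is the same on every platform.
        System.out.println("UTF-8:    " + encodedLength(s, StandardCharsets.UTF_8));    // 3
        System.out.println("UTF-16BE: " + encodedLength(s, StandardCharsets.UTF_16BE)); // 2
    }
}
```

The first line of output is the one that can change from machine to machine; the other two cannot.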

So let's get to the bottom of what the default charset actually is on different platforms.

A first look at the source

First, let's step into the source of getBytes():

public byte[] getBytes() {
return StringCoding.encode(value, 0, value.length);
 }

The method simply returns StringCoding.encode(value, 0, value.length), so let's step into that next. The code is as follows:

static byte[] encode(char[] ca, int off, int len) {
     String csn = Charset.defaultCharset().name();
     try {
         // use charset name encode() variant which provides caching.
         return encode(csn, ca, off, len);
     } catch (UnsupportedEncodingException x) {
         warnUnsupportedCharset(csn);
     }
     try {
         return encode("ISO-8859-1", ca, off, len);
     } catch (UnsupportedEncodingException x) {
         // If this code is hit during VM initialization, MessageUtils is
         // the only way we will be able to get any kind of error message.
         MessageUtils.err("ISO-8859-1 charset not available: "
                          + x.toString());
         // If we can not find ISO-8859-1 (a required encoding) then things
         // are seriously wrong with the installation.
         System.exit(1);
         return null;
     }
 }
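The shape of encode() above (try the named charset first, then fall back to ISO-8859-1) can be sketched with public API only. This is an illustration of the pattern, not the JDK's internal code: the class FallbackEncode and the helper encodeWithFallback are hypothetical names, and the sketch leaves out the warning and the System.exit(1) path.

```java
import java.io.UnsupportedEncodingException;
import java.nio.charset.StandardCharsets;

class FallbackEncode {

    // Hypothetical helper: encode with the named charset, falling back to
    // ISO-8859-1 when the JVM does not support that name.
    static byte[] encodeWithFallback(String s, String charsetName) {
        try {
            return s.getBytes(charsetName);
        } catch (UnsupportedEncodingException e) {
            // ISO-8859-1 is one of the charsets every Java platform must support.
            return s.getBytes(StandardCharsets.ISO_8859_1);
        }
    }

    public static void main(String[] args) {
        // A supported name is used directly.
        System.out.println(encodeWithFallback("abc", "UTF-8").length);
        // An unsupported name triggers the ISO-8859-1 fallback.
        System.out.println(encodeWithFallback("abc", "no-such-charset").length);
    }
}
```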

As we can see, encode() obtains the system default charset through Charset.defaultCharset().name(). Let's step into defaultCharset() as well. The code is as follows:

public static Charset defaultCharset() {
     if (defaultCharset == null) {
         synchronized (Charset.class) {
             String csn = AccessController.doPrivileged(
                 new GetPropertyAction("file.encoding"));
             Charset cs = lookup(csn);
             if (cs != null)
                 defaultCharset = cs;
             else
                 defaultCharset = forName("UTF-8");
         }
     }
     return defaultCharset;
 }

In defaultCharset(), the line we care about most is this one:

String csn = AccessController.doPrivileged(
                 new GetPropertyAction("file.encoding"));
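The lookup-then-fallback logic of defaultCharset() can be approximated outside the JDK as a sketch. The class DefaultCharsetSketch and the helper resolve are hypothetical, and Charset.isSupported plus Charset.forName stand in for the package-private lookup call; the real method also caches its result, which this sketch does not.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;

class DefaultCharsetSketch {

    // Hypothetical helper: resolve a charset name, falling back to UTF-8 when
    // the name is missing, malformed, or not supported by this JVM.
    static Charset resolve(String name) {
        try {
            if (name != null && Charset.isSupported(name)) {
                return Charset.forName(name);
            }
        } catch (IllegalCharsetNameException e) {
            // Syntactically invalid name: treat it like "not found".
        }
        return StandardCharsets.UTF_8;
    }

    public static void main(String[] args) {
        String prop = System.getProperty("file.encoding");
        System.out.println("file.encoding    = " + prop);
        System.out.println("resolved         = " + resolve(prop));
        System.out.println("defaultCharset() = " + Charset.defaultCharset());
    }
}
```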

Stepping into GetPropertyAction, we find the following:

public class GetPropertyAction implements PrivilegedAction<String> {
 private String theProp;
 private String defaultVal;

 public GetPropertyAction(String var1) {
     this.theProp = var1;
 }

 public GetPropertyAction(String var1, String var2) {
     this.theProp = var1;
     this.defaultVal = var2;
 }

 public String run() {
     String var1 = System.getProperty(this.theProp);
     return var1 == null ? this.defaultVal : var1;
 }
}

Here we finally see some familiar code:

System.getProperty(this.theProp);

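With that, we can run experiments on different platforms.
The platforms used in this experiment are:

Linux
System: Ubuntu 14.04 LTS (Chinese environment)
Windows
System: Windows 7 (Chinese environment)

Default charset experiments on different platforms

JDK version tested: java version "1.7.0_79"
Steps:
Windows encoding: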

D:\testpdfPath>javac Test.java

D:\testpdfPath>java Test
GBK

Linux encoding, with the session encoding set to UTF-8 (the default):

[andy@andyqian  /tmp]
$ javac Test.java

[andy@andyqian  /tmp]
$ java Test
UTF-8


Linux encoding, with the session encoding set to GBK:

[andy@andyqian  /tmp]
$ ls
Test.java

[andy@andyqian /tmp]
$ javac Test.java

[andy@andyqian  /tmp]
$ java Test
GBK

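The experiments above show:
In a Chinese Windows environment, the default encoding is GBK.
In a Chinese Linux environment, the default encoding is UTF-8, and it follows the session: with the session encoding switched to GBK, the result is GBK.

So file.encoding behaves differently from system to system. At this point we have traced the default charset all the way through the source of getBytes().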
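One practical consequence is that bytes written under one default charset and read back under another no longer round-trip. The sketch below simulates that with explicit charsets, so it behaves the same everywhere; the class RoundTrip and the helper recode are hypothetical names used only for this illustration.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class RoundTrip {

    // Hypothetical helper: encode with one charset, then decode with another.
    static String recode(String s, Charset encodeWith, Charset decodeWith) {
        return new String(s.getBytes(encodeWith), decodeWith);
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587"; // two CJK characters, written as escapes

        String same = recode(original, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        String garbled = recode(original, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);

        System.out.println(original.equals(same));    // true
        System.out.println(original.equals(garbled)); // false
        System.out.println(garbled.length());         // 6: each UTF-8 byte became one char
    }
}
```

Passing the charset explicitly to both getBytes and the String constructor avoids depending on whatever file.encoding happens to be.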

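Experiment code:

The code used in the experiments above is very simple and is shown below. If you would like to try it yourself, create a new Java class named Test.java and paste the code into it.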

/**
 * author: andy
 * date: 17-11-24
 * blog: www.andyqian.com
 * version: 0.0.1
 * description:
 */
public class Test {

    public static void main(String[] args){
        System.out.println(System.getProperty("file.encoding"));
    }
}

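One thing to note: if you paste this straight into IDEA, the result may be influenced by the IDE. That is why I compiled and ran it directly with the plain command-line tools.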
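For comparing runs (for example a terminal run against an IDE run), a slightly extended variant of the test class may be handy. It is only a sketch: the class EncodingInfo and the method describe are hypothetical names, and it reports both the raw property and the charset the JVM actually resolved.

```java
import java.nio.charset.Charset;

class EncodingInfo {

    // Hypothetical helper: one-line summary of the property and the resolved charset.
    static String describe() {
        return "file.encoding=" + System.getProperty("file.encoding")
                + ", defaultCharset=" + Charset.defaultCharset().name();
    }

    public static void main(String[] args) {
        System.out.println(describe());
        // Byte count of U+4E2D under the default charset (differs between GBK and UTF-8).
        System.out.println("\u4e2d".getBytes().length + " byte(s) for U+4E2D");
    }
}
```

Launching it as `java -Dfile.encoding=GBK EncodingInfo` is one way to see how a launcher-supplied setting can change the result on a JDK 7 style runtime. As far as I know, JDK 18 and later default to UTF-8 and treat this switch differently, so the outcome depends on the JDK version.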

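Encoding over SSH:

As an aside, the encoding of the local machine running SSH can affect the encoding of the current session on the remote machine. What does that mean? Let's continue experimenting.

Machines needed:
One machine whose encoding is en_US.UTF-8.
One machine whose encoding is zh_CN.GBK.

(Note: two machines with the same encoding also work; just change the encoding on one of them.)

First, we connect to the remote machine (UTF-8) directly with Xshell and check the system encoding with the locale command: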

[andy@andyqian  /tmp/test]
$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

Note that the system encoding at this point is UTF-8.

Next, we check the encoding of the local machine (GBK):

[andy@andyqian02  /home/andyqian]
$ locale
LANG=zh_CN.GBK
LC_CTYPE="zh_CN.GBK"
LC_NUMERIC="zh_CN.GBK"
LC_TIME="zh_CN.GBK"
LC_COLLATE="zh_CN.GBK"
LC_MONETARY="zh_CN.GBK"
LC_MESSAGES="zh_CN.GBK"
LC_PAPER="zh_CN.GBK"
LC_NAME="zh_CN.GBK"
LC_ADDRESS="zh_CN.GBK"
LC_TELEPHONE="zh_CN.GBK"
LC_MEASUREMENT="zh_CN.GBK"
LC_IDENTIFICATION="zh_CN.GBK"
LC_ALL=

Then, from the local host (GBK), we ssh into the remote host (UTF-8) and check the system encoding with the locale command:

[andy@andyqian02  /home/andyqian]
$ ssh andyqian@192.168.1.1 
andyqian@192.168.1.1's password:

After logging in, check the encoding:

[andy@andyqian01  /home/andyqian]
$ locale
LANG=zh_CN.GBK
LC_CTYPE="zh_CN.GBK"
LC_NUMERIC="zh_CN.GBK"
LC_TIME="zh_CN.GBK"
LC_COLLATE="zh_CN.GBK"
LC_MONETARY="zh_CN.GBK"
LC_MESSAGES="zh_CN.GBK"
LC_PAPER="zh_CN.GBK"
LC_NAME="zh_CN.GBK"
LC_ADDRESS="zh_CN.GBK"
LC_TELEPHONE="zh_CN.GBK"
LC_MEASUREMENT="zh_CN.GBK"
LC_IDENTIFICATION="zh_CN.GBK"
LC_ALL=

(Note: replace 192.168.1.1 with your own host address.)

Note that the encoding of the current session on the remote host has now become

zh_CN.GBK
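To connect the locale output above with what a JVM started in that session would see, here is a small sketch that reads the session's locale variables and extracts the codeset part. The class SessionLocale and both helpers are hypothetical. It assumes the usual POSIX naming scheme language[_territory][.codeset][@modifier] and the usual precedence LC_ALL, then LC_CTYPE, then LANG; how the JVM itself derives file.encoding from these is not shown here.

```java
import java.util.Map;

class SessionLocale {

    // Hypothetical helper: codeset part of a POSIX-style locale name, or null if absent.
    static String codesetOf(String locale) {
        if (locale == null) return null;
        int dot = locale.indexOf('.');
        if (dot < 0) return null;
        int at = locale.indexOf('@', dot);
        return at < 0 ? locale.substring(dot + 1) : locale.substring(dot + 1, at);
    }

    // Hypothetical helper: first non-empty value among LC_ALL, LC_CTYPE, LANG.
    static String effectiveLocale(Map<String, String> env) {
        for (String key : new String[] {"LC_ALL", "LC_CTYPE", "LANG"}) {
            String v = env.get(key);
            if (v != null && !v.isEmpty()) return v;
        }
        return null;
    }

    public static void main(String[] args) {
        String loc = effectiveLocale(System.getenv());
        System.out.println("effective locale = " + loc);
        System.out.println("codeset          = " + codesetOf(loc));
        System.out.println("file.encoding    = " + System.getProperty("file.encoding"));
    }
}
```

Running it once in a direct login and once in the SSH session from the GBK machine should make the difference between the two sessions visible.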

At this point, using: