Garbled text in Storm/HBase caused by the JVM file.encoding property

Date: 2019-11-12

1. Problem

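Today I added a new computation bolt to our Storm topology. The deployment looked fine, but I then found that a different, pre-existing bolt was producing garbled text: Chinese strings it inserted into HBase came back unreadable when queried. The input strings are URL-encoded (percent-encoded) UTF-8; I decode them with URLDecoder.decode(str, "UTF-8") and then insert the result into HBase.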

2. Investigation

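(1) HBase transfers data as UTF-8 throughout, so it should not be the source of the problem. I ruled out the HBase side.

(2) The text was fine in testing but garbled in production, which pointed to the JVM environment on the production machine.

(3) Assuming the JVM environment is at fault: URLDecoder.decode(str, "UTF-8") decodes as UTF-8, so a UTF-8 encoded str cannot come out garbled from that call. The string must therefore have been converted with a non-UTF-8 charset somewhere inside the JVM.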

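(4) Check the JVM default encoding on that production machine.

First I printed the Linux environment variables with echo $LANG, echo $LC_ALL and so on. They all consistently reported UTF-8, which ruled out the OS environment.

Then I turned to the Java environment and printed the JVM default encoding with System.getProperty("file.encoding"). The result was ISO-8859-1.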
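The check in step (4) can be packaged as a tiny program to run on each suspect machine. This is only a sketch: the class name DefaultEncodingCheck is made up for illustration, and it prints both the raw property and the charset the JVM actually uses, since the two can differ.

```java
import java.nio.charset.Charset;

// Hypothetical helper class; the name is illustrative and not from the original post.
class DefaultEncodingCheck {

    /** Returns the raw value of the file.encoding system property (may be null). */
    static String fileEncodingProperty() {
        return System.getProperty("file.encoding");
    }

    /** Returns the charset the JVM uses whenever no charset is specified. */
    static Charset effectiveDefault() {
        return Charset.defaultCharset();
    }

    public static void main(String[] args) {
        System.out.println("file.encoding    = " + fileEncodingProperty());
        System.out.println("defaultCharset() = " + effectiveDefault());
    }
}
```

Running it with the same startup arguments as the worker process matters, because a -Dfile.encoding flag on one command line does not apply to another.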

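That gives us the cause: the JVM property file.encoding on that production machine differed from the other machines, and that mismatch garbled the Chinese text.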
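To see how such a mismatch can destroy Chinese text, here is a minimal sketch. It assumes (the post does not show the exact call) that somewhere a String was turned into bytes with the default charset; passing ISO_8859_1 explicitly simulates a JVM whose default is ISO-8859-1. The class and method names are hypothetical, and StandardCharsets needs Java 7 or later.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; names are illustrative only.
class MojibakeDemo {

    /** Decodes a percent-encoded UTF-8 string, as the bolt in the post does. */
    static String decode(String urlEncoded) {
        try {
            return URLDecoder.decode(urlEncoded, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a required charset, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    /** Assumption: simulates String.getBytes() on a JVM whose default charset is ISO-8859-1. */
    static byte[] encodeAsLatin1(String s) {
        return s.getBytes(StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        String text = decode("%E4%B8%AD%E6%96%87"); // two Chinese characters, U+4E2D U+6587
        byte[] lossy = encodeAsLatin1(text);
        // ISO-8859-1 cannot represent these characters, so each becomes the byte for '?'.
        System.out.println(lossy.length);                                   // 2
        System.out.println(new String(lossy, StandardCharsets.ISO_8859_1)); // ??
    }
}
```

The decode step itself is correct; the damage happens at the later conversion, and it is irreversible because the original characters are replaced rather than re-encoded.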

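3. Solution

With the cause known, the fix is simple: change the value of the JVM file.encoding property.

This property is a JVM startup parameter and cannot be changed at run time. You can think of it as a global setting that is read once and cached; changing it while the JVM is running could crash the programs inside that JVM.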

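(1) Temporary fix

Add -Dfile.encoding="UTF-8" to the JVM startup arguments.

(2) Permanent fix

Change the system charset. On Linux the charset is configured in /etc/sysconfig/i18n; these are the default settings on my machine: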

LANG="en_US"
SYSFONT="latarcyrheb-sun16"

There are two ways to change it: 1. set LANG="en_US.UTF-8" in /etc/sysconfig/i18n, which takes effect only after a reboot; 2. add export LANG=en_US.UTF-8 to /etc/profile and run source /etc/profile, which takes effect immediately.

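4. Open question

Why had this bolt always worked before? Storm decides where bolts run (from the user's point of view they are assigned to nodes more or less at random), and the machine this bolt previously ran on had its encoding set to en_US.UTF-8, so the problem never appeared.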

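5. Understanding the JVM -Dfile.encoding parameter in depth

After all of the above, some readers may still wonder: what is this JVM parameter for, and why have I never heard of it? Not having heard of it is normal; I hadn't either.

(1) How JVM encoding works

Inside the JVM, strings are held as Unicode. Conversion happens in two steps: (1) on the way in, bytes are converted to Unicode using the JVM default charset (or one you specify explicitly) and stored in memory; (2) on the way out, the Unicode string is converted to bytes in the charset the caller asks for. In Java's own terminology the first step is decoding and the second is encoding. As long as both ends use the same charset, no garbling occurs.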
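The two steps can be sketched as a round trip with explicit charsets. This is an illustration under stated assumptions, not code from the post: the class and method names are made up, and the text is written with \u escapes so the example does not depend on the source file's own encoding.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical class and method names, chosen only for this illustration.
class CharsetRoundTrip {

    /** Encodes s to bytes with encodeWith, then decodes those bytes with decodeWith. */
    static String roundTrip(String s, Charset encodeWith, Charset decodeWith) {
        byte[] bytes = s.getBytes(encodeWith);
        return new String(bytes, decodeWith);
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587"; // two Chinese characters
        String same = roundTrip(text, StandardCharsets.UTF_8, StandardCharsets.UTF_8);
        String broken = roundTrip(text, StandardCharsets.UTF_8, StandardCharsets.ISO_8859_1);

        System.out.println(text.equals(same));   // true: both ends agree
        System.out.println(text.equals(broken)); // false: the ends disagree
        System.out.println(broken.length());     // 6: each UTF-8 byte became one character
    }
}
```

Passing the charset explicitly on both sides removes the dependency on file.encoding altogether, which is why it is usually the more robust choice.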

(2) Reading the source

Searching src.zip of JDK 1.6.0_20 for files that mention file.encoding turns up four:
(a) The most important one first, the java.nio.charset.Charset class:

public static Charset defaultCharset() {
    if (defaultCharset == null) {
        synchronized (Charset.class) {
            java.security.PrivilegedAction pa = new GetPropertyAction("file.encoding");
            String csn = (String) AccessController.doPrivileged(pa);
            Charset cs = lookup(csn);
            if (cs != null)
                defaultCharset = cs;
            else
                defaultCharset = forName("UTF-8");
        }
    }
    return defaultCharset;
}

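In Java, whenever no charset is specified, for example in new String(byte[] bytes), Charset.defaultCharset() is called. The code makes it clear that defaultCharset is meant to be initialized only once. There is a small flaw: the null check sits outside the synchronized block and is not repeated inside it, so concurrent callers can still run the initialization more than once. The later runs read the charset from the cache (inside lookup), so this does little harm.
By the time we change file.encoding through System.getProperties, defaultCharset has already been initialized, so the initialization code is not run again.
When the JVM starts, it loads classes and has already initialized defaultCharset before it calls main. Many methods rely on it: String.getBytes, InputStreamReader and OutputStreamWriter all call Charset.defaultCharset().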

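(b) A static method of java.net.URLEncoder. The affected method is java.net.URLEncoder.encode(String).

Take note of this one: colleagues have already been bitten by it. Use encode(String s, String enc) instead.

(c) The getMimeEncoding method of com.sun.org.apache.xml.internal.serializer.Encoding (from line 209).

(d) Finally, the static initializer of javax.print.DocFlavor.

So the system property file.encoding affects:
1. Charset.defaultCharset(): the most important encoding setting in the Java environment
2. URLEncoder.encode(String): the encoding call most often encountered in web code
3. com.sun.org.apache.xml.internal.serializer.Encoding: affects reading XML files that declare no encoding
4. javax.print.DocFlavor: affects the encoding used for printing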
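For the URLEncoder case in item 2, a minimal sketch of the recommended two-argument call follows. The wrapper class and method name are hypothetical; only URLEncoder.encode(String, String) itself is the JDK API.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical wrapper; the name is illustrative only.
class ExplicitUrlEncoding {

    /** Percent-encodes s as UTF-8 regardless of the JVM default charset. */
    static String encodeUtf8(String s) {
        try {
            return URLEncoder.encode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is a required charset, so this should not happen.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(encodeUtf8("\u4e2d\u6587")); // %E4%B8%AD%E6%96%87
        System.out.println(encodeUtf8("a b"));          // a+b
    }
}
```

The one-argument encode(String) is deprecated in the JDK precisely because its output depends on the platform default encoding.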

　　（3）Java's file.encoding property on Windows platform

This property determines Java's default encoding; all readers and writers use it unless told otherwise. Since Java 1.4.2, "file.encoding" is set from the default locale of the Windows operating system. System.getProperty("file.encoding") reads the property, and code such as System.setProperty("file.encoding", "UTF-8") can change its value. However, the default encoding itself does not change dynamically even though the property can be changed. The conclusion is that the default encoding cannot be changed after the JVM starts. java -Dfile.encoding=UTF-8 sets the default encoding when starting a JVM. I searched the official Java documentation for this option but could not find it.
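The claim that the property can change while the default charset does not can be checked with a short experiment. This is a sketch with hypothetical names; it restores the original property value afterwards so it does not disturb the rest of the program.

```java
import java.nio.charset.Charset;

// Hypothetical demo class; the name is illustrative only.
class RuntimeOverrideDemo {

    /** Returns true if Charset.defaultCharset() is unchanged after file.encoding is overwritten. */
    static boolean defaultSurvivesPropertyChange() {
        Charset before = Charset.defaultCharset(); // forces initialization if not already done
        String original = System.getProperty("file.encoding");
        // Pick a value that differs from the current default.
        String other = "UTF-8".equalsIgnoreCase(before.name()) ? "ISO-8859-1" : "UTF-8";
        try {
            System.setProperty("file.encoding", other);
            return Charset.defaultCharset().equals(before);
        } finally {
            // Restore the property so later code sees the original value.
            if (original != null) {
                System.setProperty("file.encoding", original);
            } else {
                System.clearProperty("file.encoding");
            }
        }
    }

    public static void main(String[] args) {
        System.out.println(defaultSurvivesPropertyChange()); // true
    }
}
```

Because the cached value wins, the only reliable places to set the default are the JVM command line or the environment the JVM starts in.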

6. References

(1) How much the system property file.encoding affects a running Java program

http://www.blogjava.net/ivanwan/archive/2011/01/31/343810.html

(2) Viewing and changing the system encoding on Linux

http://www.poluoluo.com/server/201401/258604.html

(3) The difference between en_US.UTF-8 and zh_CN.UTF-8

http://www.iteye.com/problems/90396