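A customer recently asked why Java's RandomAccessFile could not write Chinese text to a file correctly: everything came out garbled. After some analysis and testing, I found the cause and a fix. The main part of my reply is posted below to share with you: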
-------------------------------------------------------------------------------
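First, let me summarize the problem you described yesterday morning:
When using RandomAccessFile to write Chinese to the database,
* write(String.getBytes()) writes the text correctly;
* writeBytes(String), writeChars(String), and writeUTF(String) all produce garbled text.
If I have understood your problem correctly, then based on my analysis, if you use RandomAccessFile to access the database, the best way to write Chinese correctly is write(String.getBytes()). There are two main reasons: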
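1. When a Java program runs, two kinds of character encoding are in play: the native encoding and Unicode.
* When the operating system stores a text file, the text is held in a native encoding. This is why, when a browser or e-mail client shows garbled text, changing its encoding setting (for example to GB2312, GB18030, or BIG5) makes the text display correctly.
* Inside a Java program, strings are always represented in Unicode.
Java is Unicode at its core, and so are its class files. In Java code, the chars of a String are Unicode (UTF-16) code units, while the bytes returned by String.getBytes() are in the corresponding native encoding (the platform default charset, unless a charset is named explicitly).
Because RandomAccessFile deals with native files, a conversion between the native encoding and Unicode necessarily takes place.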
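To make the chars-versus-bytes distinction concrete, here is a minimal sketch. The class name EncodingDemo and the helper toHex are made up for illustration, and the sketch assumes the running JDK ships the GB2312 charset. The character is written as the escape \u4e2d so that the encoding of the source file itself does not matter.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class EncodingDemo {
    // Format bytes as upper-case hex pairs, masking off the sign extension.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u4e2d"; // the character 中
        Charset gb2312 = Charset.forName("GB2312"); // assumption: charset is available

        // The char inside the String is a Unicode (UTF-16) code unit.
        System.out.println(Integer.toHexString(s.charAt(0)));          // 4e2d
        // The bytes depend entirely on the charset used for encoding.
        System.out.println(toHex(s.getBytes(gb2312)));                 // D6 D0
        System.out.println(toHex(s.getBytes(StandardCharsets.UTF_8))); // E4 B8 AD

        byte[] gbBytes = s.getBytes(gb2312);
        // Decoding with the same charset restores the string ...
        System.out.println(new String(gbBytes, gb2312).equals(s));                 // true
        // ... decoding with a different charset does not.
        System.out.println(new String(gbBytes, StandardCharsets.UTF_8).equals(s)); // false
    }
}
```

The last two lines are the whole garbling problem in miniature: the bytes are fine, but they only make sense to a reader that assumes the same charset as the writer.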
2. How RandomAccessFile writes to a file.
The RandomAccessFile Javadoc defines each of the write methods differently.
* public void write(byte[] b):Writes b.length bytes from the specified byte array to this file, starting at the current file pointer.
* public final void writeBytes(String s) throws IOException
Writes the string to the file as a sequence of bytes. Each character in the string is written out, in sequence, by discarding its high eight bits. The write starts at the current position of the file pointer. (Note that the high eight bits of every character are discarded.)
* public final void writeChar(int v) throws IOException
Writes a char to the file as a two-byte value, high byte first. The write starts at the current position of the file pointer. (This is big-endian storage. Note that Windows, following the x86 architecture, defaults to little-endian.)
* public final void writeChars(String s) throws IOException
Writes a string to the file as a sequence of characters. Each character is written to the data output stream as if by the writeChar method. The write starts at the current position of the file pointer. (Note that writeChars writes each character the same way writeChar does.)
* public final void writeUTF(String str) throws IOException
Writes a string to the file using modified UTF-8 encoding in a machine-independent manner.
First, two bytes are written to the file, starting at the current file pointer, as if by the writeShort method giving the number of bytes to follow. This value is the number of bytes actually written out, not the length of the string. Following the length, each character of the string is output, in sequence, using the modified UTF-8 encoding for each character. (Note that writeUTF first writes two bytes giving the number of bytes that follow, and only then the modified UTF-8 encoding of the string.)
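The formats written by writeChars and writeUTF are meant to be read back by the matching read methods of the same class, readChar and readUTF. The sketch below illustrates that pairing and shows what writeBytes keeps of a character. The class ReadBackDemo, its methods, and the temporary file are made up for illustration.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;

class ReadBackDemo {
    // Writes s with writeChars and writeUTF, then reads both back.
    // Returns { text read via readChar, text read via readUTF }.
    static String[] roundTrip(String s) {
        try {
            File f = File.createTempFile("readback", ".dat"); // scratch file
            f.deleteOnExit();
            try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
                raf.writeChars(s); // two bytes per char, high byte first
                raf.writeUTF(s);   // two-byte length, then modified UTF-8
                raf.seek(0);       // rewind to the start of the file

                StringBuilder chars = new StringBuilder();
                for (int i = 0; i < s.length(); i++) {
                    chars.append(raf.readChar()); // inverse of writeChar
                }
                String utf = raf.readUTF();       // inverse of writeUTF
                return new String[] { chars.toString(), utf };
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // What writeBytes keeps of a char: only the low eight bits.
    static int lowByte(char c) {
        return c & 0xFF;
    }

    public static void main(String[] args) {
        String s = "\u4e2d"; // the character 中
        String[] back = roundTrip(s);
        System.out.println(back[0].equals(s)); // true
        System.out.println(back[1].equals(s)); // true
        System.out.println(Integer.toHexString(lowByte(s.charAt(0)))); // 2d
    }
}
```

So writeChars and writeUTF lose nothing as long as a Java program reads the data back; the output only looks garbled to tools that expect a native encoding. writeBytes, by contrast, really does lose the high byte of every character above U+00FF.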
Below is a test program I wrote, for your reference.
--------------------------------------------------------------------------------
/*
* RandomAFTest.java
*
* Created on May 8, 2005, 3:38 PM
*
*/
import java.io.*;
/**
*
* @author Paul
*/
public class RandomAFTest {
    // Convert the string to bytes in the given charset and print them in hex
    public static void printBytes(String str, String charsetName) {
        try {
            byte strBytes[] = str.getBytes(charsetName);
            String strBytesContent = "";
            for (int i = 0; i < strBytes.length; i++) {
                // A negative byte is sign-extended to int, so it prints as ffffffxx
                strBytesContent = strBytesContent.concat(
                        Integer.toHexString(strBytes[i]) + ",");
            }
            System.out.println("The Bytes of String " + str +
                    " within charset " + charsetName + " are: " +
                    strBytesContent);
        } catch (UnsupportedEncodingException e) {
            // Report an unknown charset name instead of silently ignoring it
            e.printStackTrace();
        }
    }
    // Print the chars (UTF-16 code units) of the string in hex
    public static void printChars(String str) {
        int strlen = str.length();
        char strChars[] = new char[strlen];
        str.getChars(0, strlen, strChars, 0);
        String strCharsContent = "";
        for (int i = 0; i < strlen; i++) {
            strCharsContent = strCharsContent.concat(
                    Integer.toHexString(strChars[i]) + ",");
        }
        System.out.println("The chars of String " + str + " are: " +
                strCharsContent);
    }
    public static void main(String args[]) {
        try {
            // One output file per write method under test
            RandomAccessFile rfWrite =
                    new RandomAccessFile("c:\\testWrite.dat", "rw");
            RandomAccessFile rfWriteBytes =
                    new RandomAccessFile("c:\\testWriteBytes.dat", "rw");
            RandomAccessFile rfWriteChars =
                    new RandomAccessFile("c:\\testWriteChars.dat", "rw");
            RandomAccessFile rfWriteUTF =
                    new RandomAccessFile("c:\\testWriteUTF.dat", "rw");
            String chStr = "中";
            // Print the string's bytes in GB2312
            printBytes(chStr, "GB2312");
            // Print the string's bytes in UTF-8
            printBytes(chStr, "UTF-8");
            // Print the string's Unicode chars
            printChars(chStr);
            try {
                // Encode with the platform default charset, then write the raw bytes
                rfWrite.write(chStr.getBytes());
                rfWrite.close();
                System.out.println("Done write!");
                // Writes only the low eight bits of each char
                rfWriteBytes.writeBytes(chStr);
                rfWriteBytes.close();
                System.out.println("Done writeBytes!");
                // Writes each char as two bytes, high byte first
                rfWriteChars.writeChars(chStr);
                rfWriteChars.close();
                System.out.println("Done writeChars!");
                // Writes a two-byte length followed by modified UTF-8
                rfWriteUTF.writeUTF(chStr);
                rfWriteUTF.close();
                System.out.println("Done writeUTF!");
            } catch (IOException e) {
                e.printStackTrace();
            }
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
    }
}
---------------------------------------------------------------------------
Here is part of the program's output:
The Bytes of String 中 within charset GB2312 are: ffffffd6,ffffffd0,
The Bytes of String 中 within charset UTF-8 are: ffffffe4,ffffffb8,ffffffad,
The chars of String 中 are: 4e2d,
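We can see that for "中":
* the GB2312 encoding is D6 D0
* the UTF-8 encoding is E4 B8 AD
* the Unicode encoding is 4E 2D (code point U+4E2D)
(The leading ffffff on each byte in the output is sign extension: a negative byte is widened to an int before Integer.toHexString formats it.)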
So what do the files actually contain? Here are the contents of each file in hexadecimal:
File testWrite.dat:
D6 D0
File testWriteBytes.dat:
2D
File testWriteChars.dat:
4E 2D
File testWriteUTF.dat:
00 03 E4 B8 AD
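Combining points 1 and 2 above, it is easy to see that:
1. String.getBytes() returns the string's bytes in the system's default encoding, and RandomAccessFile.write(byte[]) writes that byte array unchanged. Since what gets written is the Windows platform's native encoding, the file can be read correctly.
2. RandomAccessFile.writeBytes(String) drops the high eight bits of each character (which is, of course, Unicode-encoded) and writes only the low byte to the file.
3. RandomAccessFile.writeChars(String) writes each character's Unicode value in big-endian order. Windows defaults to little-endian for Unicode files, so WordPad shows garbled text. If we instead open the file (testWriteChars.dat) in a browser and set the encoding to Unicode Big-endian, the character "中" displays properly.
4. RandomAccessFile.writeUTF(String) first writes 00 03, meaning that three bytes follow, and then writes the UTF-8 encoding of "中": E4 B8 AD.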
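The byte-order point in item 3 can be checked without any file at all. The following sketch (the class EndianDemo and helper toHex are made up for illustration) encodes the character in both UTF-16 byte orders, then decodes the bytes 4E 2D both ways.

```java
import java.nio.charset.StandardCharsets;

class EndianDemo {
    // Format bytes as upper-case hex pairs.
    static String toHex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String s = "\u4e2d"; // the character 中

        // Same character, two byte orders.
        System.out.println(toHex(s.getBytes(StandardCharsets.UTF_16BE))); // 4E 2D
        System.out.println(toHex(s.getBytes(StandardCharsets.UTF_16LE))); // 2D 4E

        // The two bytes that writeChars stores for this character.
        byte[] written = { 0x4E, 0x2D };
        // Read as big-endian, they decode back to the original character.
        System.out.println(new String(written, StandardCharsets.UTF_16BE).equals(s)); // true
        // Read as little-endian, they decode to U+2D4E, a different character.
        System.out.println(new String(written, StandardCharsets.UTF_16LE).equals(s)); // false
    }
}
```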
Based on the analysis above, my recommendation for writing Chinese with RandomAccessFile is to use RandomAccessFile.write(String.getBytes()). To be safe, you can also name the platform's native encoding explicitly, for example: RandomAccessFile.write(String.getBytes("gb2312"))
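A minimal sketch of this recommendation follows, with the charset named on both the write and the read side. The class ExplicitCharsetWrite, the method writeAndRead, and the temporary file are made up for illustration, and the sketch assumes the JDK in use supports the GB2312 charset.

```java
import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;

class ExplicitCharsetWrite {
    // Encodes text with the named charset, writes the bytes,
    // then reads them back and decodes with the same charset.
    static String writeAndRead(String text, String charsetName) {
        Charset cs = Charset.forName(charsetName);
        try {
            File f = File.createTempFile("explicit", ".dat"); // scratch file
            f.deleteOnExit();
            try (RandomAccessFile raf = new RandomAccessFile(f, "rw")) {
                raf.write(text.getBytes(cs)); // explicit charset, not the default

                byte[] stored = new byte[(int) raf.length()];
                raf.seek(0);           // rewind before reading
                raf.readFully(stored); // read exactly what was written
                return new String(stored, cs);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        String text = "\u4e2d\u6587"; // the two characters 中文
        System.out.println(writeAndRead(text, "GB2312").equals(text)); // true
        System.out.println(writeAndRead(text, "UTF-8").equals(text));  // true
    }
}
```

Naming the charset makes the result independent of the machine the program runs on. Any program that later reads the file has to use that same charset.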