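A while ago, while learning Lucene, I ran into encoding errors when reading txt documents. I looked at several solutions. Most of them dump the file as hexadecimal (you can view it with Ctrl+H in UltraEdit) and read the marker bytes at the start of the file, at most four of them, to decide the encoding. Some text files still could not be identified this way (in my case, some UTF-8 encoded files, presumably ones saved without such a marker), and then I found JCharDet. JCharDet is a Java implementation of the charset detection algorithm from Mozilla (the Firefox people). I won't go into the details here; the official site has them, so have a look for yourself.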
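For reference, the marker-byte approach described above can be sketched roughly as follows. This is only an illustration, not code from any of the solutions I tried: `BomSniffer` and its methods are hypothetical names, and the sketch assumes the markers in question are the standard Unicode byte-order marks (BOM). It also shows why the approach fails for files that carry no BOM at all.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper, not part of JCharDet: guesses a charset from a byte-order mark. */
final class BomSniffer {

    /** Returns the charset name implied by a BOM in the first {@code len} bytes, or null if there is none. */
    static String fromBom(byte[] head, int len) {
        // UTF-32 marks are tested first because FF FE 00 00 also begins with the UTF-16LE mark FF FE.
        if (startsWith(head, len, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(head, len, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, len, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, len, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, len, 0xFF, 0xFE)) return "UTF-16LE";
        return null; // no BOM: the file may still be UTF-8, GBK, and so on
    }

    /** Reads up to four bytes from the file and checks them for a BOM. */
    static String fromBom(Path file) throws IOException {
        byte[] head = new byte[4];
        int len = 0;
        try (InputStream in = Files.newInputStream(file)) {
            while (len < head.length) {
                int n = in.read(head, len, head.length - len);
                if (n < 0) break; // end of file reached before four bytes were read
                len += n;
            }
        }
        return fromBom(head, len);
    }

    /** True if the first bytes of {@code data} equal {@code mark} (compared as unsigned values). */
    private static boolean startsWith(byte[] data, int len, int... mark) {
        if (len < mark.length) return false;
        for (int i = 0; i < mark.length; i++) {
            if ((data[i] & 0xFF) != mark[i]) return false;
        }
        return true;
    }
}
```

A `null` result here means "no marker found", not "not Unicode", which is the gap a statistical detector such as JCharDet is meant to fill.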
Here is the code:
package com.zhyea.util;

import java.io.BufferedInputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.IOException;

import org.mozilla.intl.chardet.nsDetector;
import org.mozilla.intl.chardet.nsICharsetDetectionObserver;

/**
 * Detects the character set of a file with the help of JCharDet.
 * The result is kept in static fields, so this class is not thread-safe.
 *
 * @author robin
 */
public class FileCharsetDetector {

    /** Name of the detected character set. */
    private static String encoding;

    /** Whether a character set has been detected. */
    private static boolean found;

    private static nsICharsetDetectionObserver observer;

    /**
     * Language hints that can be passed to the detector.
     *
     * @author robin
     */
    enum Language {
        Japanese(1), Chinese(2), SimplifiedChinese(3), TraditionalChinese(4), Korean(5), DontKnow(6);

        private int hint;

        Language(int hint) {
            this.hint = hint;
        }

        public int getHint() {
            return this.hint;
        }
    }

    /**
     * Checks the encoding of the given file.
     *
     * @param file
     *            the File instance
     * @return the file encoding, or null if none was found
     * @throws FileNotFoundException
     * @throws IOException
     */
    public static String checkEncoding(File file) throws FileNotFoundException, IOException {
        return checkEncoding(file, getNsdetector());
    }

    /**
     * Gets the encoding of a file, using a language hint.
     *
     * @param file
     *            the File instance
     * @param lang
     *            the language hint
     * @return the file encoding, or null if none was found
     * @throws FileNotFoundException
     * @throws IOException
     */
    public static String checkEncoding(File file, Language lang) throws FileNotFoundException, IOException {
        return checkEncoding(file, new nsDetector(lang.getHint()));
    }

    /**
     * Gets the encoding of a file.
     *
     * @param path
     *            the file path
     * @return the file encoding, e.g. UTF-8, GBK or GB2312, or null if none was found
     * @throws FileNotFoundException
     * @throws IOException
     */
    public static String checkEncoding(String path) throws FileNotFoundException, IOException {
        return checkEncoding(new File(path));
    }

    /**
     * Gets the encoding of a file, using a language hint.
     *
     * @param path
     *            the file path
     * @param lang
     *            the language hint
     * @return the file encoding, or null if none was found
     * @throws FileNotFoundException
     * @throws IOException
     */
    public static String checkEncoding(String path, Language lang) throws FileNotFoundException, IOException {
        return checkEncoding(new File(path), lang);
    }

    /**
     * Gets the encoding of a file with the given detector.
     *
     * @param file
     *            the file to check
     * @param detector
     *            the detector to feed the file to
     * @return the file encoding, or null if none was found
     * @throws FileNotFoundException
     * @throws IOException
     */
    private static String checkEncoding(File file, nsDetector detector) throws FileNotFoundException, IOException {
        // Clear the result of any earlier call; otherwise a stale value would be returned.
        found = false;
        encoding = null;
        detector.Init(getCharsetDetectionObserver());
        if (isAscii(file, detector)) {
            encoding = "ASCII";
            found = true;
        }
        if (!found) {
            // No definite answer was reported, so fall back to the most probable candidate.
            String prob[] = detector.getProbableCharsets();
            if (prob.length > 0) {
                encoding = prob[0];
            } else {
                return null;
            }
        }
        return encoding;
    }

    /**
     * Feeds the file to the detector and checks whether its content is pure ASCII.
     *
     * @param file
     *            the file whose encoding is checked
     * @param detector
     *            the detector to feed the file to
     * @return true if every byte of the file is ASCII
     * @throws IOException
     */
    private static boolean isAscii(File file, nsDetector detector) throws IOException {
        try (BufferedInputStream input = new BufferedInputStream(new FileInputStream(file))) {
            byte[] buffer = new byte[1024];
            int hasRead;
            boolean done = false;
            boolean isAscii = true;
            while ((hasRead = input.read(buffer)) != -1) {
                if (isAscii)
                    isAscii = detector.isAscii(buffer, hasRead);
                // Once non-ASCII bytes have shown up, keep feeding the detector until it is done.
                if (!isAscii && !done)
                    done = detector.DoIt(buffer, hasRead, false);
            }
            return isAscii;
        } finally {
            // Tell the detector that there is no more input.
            detector.DataEnd();
        }
    }

    /**
     * Creates an nsDetector. A detector keeps state from the data it has already seen,
     * so each check gets a fresh instance instead of a shared one.
     *
     * @return a new detector without a language hint
     */
    private static nsDetector getNsdetector() {
        return new nsDetector();
    }

    /**
     * Lazily creates the shared nsICharsetDetectionObserver.
     *
     * @return the observer that records the reported charset
     */
    private static nsICharsetDetectionObserver getCharsetDetectionObserver() {
        if (null == observer) {
            observer = new nsICharsetDetectionObserver() {
                public void Notify(String charset) {
                    found = true;
                    encoding = charset;
                }
            };
        }
        return observer;
    }
}
There is still one problem: for files saved with the "Unicode" encoding, it returns windows-1252, and I get an error when I use windows-1252 as the encoding.
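One possible guard against a wrong guess like this is to check a detected name before trusting it. The sketch below is an assumption on my part and not something JCharDet provides: `CharsetCheck` and `decodesCleanly` are hypothetical names, and only the standard `java.nio.charset` classes are used. It rejects names the JVM does not support and byte sequences the charset cannot decode strictly. A single-byte charset such as windows-1252 accepts almost any input, so a clean decode is a sanity check, not proof that the guess is right.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CodingErrorAction;

/** Hypothetical helper, not part of JCharDet: sanity-checks a detected charset name. */
final class CharsetCheck {

    /** Returns true if the JVM supports the charset and the bytes decode without errors. */
    static boolean decodesCleanly(byte[] data, String charsetName) {
        final Charset charset;
        try {
            charset = Charset.forName(charsetName);
        } catch (IllegalArgumentException e) {
            // Covers null, illegal and unsupported charset names.
            return false;
        }
        try {
            // REPORT makes the decoder throw instead of silently substituting U+FFFD.
            charset.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

If the check fails for the detector's first answer, the remaining entries of `getProbableCharsets()` could be tried in turn.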
By the way, here is another download location for the jar, since the official site is sometimes flaky and cannot be reached.
Download link: http://download.csdn.net/detail/tianxiexingyun/8286849
That's all.