Garbled Chinese when converting a TXT file to PDF with OpenOffice

Date: 2020-03-11

Tags: openoffice, txt, text, pdf, garbled Chinese


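Problem description:

While converting a TXT file to PDF with OpenOffice, the Chinese characters came out garbled.

Approach and process:

1. Find the cause of the garbled text

Reading the jodconverter source shows that only UTF-8 encoded text is converted without garbled Chinese.

2. How to convert a non-UTF-8 file to a UTF-8 file

Before converting, first determine the encoding of the TXT file itself. It turns out that a TXT file can start with a header (the BOM described at the end of this post).

The detection method is as follows: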

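import java.io.BufferedInputStream;
import java.io.FileInputStream;
import java.io.IOException;

/**
     * Returns the encoding of the file at the given path, guessed from its
     * first two bytes (the BOM). Files without a recognized BOM are treated
     * as GB2312.
     * @param filePath path of the file to inspect
     * @return name of the detected charset
     * @throws IOException if the file cannot be read
     */
    public static String getCharset(String filePath) throws IOException {
        int p;
        // try-with-resources closes the stream even if reading fails
        try (BufferedInputStream bin = new BufferedInputStream(
                new FileInputStream(filePath))) {
            // Combine the first two bytes into one value, e.g. EF BB -> 0xefbb
            p = (bin.read() << 8) + bin.read();
        }
        String code;

        switch (p) {
        case 0xefbb: // first two bytes of the UTF-8 BOM (EF BB BF)
            code = "UTF-8";
            break;
        case 0xfffe: // UTF-16 little-endian BOM
            code = "Unicode";
            break;
        case 0xfeff: // UTF-16 big-endian BOM
            code = "UTF-16";
            break;
        default:     // no recognized BOM
            code = "GB2312";
        }
        System.out.println(code);
        return code;
    }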
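
The method above falls back to GB2312 whenever no BOM is present, so a UTF-8 file saved without a BOM would be reported as GB2312. A minimal sketch of one possible extra check, assuming the whole file fits in memory: decode the bytes strictly as UTF-8 and treat a decoding failure as "not UTF-8". The names Utf8Check and looksLikeUtf8 are hypothetical and not part of the original code.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Hypothetical helper: strict UTF-8 validation for files that have no BOM. */
final class Utf8Check {

    /** Returns true when the bytes decode as UTF-8 without any malformed sequence. */
    static boolean looksLikeUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    // REPORT makes the decoder throw instead of silently substituting
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }
}
```

Pure ASCII text also passes this check, which is harmless here because ASCII bytes are identical in UTF-8 and GB2312.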

The conversion code is as follows:

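import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;
import java.nio.charset.UnsupportedCharsetException;

/**
     * Writes a text file in the given encoding, overwriting it if it exists.
     *
     * @param file
     *            file to write
     * @param toCharsetName
     *            target encoding
     * @param content
     *            file content
     * @throws Exception
     */
    public static void saveFile2Charset(File file, String toCharsetName,
            String content) throws Exception {
        if (!Charset.isSupported(toCharsetName)) {
            throw new UnsupportedCharsetException(toCharsetName);
        }
        // try-with-resources closes the writer and the underlying stream
        try (OutputStreamWriter outWrite = new OutputStreamWriter(
                new FileOutputStream(file), toCharsetName)) {
            outWrite.write(content);
        }
    }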

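Testing showed that the header of the converted file was still detected as GBK. Java's UTF-8 encoder does not write a BOM, so the header has to be written into the file manually.

The code is as follows: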

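import java.io.File;
import java.io.FileOutputStream;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

/**
     * Writes a text file in the given encoding, overwriting it if it exists.
     * A UTF-8 BOM is written first when the target encoding is UTF-8.
     *
     * @param file
     *            file to write
     * @param toCharsetName
     *            target encoding
     * @param content
     *            file content
     * @throws Exception
     */
    public static void saveFile2Charset(File file, String toCharsetName,
            String content) throws Exception {
        if (!Charset.isSupported(toCharsetName)) {
            throw new UnsupportedCharsetException(toCharsetName);
        }
        // Both resources are closed automatically; the writer is flushed first
        try (OutputStream outputStream = new FileOutputStream(file);
                OutputStreamWriter outWrite = new OutputStreamWriter(
                        outputStream, toCharsetName)) {
            // Add the file header marker (EF BB BF); it is only valid for UTF-8
            if (StandardCharsets.UTF_8.equals(Charset.forName(toCharsetName))) {
                outputStream.write(new byte[]{(byte)0xEF, (byte)0xBB, (byte)0xBF});
            }
            outWrite.write(content);
        }
    }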

Tested with source files in the following encodings:

GB2312
Unicode
UTF-16
UTF-8
All of them converted successfully.
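
The post does not show how the file content is read before it is passed to saveFile2Charset. The following is a minimal sketch of the whole round trip, assuming the source charset name comes from a detection step such as getCharset and that the file fits in memory. The class TxtToUtf8 and its convert method are hypothetical names. One detail worth handling: Java's UTF-8 decoder keeps a leading BOM as the character U+FEFF, so a source file that already has a UTF-8 BOM would otherwise end up with two.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

/** Hypothetical helper: re-encodes a text file as UTF-8 with exactly one BOM. */
final class TxtToUtf8 {
    private static final byte[] UTF8_BOM = {(byte) 0xEF, (byte) 0xBB, (byte) 0xBF};

    /** Decodes src using sourceCharset and writes it to dst as BOM + UTF-8 bytes. */
    static void convert(Path src, String sourceCharset, Path dst) throws IOException {
        String content = new String(Files.readAllBytes(src), Charset.forName(sourceCharset));
        // Drop a BOM that survived decoding so the output does not contain two of them
        if (content.startsWith("\uFEFF")) {
            content = content.substring(1);
        }
        byte[] body = content.getBytes(StandardCharsets.UTF_8);
        byte[] out = new byte[UTF8_BOM.length + body.length];
        System.arraycopy(UTF8_BOM, 0, out, 0, UTF8_BOM.length);
        System.arraycopy(body, 0, out, UTF8_BOM.length, body.length);
        Files.write(dst, out);
    }
}
```

A call such as TxtToUtf8.convert(Paths.get("in.txt"), "GB2312", Paths.get("out.txt")) would then produce the file that is handed to the PDF conversion; the file names here are placeholders.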

Notes on TXT encodings and file headers

Mapping between Java encoding names and TXT encoding names
java	txt
unicode	unicode big endian
utf-8	utf-8
utf-16	unicode
gb2312	ANSI
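
One way to check which header bytes Java itself writes for a given charset name is to encode a short string and print the result as hex. This is only an illustrative sketch; the class EncoderBomDemo and its hex method are hypothetical names.

```java
import java.nio.charset.Charset;

/** Hypothetical helper: shows the raw bytes a Java charset produces, BOM included. */
final class EncoderBomDemo {

    /** Encodes text with the named charset and returns the bytes as upper-case hex pairs. */
    static String hex(String text, String charsetName) {
        byte[] bytes = text.getBytes(Charset.forName(charsetName));
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            // Mask to 0..255 so negative bytes print as two hex digits
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }
}
```

With this helper, hex("A", "UTF-16") returns FE FF 00 41 (a big-endian BOM followed by the character), while hex("A", "UTF-8") returns just 41, which is why the UTF-8 header has to be written by hand.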

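What is a BOM?

A BOM (byte-order mark) is a special marker inserted at the start of a Unicode file encoded as UTF-8, UTF-16 or UTF-32, used to identify the file's encoding. For UTF-8 the BOM is not required, because UTF-8 has no byte-order ambiguity; the BOM's main purpose is to mark the encoding type and byte order (big-endian or little-endian) of encodings whose code units span several bytes.

BOM file headers: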

00 00 FE FF = UTF-32, big-endian

FF FE 00 00 = UTF-32, little-endian

EF BB BF = UTF-8

FE FF = UTF-16, big-endian

FF FE = UTF-16, little-endian
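
The table above can be turned into a lookup that covers all five headers, not only the two-byte prefixes. A minimal sketch, assuming the caller supplies the first few bytes of the file; BomSniffer and detect are hypothetical names. The order of the checks matters because the UTF-32 little-endian header begins with the same two bytes as the UTF-16 little-endian one.

```java
/** Hypothetical helper: maps the leading bytes of a file to a charset name via its BOM. */
final class BomSniffer {

    /** Returns the charset implied by the BOM, or null when no known BOM is present. */
    static String detect(byte[] head) {
        // UTF-32 is tested before UTF-16: FF FE 00 00 also starts with FF FE
        if (startsWith(head, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(head, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(head, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(head, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(head, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    /** Compares the first bytes of data, read as unsigned values, with the prefix. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A null result means the encoding has to be decided some other way, for example by falling back to a default as the getCharset method does.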

Note: jodconverter 2.2.1 does not support converting docx, xlsx or ppt files to PDF.