Java: Converting Word to html/htm with POI

Date: 2019-11-07

I won't say much about POI itself here; the official site is https://poi.apache.org/, have a look around.

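First, let's get clear on how to convert a doc/docx document to html/htm. According to the POI documentation, the POI API for the doc format is HWPF, and for the docx format it is XWPF. This article is a good reference and explains the format conversion clearly: http://www.open-open.com/lib/view/open1389594797523.html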

So overall: depending on the document type, doc is converted with HWPF objects and docx with XWPF objects.
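
As a rough sketch of that dispatch, the choice can be driven by the file extension. The class, enum and method names below are made up for illustration and are not part of POI; the sketch only decides which component applies and does not touch any document.

```java
import java.util.Locale;

// Hypothetical helper (not POI API): maps a file name to the POI component
// that would handle it, based only on the extension.
class WordFormatSketch {

    enum WordApi { HWPF, XWPF, UNSUPPORTED }

    static WordApi apiFor(String fileName) {
        // Use the LAST dot so names such as "report.v2.DOC" still work
        int dot = fileName.lastIndexOf('.');
        if (dot < 0) {
            return WordApi.UNSUPPORTED;
        }
        String suffix = fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        switch (suffix) {
            case "doc":
                return WordApi.HWPF;   // binary Word format
            case "docx":
                return WordApi.XWPF;   // OOXML Word format
            default:
                return WordApi.UNSUPPORTED;
        }
    }

    public static void main(String[] args) {
        System.out.println(apiFor("report.v2.DOC")); // HWPF
        System.out.println(apiFor("report.docx"));   // XWPF
        System.out.println(apiFor("notes.txt"));     // UNSUPPORTED
    }
}
```

Taking the extension after the last dot, and lower-casing it, avoids misclassifying file names that contain several dots or upper-case suffixes.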

For reference, here is the online API documentation: http://poi.apache.org/apidocs/index.html. I have to say it is quite tedious to read, especially the XWPF part.

1. Handling doc.

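This one is relatively simple, and a quick search turns up plenty of examples; my code is also based on what I found online, with my own optimizations and logic added.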

POI has supported doc for a long time, so there is plenty of material.

The idea: instantiate an HWPFDocument from the file stream -> have a WordToHtmlConverter process the HWPFDocument and pre-process the page's resources (mainly pictures)

Its documentation says:

Converts Word files (95-2007) into HTML files.
This implementation doesn't create images or links to them. This can be changed by overriding AbstractWordConverter.processImage(Element, boolean, Picture) method.

-> get the org.w3c.dom.Document (the DOM object) that the WordToHtmlConverter has built -> write the output file.
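
For the picture step, the demo further down plugs in a PicturesManager whose savePicture callback returns the name used to reference each picture. Below is a minimal sketch of what such a callback could delegate to, using only the JDK. PictureStoreSketch, its save method and the "images" folder name are my own assumptions, not POI API. It writes the bytes under the HTML output directory and returns a relative path.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper (not POI API): stores picture bytes in a sub-folder of
// the HTML output directory and returns the relative reference for the page.
class PictureStoreSketch {

    private final Path htmlDir;
    private final String imageSubDir;

    PictureStoreSketch(Path htmlDir, String imageSubDir) {
        this.htmlDir = htmlDir;
        this.imageSubDir = imageSubDir;
    }

    // Throws an unchecked exception so it can be called from a callback
    // that does not declare IOException.
    String save(byte[] content, String suggestedName) {
        try {
            Path dir = htmlDir.resolve(imageSubDir);
            Files.createDirectories(dir);
            // Keep only the last path segment so the name stays inside dir
            String safeName = Paths.get(suggestedName).getFileName().toString();
            Files.write(dir.resolve(safeName), content);
            return imageSubDir + "/" + safeName;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path out = Files.createTempDirectory("word2html");
        PictureStoreSketch store = new PictureStoreSketch(out, "images");
        String ref = store.save(new byte[] {1, 2, 3}, "pic1.png");
        System.out.println(ref); // images/pic1.png
    }
}
```

Inside a PicturesManager, the body of savePicture could then be something like `return store.save(content, suggestedName);`, so the file on disk and the reference in the page cannot drift apart.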

A nice side effect is that the output goes through a Document object, which takes care of the encoding and file-format issues.
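
To show that last step in isolation, here is a small sketch that serializes a hand-built DOM tree to HTML bytes with the JDK's Transformer. It uses no POI classes; DomToHtmlSketch and its method names are made up for illustration, and the sample tree stands in for whatever the converter would produce.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Illustration only: DOM tree -> HTML bytes in an explicitly chosen encoding.
class DomToHtmlSketch {

    static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element html = doc.createElement("html");
            Element body = doc.createElement("body");
            Element p = doc.createElement("p");
            p.setTextContent("Hello from a DOM tree");
            body.appendChild(p);
            html.appendChild(body);
            doc.appendChild(html);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    static byte[] toHtmlBytes(Document doc) {
        try {
            Transformer serializer = TransformerFactory.newInstance().newTransformer();
            serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8"); // byte encoding
            serializer.setOutputProperty(OutputKeys.METHOD, "html");    // HTML, not XML
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            serializer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toByteArray();
        } catch (TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        byte[] bytes = toHtmlBytes(sampleDocument());
        // Decode with the same charset that was requested from the serializer
        System.out.println(new String(bytes, StandardCharsets.UTF_8));
    }
}
```

The point to watch is that the bytes are decoded, or written to disk, with the same charset that was set through OutputKeys.ENCODING; relying on the platform default at that step is an easy way to get garbled text.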

Since the process is simple, here is a straightforward demo; see the comments:

import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.apache.commons.io.FileUtils;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.converter.PicturesManager;
import org.apache.poi.hwpf.converter.WordToHtmlConverter;
import org.apache.poi.hwpf.usermodel.Picture;
import org.apache.poi.hwpf.usermodel.PictureType;
import org.w3c.dom.Document;

public class POIForeViewUtil {

	// Converts a .doc file to HTML with HWPF
	public void parseDoc2Html() throws Throwable {
		final String path = "F:\\";
		final String file = "xxxxxxx.doc";
		// Take the extension after the LAST dot, so names like "a.b.doc" work
		String suffix = file.substring(file.lastIndexOf(".") + 1);
		if (!"doc".equalsIgnoreCase(suffix)) {
			System.out.println("Only .doc files are handled here");
			return;
		}

		// Instantiate WordToHtmlConverter on an empty DOM document
		WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
				DocumentBuilderFactory.newInstance().newDocumentBuilder()
						.newDocument());
		// The returned name is what the generated HTML uses to reference the
		// picture; the picture files themselves are written further down
		wordToHtmlConverter.setPicturesManager(new PicturesManager() {
			public String savePicture(byte[] content, PictureType pictureType,
					String suggestedName, float widthInches, float heightInches) {
				return suggestedName;
			}
		});

		// doc: load the document, closing the input stream afterwards
		HWPFDocument wordDocument;
		try (InputStream input = new FileInputStream(path + file)) {
			wordDocument = new HWPFDocument(input);
		}
		wordToHtmlConverter.processDocument(wordDocument);
		// Save the pictures into the same directory as the HTML output
		List<Picture> pics = wordDocument.getPicturesTable().getAllPictures();
		if (pics != null) {
			for (Picture pic : pics) {
				try (OutputStream out = new FileOutputStream(path
						+ pic.suggestFullFileName())) {
					pic.writeImageContent(out);
				}
			}
		}

		// Convert the DOM to HTML text
		Document htmlDocument = wordToHtmlConverter.getDocument();
		ByteArrayOutputStream outStream = new ByteArrayOutputStream();
		DOMSource domSource = new DOMSource(htmlDocument);
		StreamResult streamResult = new StreamResult(outStream);
		TransformerFactory tf = TransformerFactory.newInstance();
		Transformer serializer = tf.newTransformer();
		serializer.setOutputProperty(OutputKeys.ENCODING, "utf-8");// output encoding
		serializer.setOutputProperty(OutputKeys.INDENT, "yes");// indent the output
		serializer.setOutputProperty(OutputKeys.METHOD, "html");// output method
		serializer.transform(domSource, streamResult);
		outStream.close();
		// Decode with the same charset the serializer used, not the platform default
		String content = new String(outStream.toByteArray(), "utf-8");
		FileUtils.writeStringToFile(new File(path, "interface.html"), content,
				"utf-8");
	}

	public static void main(String[] args) throws Throwable {
		new POIForeViewUtil().parseDoc2Html();
	}

}

Now for the second case.

2. Handling docx.

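docx is the 2007 format and is much harder to handle. POI's docx handling does not seem as convenient as its doc handling, and there are problems with styles and so on. The two most obvious problems I ran into were the font encoding and the display of table borders.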

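The idea: load the file stream into an XWPFDocument -> let XHTMLOptions handle the page resources (mainly pictures) -> write the file directly through an OutputStream. Note that XHTMLConverter and XHTMLOptions are not in the POI distribution itself; they come from the separate XDocReport converter jars.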

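The code is quite simple, but the simpler it is, the further the result falls short of expectations. The font names in the output file appear to be encoded as GBK by default; for example, my "微軟雅黑" (Microsoft YaHei) font turned into "寰蔣闆呴粦". The nodes are also not rendered as well as in the doc case.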
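
I have not traced this inside the converter, but the garbling is at least consistent with UTF-8 bytes being decoded as GBK somewhere along the way. A minimal JDK-only sketch of that effect, using the font name from above (MojibakeSketch and decode are names I made up, and the sketch assumes the JVM ships the GBK charset):

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Illustration only: the same bytes read back with two different charsets.
class MojibakeSketch {

    static String decode(byte[] bytes, String charsetName) {
        return new String(bytes, Charset.forName(charsetName));
    }

    public static void main(String[] args) {
        // "Microsoft YaHei" in simplified Chinese, written with escapes so the
        // source file's own encoding does not matter
        String fontName = "\u5fae\u8f6f\u96c5\u9ed1";
        byte[] utf8 = fontName.getBytes(StandardCharsets.UTF_8);

        // Decoding with the charset used for encoding restores the text
        System.out.println(decode(utf8, "UTF-8").equals(fontName)); // true
        // Decoding the same bytes as GBK yields different characters
        System.out.println(decode(utf8, "GBK").equals(fontName));   // false
    }
}
```

If that is what happens here, the fix would be on the reading side (for example, declaring the charset in the page or opening the file as UTF-8) rather than in the document itself.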

Here is the demo code as well:

import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

import org.apache.poi.xwpf.converter.core.FileImageExtractor;
import org.apache.poi.xwpf.converter.core.FileURIResolver;
import org.apache.poi.xwpf.converter.xhtml.XHTMLConverter;
import org.apache.poi.xwpf.converter.xhtml.XHTMLOptions;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class Word07ToHtml {

	public static void parseToHtml() throws IOException {
		File f = new File("F:/xxxxx.docx");
		if (!f.exists()) {
			System.out.println("Sorry, the file does not exist!");
			return;
		}
		if (!f.getName().toLowerCase().endsWith(".docx")) {
			System.out.println("Enter only MS Office 2007+ files");
			return;
		}

		// Both streams are closed automatically when the block exits
		try (InputStream in = new FileInputStream(f);
				OutputStream out = new FileOutputStream(new File(
						"F:/result.html"))) {

			// 1) Load the file into an XWPFDocument
			XWPFDocument document = new XWPFDocument(in);

			// 2) Prepare the XHTML options (pictures and other resources are
			// extracted into a generated "word/media" directory)
			File imageFolderFile = new File("f:/opt");
			XHTMLOptions options = XHTMLOptions.create().URIResolver(
					new FileURIResolver(imageFolderFile));
			options.setExtractor(new FileImageExtractor(imageFolderFile));
			//options.setIgnoreStylesIfUnused(false);
			//options.setFragment(true);

			// 3) Convert the XWPFDocument to XHTML and write the file; the
			// options must be passed here, otherwise they have no effect
			XHTMLConverter.getInstance().convert(document, out, options);
		}
	}
	
	public static void main(String args[]) {
		try {
		    //String string = new String("寰蔣闆呴粦".getBytes("GBK"), "UTF-8");
		    //System.out.println(string);
			parseToHtml();
		} catch (IOException e) {
			e.printStackTrace();
		}
	}
}

Since both demos have already been moved out of the project, there are no screenshots.

Download link for the POI jars:

https://archive.apache.org/dist/poi/release/bin/poi-bin-3.9-20121203.zip