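I won't say much about POI itself here. The official site is https://poi.apache.org/ if you want to browse around.
First, let's get clear on how to convert doc/docx documents to html/htm. According to the POI documentation, the POI API for .doc files is HWPF, and for .docx files it is XWPF. This article explains the format conversion very clearly: http://www.open-open.com/lib/view/open1389594797523.html
So overall: depending on the document type, we convert doc with the HWPF objects and docx with the XWPF objects.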
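To make that split concrete, here is a minimal sketch of the dispatch step using only the JDK. The class name WordFormatDispatch, the WordApi enum and the apiFor method are names I made up for illustration; they are not part of POI. In a real converter each branch would go on to build an HWPFDocument or an XWPFDocument.

```java
import java.util.Locale;

// Hypothetical helper (not part of POI): decides which POI API family
// a Word file needs, based only on its file extension.
final class WordFormatDispatch {

    enum WordApi { HWPF, XWPF }

    static WordApi apiFor(String fileName) {
        // Use the last dot so names like "report.v2.docx" still work.
        int dot = fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        switch (ext) {
            case "doc":
                return WordApi.HWPF;   // binary Word format
            case "docx":
                return WordApi.XWPF;   // OOXML Word format
            default:
                throw new IllegalArgumentException("Not a Word file: " + fileName);
        }
    }

    public static void main(String[] args) {
        System.out.println(apiFor("report.doc"));
        System.out.println(apiFor("Report.v2.DOCX"));
    }
}
```

Checking only the extension is the same shortcut both demos in this post take; it does not look at the file contents.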
While I'm at it, here is the online API documentation: http://poi.apache.org/apidocs/index.html. I have to say it is quite painful to read, especially the XWPF part.
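1. Handling doc.
This one is relatively simple. A quick search turns up plenty of examples, and my code is based on those, with my own tweaks and logic.
Because POI has supported doc for a long time, there is a lot of material around.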
The idea is: load the file stream into an HWPFDocument -> let WordToHtmlConverter process the HWPFDocument and take care of page resources (mainly pictures) -> get the org.w3c.dom.Document that the converter has built -> write it out as a file.
For WordToHtmlConverter, the documentation says:
Converts Word files (95-2007) into HTML files. This implementation doesn't create images or links to them. This can be changed by overriding AbstractWordConverter.processImage(Element, boolean, Picture) method.
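One advantage here is that the result is a Document object, so the encoding and the output format are set when the DOM is serialized, which takes care of those problems.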
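The serialization step can be tried on its own with just the JDK. The sketch below builds a tiny DOM by hand in place of the one WordToHtmlConverter would produce; the class name DomToHtmlSketch and its helper methods are made up for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical sketch: serialize a DOM Document to HTML bytes in UTF-8.
final class DomToHtmlSketch {

    static byte[] toHtmlBytes(Document doc) throws Exception {
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.ENCODING, "UTF-8"); // output encoding
        serializer.setOutputProperty(OutputKeys.METHOD, "html");    // HTML output rules
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        serializer.transform(new DOMSource(doc), new StreamResult(out));
        return out.toByteArray();
    }

    // Stand-in for the DOM that a converter would hand back.
    static Document sampleDocument() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        Element html = doc.createElement("html");
        Element body = doc.createElement("body");
        Element p = doc.createElement("p");
        p.setTextContent("\u5FAE\u8F6F\u96C5\u9ED1"); // a Chinese font name as sample text
        body.appendChild(p);
        html.appendChild(body);
        doc.appendChild(html);
        return doc;
    }

    public static void main(String[] args) throws Exception {
        byte[] bytes = toHtmlBytes(sampleDocument());
        // Decode with the same charset that was used for encoding.
        System.out.println(new String(bytes, StandardCharsets.UTF_8));
    }
}
```

The point to notice is that the bytes are decoded (or written to disk) with the same charset that was passed to OutputKeys.ENCODING; mixing the two is an easy way to get garbled text.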
Since the process is simple, here is a plain demo; the comments should be enough:
import java.io.ByteArrayOutputStream;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.apache.commons.io.FileUtils;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.converter.PicturesManager;
import org.apache.poi.hwpf.converter.WordToHtmlConverter;
import org.apache.poi.hwpf.usermodel.Picture;
import org.apache.poi.hwpf.usermodel.PictureType;
import org.w3c.dom.Document;

public class POIForeViewUtil {

    // Despite the name, this method handles the .doc (HWPF) case.
    public void parseDocx2Html() throws Throwable {
        final String path = "F:\\";
        final String file = "xxxxxxx.doc";
        // Take the file extension (the text after the last dot)
        String suffix = file.substring(file.lastIndexOf(".") + 1);

        // Create the WordToHtmlConverter and tell it how pictures should be referenced
        WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
                DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument());
        wordToHtmlConverter.setPicturesManager(new PicturesManager() {
            public String savePicture(byte[] content, PictureType pictureType,
                    String suggestedName, float widthInches, float heightInches) {
                // The returned value is used as the image reference in the HTML
                return suggestedName;
            }
        });

        if ("doc".equals(suffix.toLowerCase())) { // doc
            try (InputStream input = new FileInputStream(path + file)) {
                HWPFDocument wordDocument = new HWPFDocument(input);
                wordToHtmlConverter.processDocument(wordDocument);
                // Save the pictures into the same directory as the HTML output,
                // under the names returned by savePicture above
                List pics = wordDocument.getPicturesTable().getAllPictures();
                if (pics != null) {
                    for (int i = 0; i < pics.size(); i++) {
                        Picture pic = (Picture) pics.get(i);
                        try (OutputStream picOut =
                                new FileOutputStream(path + pic.suggestFullFileName())) {
                            pic.writeImageContent(picOut);
                        } catch (FileNotFoundException e) {
                            e.printStackTrace();
                        }
                    }
                }
            }
        }

        // Convert: serialize the DOM built by the converter
        Document htmlDocument = wordToHtmlConverter.getDocument();
        ByteArrayOutputStream outStream = new ByteArrayOutputStream();
        DOMSource domSource = new DOMSource(htmlDocument);
        StreamResult streamResult = new StreamResult(outStream);
        TransformerFactory tf = TransformerFactory.newInstance();
        Transformer serializer = tf.newTransformer();
        serializer.setOutputProperty(OutputKeys.ENCODING, "utf-8"); // output encoding
        serializer.setOutputProperty(OutputKeys.INDENT, "yes");     // indent the output
        serializer.setOutputProperty(OutputKeys.METHOD, "html");    // output type
        serializer.transform(domSource, streamResult);
        outStream.close();
        // Decode with the same charset the serializer used, not the platform default
        String content = new String(outStream.toByteArray(), "utf-8");
        FileUtils.writeStringToFile(new File(path, "interface.html"), content, "utf-8");
    }

    public static void main(String[] args) throws Throwable {
        new POIForeViewUtil().parseDocx2Html();
    }
}
Now for the second case.
2. Handling docx.
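docx is the Word 2007 format, and it is much harder to deal with. POI's handling of docx does not seem to be as convenient as for doc, and styles and the like all have problems. The two most obvious issues I ran into were the encoding of font names and the display of table borders.
The idea: load the file stream into an XWPFDocument -> XHTMLOptions handles the page resources (mainly pictures) -> an OutputStream writes the file directly.
The code is quite simple, but the simpler it is, the further the result falls from what you expect. The font names in the output file appear to be encoded as GBK by default; for example, my 「微軟雅黑」 font became 「寰蔣闆呴粦」. The rendering of the elements is also not as good as with the doc approach.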
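That kind of garbling is what you typically get when UTF-8 bytes are decoded as GBK. I have not confirmed that this is what happens inside the converter, so treat the following as a sketch of the symptom only. The class name MojibakeSketch and its methods are made up, and the font name is written with Unicode escapes (in its simplified-Chinese spelling) so the source file's own encoding does not matter.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch: shows how decoding UTF-8 bytes with the wrong charset
// garbles a Chinese font name, and that decoding as UTF-8 restores it.
final class MojibakeSketch {

    // The font name from the post, in simplified characters.
    static final String FONT_NAME = "\u5FAE\u8F6F\u96C5\u9ED1";

    // GBK lives in the JDK's extended charsets; fall back to ISO-8859-1
    // if this JVM does not provide it, which still demonstrates the mismatch.
    static Charset wrongCharset() {
        return Charset.isSupported("GBK") ? Charset.forName("GBK") : StandardCharsets.ISO_8859_1;
    }

    static String decodeAs(byte[] bytes, Charset charset) {
        return new String(bytes, charset);
    }

    public static void main(String[] args) {
        byte[] utf8 = FONT_NAME.getBytes(StandardCharsets.UTF_8);
        System.out.println("wrong charset: " + decodeAs(utf8, wrongCharset()));
        System.out.println("as UTF-8:      " + decodeAs(utf8, StandardCharsets.UTF_8));
    }
}
```

If this is indeed the cause, the fix belongs wherever the bytes get decoded, not in the text afterwards: converting the garbled string back is not reliable, because some byte pairs have no GBK character and are lost on the way.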
Again, here is the demo code:
import java.io.File;
import java.io.FileInputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;

// Note: the org.apache.poi.xwpf.converter.* classes come from the XDocReport
// project's converter jars, not from the POI distribution itself.
import org.apache.poi.xwpf.converter.core.FileImageExtractor;
import org.apache.poi.xwpf.converter.core.FileURIResolver;
import org.apache.poi.xwpf.converter.xhtml.XHTMLConverter;
import org.apache.poi.xwpf.converter.xhtml.XHTMLOptions;
import org.apache.poi.xwpf.usermodel.XWPFDocument;

public class Word07ToHtml {

    public static void parseToHtml() throws IOException {
        File f = new File("F:/xxxxx.docx");
        if (!f.exists()) {
            System.out.println("Sorry File does not Exists!");
        } else {
            if (f.getName().endsWith(".docx") || f.getName().endsWith(".DOCX")) {
                // 1) Prepare the XHTML options (pictures and other resources
                //    end up in a generated "word/media" directory under this folder)
                File imageFolderFile = new File("f:/opt");
                XHTMLOptions options = XHTMLOptions.create().URIResolver(
                        new FileURIResolver(imageFolderFile));
                options.setExtractor(new FileImageExtractor(imageFolderFile));
                //options.setIgnoreStylesIfUnused(false);
                //options.setFragment(true);

                try (InputStream in = new FileInputStream(f);
                        OutputStream out = new FileOutputStream(new File("F:/result.html"))) {
                    // 2) Load the file into an XWPFDocument
                    XWPFDocument document = new XWPFDocument(in);
                    // 3) Convert the XWPFDocument to XHTML and write the file;
                    //    pass the options so the picture settings take effect
                    XHTMLConverter.getInstance().convert(document, out, options);
                }
            } else {
                System.out.println("Enter only MS Office 2007+ files");
            }
        }
    }

    public static void main(String args[]) {
        try {
            //String string = new String("寰蔣闆呴粦".getBytes("GBK"), "UTF-8");
            //System.out.println(string);
            parseToHtml();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }
}
Since I have already moved both demos out of the project, there are no screenshots.
Download link for the POI jars:
https://archive.apache.org/dist/poi/release/bin/poi-bin-3.9-20121203.zip