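I recently had a requirement where the front end uploads a manuscript in Word format. After the upload, the user should be able to view the manuscript directly in the browser and quote it directly in a rich-text editor. That means what the back end saves to the DB has to be HTML, so the Word file must be converted to HTML.
After some research, I found two options for the Word-to-HTML step. Option 1 is to do the conversion on the front end. Option 2 is to upload the Word document to the back end, convert it there, and return the result to the front end. The feedback I found on option 1 reported a lot of problems, so I did not go with front-end conversion. The final decision was that the back end converts the document to HTML and returns it to the front end for preview; once the user has confirmed that the formatting looks right, the HTML is saved to the back end. (Word documents can contain many kinds of complex elements, such as images, Visio diagrams, and tables, and converting them to HTML can cause formatting problems, which is why a preview step is needed.)
Plain text in a Word document is not much of a problem. The real work is in handling the non-text elements, such as images, videos, and tables. In this article I only handle images. The approach is: the back end extracts the images from the Word document (the imported jars already provide this), uploads them to the server, and puts the resulting absolute URL into the HTML. The HTML returned to the front end can then be previewed directly.
The Maven dependencies are as follows: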
```xml
<!-- Version properties (these go inside <properties>) -->
<poi-scratchpad.version>3.14</poi-scratchpad.version>
<poi-ooxml.version>3.14</poi-ooxml.version>
<xdocreport.version>1.0.6</xdocreport.version>
<poi-ooxml-schemas.version>3.14</poi-ooxml-schemas.version>
<ooxml-schemas.version>1.3</ooxml-schemas.version>
<jsoup.version>1.11.3</jsoup.version>

<!-- Dependencies (these go inside <dependencies>) -->
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-scratchpad</artifactId>
    <version>${poi-scratchpad.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml</artifactId>
    <version>${poi-ooxml.version}</version>
</dependency>
<dependency>
    <groupId>fr.opensagres.xdocreport</groupId>
    <artifactId>xdocreport</artifactId>
    <version>${xdocreport.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>poi-ooxml-schemas</artifactId>
    <version>${poi-ooxml-schemas.version}</version>
</dependency>
<dependency>
    <groupId>org.apache.poi</groupId>
    <artifactId>ooxml-schemas</artifactId>
    <version>${ooxml-schemas.version}</version>
</dependency>
<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>${jsoup.version}</version>
</dependency>
```
Converting Word to HTML works differently for Word 2003 (.doc) and Word 2007 (.docx), because the two file formats are different. The utility class is as follows:
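```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import java.util.Map;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;

import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.converter.PicturesManager;
import org.apache.poi.hwpf.converter.WordToHtmlConverter;
import org.apache.poi.hwpf.usermodel.PictureType;
// Package names below are the ones used by xdocreport 1.0.x.
import org.apache.poi.xwpf.converter.core.IImageExtractor;
import org.apache.poi.xwpf.converter.core.IURIResolver;
import org.apache.poi.xwpf.converter.xhtml.XHTMLConverter;
import org.apache.poi.xwpf.converter.xhtml.XHTMLOptions;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.springframework.beans.factory.annotation.Autowired;
import org.springframework.stereotype.Component;
import org.w3c.dom.Document;

/**
 * UploadFileUtil and AliOssUtil are classes from my own project (not shown here).
 *
 * @author ismallboy
 * @date 2020/2/19
 */
@Component
public class WordToHtmlUtil {

    @Autowired
    private UploadFileUtil uploadFileUtil;

    /**
     * Converts a Word 2003 (.doc) document to HTML.
     *
     * @param input      stream of the .doc file
     * @param bucket     storage bucket the extracted images are uploaded to
     * @param directory  directory inside the bucket
     * @param visitPoint URL prefix used to build the public image URL
     * @return the HTML content
     */
    public String Word2003ToHtml(InputStream input, String bucket, String directory, String visitPoint)
            throws IOException, TransformerException, ParserConfigurationException {
        HWPFDocument wordDocument = new HWPFDocument(input);
        WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
                DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument());
        // Upload every picture and return its URL; the converter uses it as the img src.
        wordToHtmlConverter.setPicturesManager(new PicturesManager() {
            @Override
            public String savePicture(byte[] content, PictureType pictureType, String suggestedName,
                                      float widthInches, float heightInches) {
                String fileName = AliOssUtil.generateImageFileName()
                        + suggestedName.substring(suggestedName.lastIndexOf("."));
                return uploadFileUtil.uploadFile(content, bucket, directory, fileName, visitPoint);
            }
        });
        // Parse the Word document
        wordToHtmlConverter.processDocument(wordDocument);
        Document htmlDocument = wordToHtmlConverter.getDocument();

        // Serialize straight into the byte array so nothing is left in an unflushed buffer.
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.ENCODING, "utf-8");
        serializer.setOutputProperty(OutputKeys.INDENT, "yes");
        serializer.setOutputProperty(OutputKeys.METHOD, "html");
        serializer.transform(new DOMSource(htmlDocument), new StreamResult(baos));
        // Decode with the same charset the serializer wrote, not the platform default.
        return baos.toString("UTF-8");
    }

    /**
     * Converts a Word 2007 (.docx) document to HTML.
     *
     * @param input      stream of the .docx file
     * @param bucket     storage bucket the extracted images are uploaded to
     * @param directory  directory inside the bucket
     * @param visitPoint URL prefix used to build the public image URL
     * @return the HTML content
     */
    public String Word2007ToHtml(InputStream input, String bucket, String directory, String visitPoint)
            throws IOException {
        XWPFDocument document = new XWPFDocument(input);
        // 2) Configure the XHTML options: the extractor uploads each image,
        //    and the URI resolver maps the in-document path to the uploaded URL.
        XHTMLOptions options = XHTMLOptions.create();
        Map<String, String> imgMap = new HashMap<>();
        options.setExtractor(new IImageExtractor() {
            @Override
            public void extract(String imagePath, byte[] imageData) throws IOException {
                // Take the image data and upload it
                String fileName = AliOssUtil.generateImageFileName()
                        + imagePath.substring(imagePath.lastIndexOf("."));
                String imgUrl = uploadFileUtil.uploadFile(imageData, bucket, directory, fileName, visitPoint);
                imgMap.put(imagePath, imgUrl);
            }
        });
        // Image path written into the HTML: the uploaded URL recorded above
        options.URIResolver(new IURIResolver() {
            @Override
            public String resolve(String uri) {
                return imgMap.get(uri);
            }
        });
        options.setIgnoreStylesIfUnused(false);
        options.setFragment(true);

        // 3) Convert the XWPFDocument to XHTML
        ByteArrayOutputStream baos = new ByteArrayOutputStream();
        XHTMLConverter.getInstance().convert(document, baos, options);
        String content = baos.toString();
        baos.close();
        return content;
    }
}
```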
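The last step of the Word 2003 path (turning the converter's DOM into an HTML string) can be tried on its own with only the JDK. The sketch below is not part of my project: `DomHtmlSerializer` and `sampleDocument` are names made up for illustration, and the tiny DOM stands in for what `WordToHtmlConverter.getDocument()` would return. The point it shows is that the bytes should be decoded with the same charset the serializer was told to write.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical helper, for illustration only.
final class DomHtmlSerializer {

    /** Serializes a DOM tree as HTML, encoding and decoding with UTF-8. */
    static String toHtml(Document doc) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.METHOD, "html");
            t.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
            t.transform(new DOMSource(doc), new StreamResult(out));
            // Decode with the charset requested above, not the platform default.
            return new String(out.toByteArray(), StandardCharsets.UTF_8);
        } catch (TransformerException e) {
            throw new IllegalStateException("HTML serialization failed", e);
        }
    }

    /** Builds a tiny html/body/p document as a stand-in for the converter output. */
    static Document sampleDocument() {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
            Element html = doc.createElement("html");
            Element body = doc.createElement("body");
            Element p = doc.createElement("p");
            p.setTextContent("\u4f60\u597d hello"); // non-ASCII text plus ASCII text
            body.appendChild(p);
            html.appendChild(body);
            doc.appendChild(html);
            return doc;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException("Could not create a DOM document", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(toHtml(sampleDocument()));
    }
}
```

If the decode step used the platform default charset instead, non-ASCII text could come out garbled on a server whose default is not UTF-8.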
Usage is as follows:
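```java
// Method of a Spring service. imageDir, imageBucket, imageVisitHost, wordToHtmlUtil
// and logger are fields of the enclosing class; MultipartFile is Spring's
// org.springframework.web.multipart.MultipartFile.
public String uploadSourceNews(MultipartFile file) {
    String fileName = file.getOriginalFilename();
    // Reject a missing file name or a name without an extension.
    if (fileName == null || fileName.lastIndexOf(".") < 0) {
        throw new UploadFileFormatException();
    }
    String suffixName = fileName.substring(fileName.lastIndexOf("."));
    if (!".doc".equals(suffixName) && !".docx".equals(suffixName)) {
        throw new UploadFileFormatException();
    }

    DateTimeFormatter formatter = DateTimeFormatter.ofPattern("yyyyMM");
    String dateDir = formatter.format(LocalDate.now());
    String directory = imageDir + "/" + dateDir + "/";

    String content = null;
    // try-with-resources closes the upload stream when the conversion is done.
    try (InputStream inputStream = file.getInputStream()) {
        // suffixName includes the dot, so compare against ".doc" (not "doc").
        if (".doc".equals(suffixName)) {
            content = wordToHtmlUtil.Word2003ToHtml(inputStream, imageBucket, directory,
                    Constants.HTTPS_PREFIX + imageVisitHost);
        } else {
            content = wordToHtmlUtil.Word2007ToHtml(inputStream, imageBucket, directory,
                    Constants.HTTPS_PREFIX + imageVisitHost);
        }
    } catch (Exception ex) {
        logger.error("word to html exception, detail:", ex);
        return null;
    }
    return content;
}
```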
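The method above picks the converter from the file extension, which trusts whatever name the client sent. As a possible extra check (not something my project does), the first bytes of the upload can be inspected: a legacy .doc is an OLE2 compound file and starts with the bytes `D0 CF 11 E0 A1 B1 1A E1`, while a .docx is a ZIP archive and starts with `PK\x03\x04`. `WordFormatSniffer` and its `Format` enum are hypothetical names. Note that these signatures only identify the container: .xls files are OLE2 too, and any ZIP file starts with `PK`, so this is a sanity check and no guarantee.

```java
// Hypothetical helper, for illustration only.
final class WordFormatSniffer {

    enum Format { OLE2_BASED, ZIP_BASED, UNKNOWN }

    // Signature of an OLE2 compound file (container of legacy .doc files).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Local file header signature of a ZIP archive (container of .docx files).
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 3, 4 };

    /** Classifies a file by its leading bytes; pass at least the first 8 bytes. */
    static Format sniff(byte[] head) {
        if (startsWith(head, OLE2_MAGIC)) {
            return Format.OLE2_BASED;
        }
        if (startsWith(head, ZIP_MAGIC)) {
            return Format.ZIP_BASED;
        }
        return Format.UNKNOWN;
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(sniff(new byte[] { 'P', 'K', 3, 4, 20, 0 }));
    }
}
```

In the upload method, the first few bytes could be read from the stream (for example through a `BufferedInputStream` with `mark` and `reset`) before handing the stream to the converter.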
Some background on how doc and docx files are stored:
DOCX is an XML-based word processing format developed by Microsoft. Unlike a DOC file, a DOCX file stores its data as separate files and folders packed into one compressed archive. Versions of Microsoft Office earlier than Office 2007 do not support DOCX, because those versions save a document as a single binary DOC file, while DOCX is XML based.
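You may be wondering: the file clearly ends in .docx, so how is it an XML format?
It is simple: pick any docx file and open it with an archive tool, and you will see a directory structure with entries such as [Content_Types].xml, _rels/, docProps/, and word/.
So what you think of as a single document is really just a compressed archive.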
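The same thing can be checked from code with the JDK's `java.util.zip` classes. The sketch below lists the entry names of an archive held in memory. `DocxEntryLister` and `demoArchive` are names made up for illustration, and `demoArchive` only builds a small stand-in ZIP so the example runs without a real Word file; it is not a valid docx.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;
import java.util.zip.ZipOutputStream;

// Hypothetical helper, for illustration only.
final class DocxEntryLister {

    /** Returns the entry names of a ZIP archive (for example a .docx) in stored order. */
    static List<String> listEntries(byte[] zipBytes) {
        List<String> names = new ArrayList<>();
        try (ZipInputStream zin = new ZipInputStream(new ByteArrayInputStream(zipBytes))) {
            ZipEntry entry;
            while ((entry = zin.getNextEntry()) != null) {
                names.add(entry.getName());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return names;
    }

    /** Builds a small in-memory ZIP with the given entry names (demo data, not a real docx). */
    static byte[] demoArchive(String... entryNames) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (ZipOutputStream zout = new ZipOutputStream(out)) {
            for (String name : entryNames) {
                zout.putNextEntry(new ZipEntry(name));
                zout.write("<x/>".getBytes(StandardCharsets.UTF_8));
                zout.closeEntry();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        byte[] archive = demoArchive("[Content_Types].xml", "word/document.xml");
        listEntries(archive).forEach(System.out::println);
    }
}
```

For a real file, the bytes could come from something like `Files.readAllBytes(Paths.get("example.docx"))`, where the path is a placeholder.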
Reference:
https://www.cnblogs.com/ct-csu/p/8178932.html