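Reading complex tables in Word 2007 (.docx) files with POI

Recently at work I needed to read the tables in a Word (.docx) file and output them as HTML. I used Apache POI for this.

For Word 2007 and later documents, you need to add poi-ooxml-xxx.jar and its dependencies, as shown below (the figure uses Maven): pom.xml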

 

 

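For simple tables, the content of every cell can be read like this:

import java.io.FileInputStream;
import java.util.List;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFTable;
import org.apache.poi.xwpf.usermodel.XWPFTableCell;
import org.apache.poi.xwpf.usermodel.XWPFTableRow;

XWPFDocument document = new XWPFDocument(new FileInputStream("word.docx"));
// Get all tables in the document
List<XWPFTable> tables = document.getTables();
for (XWPFTable table : tables) {
    // Get the rows of the table
    List<XWPFTableRow> rows = table.getRows();
    for (XWPFTableRow row : rows) {
        // Get every cell in the row
        List<XWPFTableCell> tableCells = row.getTableCells();
        for (XWPFTableCell cell : tableCells) {
             // Get the text content of the cell
             String text = cell.getText();
        }
    }
}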

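Complex tables (those containing merged cells), however, cannot be handled correctly this way.

So I kept searching online and found the following example on Stack Overflow, which generates a table containing merged cells: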

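import java.io.FileOutputStream;
import java.math.BigInteger;

import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.apache.poi.xwpf.usermodel.XWPFParagraph;
import org.apache.poi.xwpf.usermodel.XWPFRun;
import org.apache.poi.xwpf.usermodel.XWPFTable;
import org.apache.poi.xwpf.usermodel.XWPFTableCell;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTHMerge;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTTblWidth;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTTcPr;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.CTVMerge;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.STMerge;
import org.openxmlformats.schemas.wordprocessingml.x2006.main.STTblWidth;

public class CreateWordTableMerge {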

    static void mergeCellVertically(XWPFTable table, int col, int fromRow, int toRow) {
        for(int rowIndex = fromRow; rowIndex <= toRow; rowIndex++){
            CTVMerge vmerge = CTVMerge.Factory.newInstance();
            if(rowIndex == fromRow){
                // The first merged cell is set with RESTART merge value
                vmerge.setVal(STMerge.RESTART);
            } else {
                // Cells which join (merge) the first one, are set with CONTINUE
                vmerge.setVal(STMerge.CONTINUE);
            }
            XWPFTableCell cell = table.getRow(rowIndex).getCell(col);
            // Try getting the TcPr. Not simply setting an new one every time.
            CTTcPr tcPr = cell.getCTTc().getTcPr();
            if (tcPr != null) {
                tcPr.setVMerge(vmerge);
            } else {
                // only set an new TcPr if there is not one already
                tcPr = CTTcPr.Factory.newInstance();
                tcPr.setVMerge(vmerge);
                cell.getCTTc().setTcPr(tcPr);
            }
        }
    }

    static void mergeCellHorizontally(XWPFTable table, int row, int fromCol, int toCol) {
        for(int colIndex = fromCol; colIndex <= toCol; colIndex++){
            CTHMerge hmerge = CTHMerge.Factory.newInstance();
            if(colIndex == fromCol){
                // The first merged cell is set with RESTART merge value
                hmerge.setVal(STMerge.RESTART);
            } else {
                // Cells which join (merge) the first one, are set with CONTINUE
                hmerge.setVal(STMerge.CONTINUE);
            }
            XWPFTableCell cell = table.getRow(row).getCell(colIndex);
            // Try getting the TcPr. Not simply setting an new one every time.
            CTTcPr tcPr = cell.getCTTc().getTcPr();
            if (tcPr != null) {
                tcPr.setHMerge(hmerge);
            } else {
                // only set an new TcPr if there is not one already
                tcPr = CTTcPr.Factory.newInstance();
                tcPr.setHMerge(hmerge);
                cell.getCTTc().setTcPr(tcPr);
            }
        }
    }

    public static void main(String[] args) throws Exception {

        XWPFDocument document= new XWPFDocument();

        XWPFParagraph paragraph = document.createParagraph();
        XWPFRun run=paragraph.createRun();
        run.setText("The table:");

        //create table
        XWPFTable table = document.createTable(3,5);

        for (int row = 0; row < 3; row++) {
            for (int col = 0; col < 5; col++) {
                table.getRow(row).getCell(col).setText("row " + row + ", col " + col);
            }
        }

        //create and set column widths for all columns in all rows
        //most examples don't set the type of the CTTblWidth but this
        //is necessary for working in all office versions
        for (int col = 0; col < 5; col++) {
            CTTblWidth tblWidth = CTTblWidth.Factory.newInstance();
            tblWidth.setW(BigInteger.valueOf(1000));
            tblWidth.setType(STTblWidth.DXA);
            for (int row = 0; row < 3; row++) {
                CTTcPr tcPr = table.getRow(row).getCell(col).getCTTc().getTcPr();
                if (tcPr != null) {
                    tcPr.setTcW(tblWidth);
                } else {
                    tcPr = CTTcPr.Factory.newInstance();
                    tcPr.setTcW(tblWidth);
                    table.getRow(row).getCell(col).getCTTc().setTcPr(tcPr);
                }
            }
        }

        //using the merge methods
        mergeCellVertically(table, 0, 0, 1);
        mergeCellHorizontally(table, 1, 2, 3);
        mergeCellHorizontally(table, 2, 1, 4);

        paragraph = document.createParagraph();

        // try-with-resources closes the stream even if write() fails
        try (FileOutputStream out = new FileOutputStream("create_table.docx")) {
            document.write(out);
        }

        System.out.println("create_table.docx written successfully");
    }
}

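Running it does work, but I was still confused: I had no idea what cTTc, tcPr, vMerge and the like actually were.

That changed when I learned about Office Open XML (OOXML). If you rename a .docx file to .zip, you can open it with any archive tool. Inside is a word folder, and the document.xml file in it holds the body of the Word document.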
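The same inspection can be done without renaming the file, since a .docx is an ordinary zip archive. Below is a minimal sketch using only the JDK's java.util.zip. The class name DocxXmlDump and the method readDocumentXml are made up for this illustration, and the sketch assumes document.xml is UTF-8 encoded, which is what Word normally writes.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.zip.ZipEntry;
import java.util.zip.ZipInputStream;

// Hypothetical helper class, not part of POI.
class DocxXmlDump {

    /** Returns the text of word/document.xml, or null if the archive has no such entry. */
    static String readDocumentXml(InputStream docx) throws IOException {
        try (ZipInputStream zip = new ZipInputStream(docx)) {
            ZipEntry entry;
            while ((entry = zip.getNextEntry()) != null) {
                if ("word/document.xml".equals(entry.getName())) {
                    // readAllBytes() stops at the end of the current zip entry
                    // (assumption: the part is UTF-8 encoded)
                    return new String(zip.readAllBytes(), StandardCharsets.UTF_8);
                }
            }
        }
        return null;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path to a .docx file
        try (InputStream in = Files.newInputStream(Path.of(args[0]))) {
            System.out.println(readDocumentXml(in));
        }
    }
}
```

Word writes the XML on very few lines, so it is usually easier to read after running the output through an XML formatter.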

 

For the Word table pictured above, whose cells are merged across rows (a vertical merge), the corresponding XML is:

<w:tbl>
      <w:tblPr>
        <w:tblStyle w:val="a3"/>
        <w:tblW w:w="0" w:type="auto"/>
        <w:tblLook w:val="04A0" w:firstRow="1" w:lastRow="0" w:firstColumn="1" w:lastColumn="0" w:noHBand="0" w:noVBand="1"/>
      </w:tblPr>
      <w:tblGrid>
        <w:gridCol w:w="2765"/>
        <w:gridCol w:w="2765"/>
      </w:tblGrid>
      <w:tr w:rsidR="00151AA4" w:rsidTr="000249EF">
        <w:tc>
          <w:tcPr>
            <w:tcW w:w="2765" w:type="dxa"/>
            <w:vMerge w:val="restart"/>
          </w:tcPr>
          <w:p w:rsidR="00151AA4" w:rsidRDefault="00151AA4" w:rsidP="00915802">
            <w:r>
              <w:rPr>
                <w:rFonts w:hint="eastAsia"/>
              </w:rPr>
              <w:t>0,0</w:t>
            </w:r>
          </w:p>
        </w:tc>
        <w:tc>
          <w:tcPr>
            <w:tcW w:w="2765" w:type="dxa"/>
          </w:tcPr>
          <w:p w:rsidR="00151AA4" w:rsidRDefault="00151AA4">
            <w:r>
              <w:rPr>
                <w:rFonts w:hint="eastAsia"/>
              </w:rPr>
              <w:t>0,1</w:t>
            </w:r>
          </w:p>
        </w:tc>
      </w:tr>
      <w:tr w:rsidR="00151AA4" w:rsidTr="000249EF">
        <w:tc>
          <w:tcPr>
            <w:tcW w:w="2765" w:type="dxa"/>
            <w:vMerge/>
          </w:tcPr>
          <w:p w:rsidR="00151AA4" w:rsidRDefault="00151AA4"/>
        </w:tc>
        <w:tc>
          <w:tcPr>
            <w:tcW w:w="2765" w:type="dxa"/>
          </w:tcPr>
          <w:p w:rsidR="00151AA4" w:rsidRDefault="00151AA4">
            <w:r>
              <w:rPr>
                <w:rFonts w:hint="eastAsia"/>
              </w:rPr>
              <w:t>1,1</w:t>
            </w:r>
            <w:bookmarkStart w:id="0" w:name="_GoBack"/>
            <w:bookmarkEnd w:id="0"/>
          </w:p>
        </w:tc>
      </w:tr>
    </w:tbl>

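With this in front of you, the tc, tcPr and vMerge names from the earlier code should start to make sense.

Here w:tr is one table row, w:tc is one cell, and w:tcPr holds the properties of that cell.

For details, see: http://www.datypic.com/sc/ooxml/e-w_tbl-1.html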
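To see this structure from code without going through POI, the XML can be walked with the JDK's DOM parser. The following is only a sketch: the class VMergeInspector and its methods are invented for illustration, and the input must declare the w namespace prefix, which the real document.xml does on its root element (the fragment shown above omits it). It reports, for each w:tc in document order, whether the cell starts a vertical merge, continues one, or is not merged.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical helper class, not part of POI.
class VMergeInspector {

    // Namespace bound to the "w" prefix in word/document.xml.
    static final String W = "http://schemas.openxmlformats.org/wordprocessingml/2006/main";

    /** Returns the first direct child element in the w namespace with this local name, or null. */
    static Element child(Element parent, String localName) {
        for (Node n = parent.getFirstChild(); n != null; n = n.getNextSibling()) {
            if (n instanceof Element && W.equals(n.getNamespaceURI())
                    && localName.equals(n.getLocalName())) {
                return (Element) n;
            }
        }
        return null;
    }

    /** For every w:tc in document order: "restart", "continue", or "none". */
    static List<String> vMergeStates(String xml) throws Exception {
        DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
        factory.setNamespaceAware(true);
        Document doc = factory.newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));

        List<String> states = new ArrayList<>();
        NodeList cells = doc.getElementsByTagNameNS(W, "tc");
        for (int i = 0; i < cells.getLength(); i++) {
            Element tcPr = child((Element) cells.item(i), "tcPr");
            Element vMerge = tcPr == null ? null : child(tcPr, "vMerge");
            if (vMerge == null) {
                states.add("none");
            } else {
                // A bare <w:vMerge/> has no w:val and continues the merge started above it.
                String val = vMerge.getAttributeNS(W, "val");
                states.add("restart".equals(val) ? "restart" : "continue");
            }
        }
        return states;
    }
}
```

The parser is used with its default settings, so this is meant for documents you trust.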

 

Next, here is the column-merge case (cells merged across columns), which you can use to check your understanding:

The corresponding XML:

<w:tbl>
      <w:tblPr>
        <w:tblStyle w:val="a3"/>
        <w:tblW w:w="0" w:type="auto"/>
        <w:tblLook w:val="04A0" w:firstRow="1" w:lastRow="0" w:firstColumn="1" w:lastColumn="0" w:noHBand="0" w:noVBand="1"/>
      </w:tblPr>
      <w:tblGrid>
        <w:gridCol w:w="2765"/>
        <w:gridCol w:w="2765"/>
      </w:tblGrid>
      <w:tr w:rsidR="006C0A9A" w:rsidTr="006C099A">
        <w:tc>
          <w:tcPr>
            <w:tcW w:w="5530" w:type="dxa"/>
            <w:gridSpan w:val="2"/>
          </w:tcPr>
          <w:p w:rsidR="006C0A9A" w:rsidRDefault="006C0A9A">
            <w:r>
              <w:rPr>
                <w:rFonts w:hint="eastAsia"/>
              </w:rPr>
              <w:t>0,0</w:t>
            </w:r>
          </w:p>
        </w:tc>
      </w:tr>
      <w:tr w:rsidR="006C0A9A" w:rsidTr="000249EF">
        <w:tc>
          <w:tcPr>
            <w:tcW w:w="2765" w:type="dxa"/>
          </w:tcPr>
          <w:p w:rsidR="006C0A9A" w:rsidRDefault="006C0A9A">
            <w:r>
              <w:rPr>
                <w:rFonts w:hint="eastAsia"/>
              </w:rPr>
              <w:t>1,0</w:t>
            </w:r>
          </w:p>
        </w:tc>
        <w:tc>
          <w:tcPr>
            <w:tcW w:w="2765" w:type="dxa"/>
          </w:tcPr>
          <w:p w:rsidR="006C0A9A" w:rsidRDefault="006C0A9A">
            <w:r>
              <w:rPr>
                <w:rFonts w:hint="eastAsia"/>
              </w:rPr>
              <w:t>1,1</w:t>
            </w:r>
          </w:p>
        </w:tc>
      </w:tr>
    </w:tbl>

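From these observations we can summarize as follows (using the methods POI provides):

Row merge (cells merged across rows, w:vMerge):
CTTcPr tcpr = tables.get(0).getRow(2).getCell(0).getCTTc().getTcPr(); // the cell's properties (tableCell.cellProperty); Word writes this for every cell, but getTcPr() returns null if it is absent, so check before use
First cell of a row merge: tcpr.getVMerge().getVal() == STMerge.RESTART (its toString() is "restart"; compare strings with equals(), not ==)
Later cells of the same row merge: tcpr.getVMerge() != null and tcpr.getVMerge().getVal() == null (Word writes a bare <w:vMerge/>); files that set the value explicitly, such as the generated example above, give STMerge.CONTINUE instead
Cell that is not part of a row merge: tcpr.getVMerge() == null

Column merge (cells merged across columns, w:gridSpan):
CTTcPr tcpr = tables.get(0).getRow(2).getCell(0).getCTTc().getTcPr();
Merged cell: tcpr.getGridSpan().getVal() returns the number of columns the cell spans (as a BigInteger)
Other cells: tcpr.getGridSpan() == null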
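These rules are enough to emit HTML with rowspan and colspan attributes. The sketch below is deliberately POI-free so that it runs on its own, and it is not the code of the demo linked below. Cell, VMerge and MergedTableHtml are hypothetical names; the assumption is that you fill each Cell from POI first, mapping getVMerge() to RESTART, CONTINUE or NONE and getGridSpan().getVal().intValue() to gridSpan (1 when getGridSpan() is null). Because a cell with gridSpan 2 occupies two grid columns, the code tracks the grid column rather than the cell index when it looks for continuation cells in the rows below.

```java
import java.util.List;

// Hypothetical helper class, not part of POI.
class MergedTableHtml {

    enum VMerge { NONE, RESTART, CONTINUE }

    /** POI-free stand-in for one w:tc element. */
    static final class Cell {
        final String text;
        final VMerge vMerge;
        final int gridSpan; // 1 when w:gridSpan is absent

        Cell(String text, VMerge vMerge, int gridSpan) {
            this.text = text;
            this.vMerge = vMerge;
            this.gridSpan = gridSpan;
        }
    }

    /** Returns the cell that starts at the given grid column, or null if none does. */
    static Cell cellAtGridColumn(List<Cell> row, int gridCol) {
        int col = 0;
        for (Cell cell : row) {
            if (col == gridCol) {
                return cell;
            }
            col += cell.gridSpan;
        }
        return null;
    }

    static String escape(String s) {
        return s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;");
    }

    static String toHtml(List<List<Cell>> rows) {
        StringBuilder html = new StringBuilder("<table>");
        for (int r = 0; r < rows.size(); r++) {
            html.append("<tr>");
            int col = 0; // grid column where the current cell starts
            for (Cell cell : rows.get(r)) {
                // Continuation cells are covered by the rowspan of the cell above them.
                if (cell.vMerge != VMerge.CONTINUE) {
                    int rowspan = 1;
                    if (cell.vMerge == VMerge.RESTART) {
                        // Count the continuation cells directly below, in the same grid column.
                        for (int below = r + 1; below < rows.size(); below++) {
                            Cell c = cellAtGridColumn(rows.get(below), col);
                            if (c == null || c.vMerge != VMerge.CONTINUE) {
                                break;
                            }
                            rowspan++;
                        }
                    }
                    html.append("<td");
                    if (rowspan > 1) {
                        html.append(" rowspan=\"").append(rowspan).append('"');
                    }
                    if (cell.gridSpan > 1) {
                        html.append(" colspan=\"").append(cell.gridSpan).append('"');
                    }
                    html.append('>').append(escape(cell.text)).append("</td>");
                }
                col += cell.gridSpan;
            }
            html.append("</tr>");
        }
        return html.append("</table>").toString();
    }
}
```

Nested tables and the w:hMerge markers produced by the generated example above are outside what this sketch handles.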

 

Here is a demo that reads table content and converts it to HTML, for reference: https://github.com/zavier/ReadWordTable

 

You are also welcome to follow my new blog: https://zhengw-tech.com/
