Extracting information from Word files

Date: 2020-06-19


Reading doc and docx files

Open the file with Office or WPS and use Save As to convert it

For doc files, use antiword to extract the text

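2.1 Installation

Windows

1. Download the zip package from http://www.winfield.demon.nl/
2. Extract it to a directory of your choice and add that directory to the system PATH environment variable
3. Run the antiword command from the command line to process doc files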

Linux command line

1. wget http://www.winfield.demon.nl/linux/antiword-0.37.tar.gz
2. tar -zxvf antiword-0.37.tar.gz
3. cd antiword-0.37
4. make && make install
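After either installation route, it can help to confirm that the antiword executable is actually reachable through PATH before calling it from a script. The snippet below is a minimal sketch using only the Python standard library; the helper name find_antiword is hypothetical and not part of antiword itself.

```python
import shutil


def find_antiword(executable="antiword"):
    """Return the full path of the executable, or None if it is not on PATH.

    find_antiword is an illustrative helper name, not part of antiword.
    """
    return shutil.which(executable)


if __name__ == "__main__":
    location = find_antiword()
    if location is None:
        print("antiword was not found on PATH; recheck the installation steps")
    else:
        print("antiword found at", location)
```

If the lookup returns None on Windows, the usual cause is that the extraction directory was not added to PATH or the terminal was not restarted afterwards.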

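2.2 Usage

antiword -t xxx.doc    print the contents of the file as plain text
antiword -f xxx.doc    print the contents of the file as formatted text
antiword -f xxx.doc >> xxx.txt    read the doc file and save the output to a txt file (>> appends; use > to overwrite)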
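The same commands can be driven from Python through the standard subprocess module, which is convenient when doc files are processed alongside docx files. This is a rough sketch rather than an official antiword binding: the function names build_antiword_command and doc_to_text are hypothetical, and decoding the output as UTF-8 is an assumption that depends on the locale and the character mapping antiword is using.

```python
import subprocess


def build_antiword_command(doc_path, formatted=False):
    """Build the antiword argument list: -f for formatted text, -t for plain text."""
    flag = "-f" if formatted else "-t"
    return ["antiword", flag, doc_path]


def doc_to_text(doc_path, formatted=False):
    """Run antiword on doc_path and return what it prints to standard output."""
    completed = subprocess.run(
        build_antiword_command(doc_path, formatted),
        capture_output=True,
        check=True,  # raise CalledProcessError if antiword exits with an error
    )
    # Assumption: the output is UTF-8; bytes that cannot be decoded are replaced.
    return completed.stdout.decode("utf-8", errors="replace")
```

Calling doc_to_text("xxx.doc") requires antiword to be installed and on PATH; passing the arguments as a list avoids shell quoting problems with file names that contain spaces.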

For docx files, use the python-docx package

pip install python-docx

import os

import docx  # provided by the python-docx package


def trans_docx_txt(path):
    '''
    fun: convert a docx file to a txt file stored next to it
    '''
    # splitext strips only the final extension, so other dots in the path survive
    newpath = os.path.splitext(path)[0] + ".txt"
    # If the output file already exists, delete it first
    if os.path.exists(newpath):
        os.remove(newpath)
    # Skip empty files
    if os.path.getsize(path) == 0:
        return

    document = docx.Document(path)
    # Open the output file once instead of reopening it for every write
    with open(newpath, "w", encoding="utf-8") as out:
        # Write every non-empty paragraph
        for paragraph in document.paragraphs:
            if paragraph.text:
                out.write(paragraph.text + "\n")
        # If the document contains tables, write the text of each cell
        for table in document.tables:
            row_count = len(table.rows)
            colu_count = len(table.columns)
            for i in range(row_count):
                for j in range(colu_count):
                    out.write(table.cell(i, j).text + "\n")
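Since doc files go through antiword and docx files go through python-docx, a batch job needs a way to decide which tool applies to each file. The following is a small sketch under the assumption that the file extension is a reliable indicator; pick_extractor and find_word_files are hypothetical helper names introduced only for illustration.

```python
import os


def pick_extractor(path):
    """Return the tool that handles this file, chosen by extension."""
    extension = os.path.splitext(path)[1].lower()
    if extension == ".doc":
        return "antiword"
    if extension == ".docx":
        return "python-docx"
    return None  # not a Word file


def find_word_files(directory):
    """Yield (path, tool) pairs for every doc or docx file under directory."""
    for root, _dirs, files in os.walk(directory):
        for name in sorted(files):
            tool = pick_extractor(name)
            if tool is not None:
                yield os.path.join(root, name), tool
```

A caller could loop over find_word_files("some_dir") and pass docx paths to trans_docx_txt while sending doc paths to the antiword command line.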
