Integrating Baidu Brain Table Text Recognition to Quickly Reduce the Cost of Digitizing Information

Date: 2019-11-20


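Table text recognition can read paper registration forms (personal details, product records, public notices, and so on) and quickly turn their contents into electronic form. The output can then be used to organize and tabulate the registered information in a structured way, which greatly reduces the manual data-entry cost of digitization and makes information management more convenient.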

I. Platform access

This step is straightforward, so I will not go into detail. See the earlier post:

https://ai.baidu.com/forum/topic/show/943162

II. Analyzing the API documentation

1. Open the API documentation page and review the interface requirements

https://ai.baidu.com/docs#/OCR-API/87932804

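(1) Interface description

The interface extracts and recognizes the text in a table image and returns structured output for the table header, the table footer, and every cell. It supports regular tables as well as tables with merged cells, and the result can be returned as either JSON or Excel.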

(2) Request description

The information needed:

Request URL: https://aip.baidubce.com/rest/2.0/solution/v1/form_ocr/request

Header: Content-Type: application/x-www-form-urlencoded

The request parameters go in the body. Parameter details are as follows:

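This is an asynchronous interface split into two APIs: one that submits the request and one that fetches the result. The key parameter is is_sync. When it is "false", the recognition result has to be retrieved through the result-fetching API; when it is "true", the result is returned synchronously and the second call is unnecessary. Since one call is simpler than two, just set this parameter to "true".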
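As a minimal sketch of the request described above (not taken from the Baidu documentation), the helper below assembles the URL, header, and form-encoded body without sending anything. The function name `build_form_ocr_request` and the placeholder token are hypothetical; the parameter names `image`, `is_sync`, and `request_type` are the ones used in the full source code later in this article.

```python
import base64
import urllib.parse
import urllib.request

FORM_OCR_URL = "https://aip.baidubce.com/rest/2.0/solution/v1/form_ocr/request"


def build_form_ocr_request(image_bytes, access_token, is_sync=True, request_type="json"):
    """Assemble (but do not send) a table-recognition request."""
    # The body is form-encoded; the image travels as base64 text.
    body = urllib.parse.urlencode({
        "image": base64.b64encode(image_bytes),
        "is_sync": "true" if is_sync else "false",
        "request_type": request_type,  # 'json' or 'excel'
    }).encode("utf-8")
    # The access_token is passed in the query string, as in the article's code.
    url = FORM_OCR_URL + "?" + urllib.parse.urlencode({"access_token": access_token})
    request = urllib.request.Request(url=url, data=body, method="POST")
    request.add_header("Content-Type", "application/x-www-form-urlencoded")
    return request


# Hypothetical inputs, for illustration only.
req = build_form_ocr_request(b"fake image bytes", "HYPOTHETICAL_TOKEN")
print(req.full_url)
print(req.get_header("Content-type"))
```

Passing the returned object to `urllib.request.urlopen` would send it. Keeping construction separate from sending makes the parameters easy to inspect before any quota is spent.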

(3) Response parameters

Sample response

{"result":
  {"result_data": "http://bj.bcebos.com/v1/ai-edgecloud/4F00EC7AED4E4827BD517CB105E56DEB?authorization=bce-auth-v1%2Ff86a2044998643b5abc89b59158bad6d%2F2019-08-10T07%3A28%3A13Z%2F172800%2F%2F374c64232876bcbe78a54105e438a97376f530788e5386e04f67d0cba4935f3d",
   "ret_msg": "\xe5\xb7\xb2\xe5\xae\x8c\xe6\x88\x90",
   "percent": 100,
   "ret_code": 3},
 "log_id": 1565422091617865}

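To illustrate how the fields in the sample response could be consumed, here is a hedged sketch. It assumes, based only on the sample above, that `ret_code` 3 marks a finished job (the `ret_msg` bytes decode to 已完成, "completed"); other codes are not covered here. The sample prints `ret_msg` as escaped UTF-8 bytes, which is not valid JSON as written, so the simulated responses below use the decoded text. The `error_code` / `error_msg` check, the names `extract_result` and `wait_for_result`, and the `fetch` callback are assumptions: `fetch` stands in for whatever call retrieves the latest status when `is_sync` is "false", and the in-progress payload is made up for the demonstration.

```python
import json
import time

# The escaped ret_msg bytes in the sample are UTF-8 for "completed".
assert b"\xe5\xb7\xb2\xe5\xae\x8c\xe6\x88\x90".decode("utf-8") == "已完成"


def extract_result(response_text):
    """Return result_data if the job is finished, otherwise None."""
    payload = json.loads(response_text)
    if "error_code" in payload:  # assumed error shape
        raise RuntimeError(f"{payload['error_code']}: {payload.get('error_msg', '')}")
    result = payload["result"]
    # Assumption drawn from the sample: ret_code 3 means the job is done.
    if result.get("ret_code") == 3:
        return result["result_data"]
    return None


def wait_for_result(fetch, interval=1.0, max_tries=10):
    """Call fetch() until extract_result() yields data or the tries run out."""
    for _ in range(max_tries):
        data = extract_result(fetch())
        if data is not None:
            return data
        time.sleep(interval)
    raise TimeoutError("recognition did not finish in time")


# Simulated status responses: one unfinished (made up), then a finished job.
responses = iter([
    '{"result": {"ret_code": 2, "percent": 50}, "log_id": 1}',
    '{"result": {"result_data": "http://example.com/result.xls", '
    '"ret_msg": "已完成", "percent": 100, "ret_code": 3}, "log_id": 2}',
])
print(wait_for_result(lambda: next(responses), interval=0))
```

With `is_sync` set to "true" the polling loop is unnecessary, and `extract_result` alone can be applied to the single response.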
2. Obtain an access_token and call the API

# encoding:utf-8
import base64
import urllib.parse
import urllib.request

request_url = "https://aip.baidubce.com/rest/2.0/solution/v1/form_ocr/request"

# Open the image file in binary mode and base64-encode it
with open('[local file]', 'rb') as f:
    img = base64.b64encode(f.read())

# The encoded image is sent in the "image" field of the form body
params = {"image": img}
params = urllib.parse.urlencode(params).encode('utf-8')
access_token = '[token obtained from the authentication API]'
request_url = request_url + "?access_token=" + access_token
request = urllib.request.Request(url=request_url, data=params)
request.add_header('Content-Type', 'application/x-www-form-urlencoded')
response = urllib.request.urlopen(request)
content = response.read()
if content:
    print(content)

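III. Recognition results

Recognition results:

Conclusions:

Accuracy: tested on complex tables in several different layouts, the results were fairly accurate and can greatly reduce data-entry work.

Speed: each image took 3-5 s to process, which is acceptable.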

IV. Source code

#!/usr/bin/env python
# -*- coding: utf-8 -*-
import base64
import json
import time
import urllib.parse
import urllib.request

# client_id is the AK and client_secret is the SK, both obtained from the official website
client_id = '*******************'
client_secret = '*********************'



# Get the access token
def get_token():
    host = 'https://aip.baidubce.com/oauth/2.0/token?grant_type=client_credentials&client_id=' + client_id + '&client_secret=' + client_secret
    request = urllib.request.Request(host)
    request.add_header('Content-Type', 'application/json; charset=UTF-8')
    response = urllib.request.urlopen(request)
    token_content = response.read()
    if not token_content:
        # Fail loudly instead of returning an unbound variable
        raise RuntimeError('empty response from the token endpoint')
    token_info = json.loads(token_content.decode("utf-8"))
    return token_info['access_token']



# Read the image file as bytes
def get_file_content(filePath):
    with open(filePath, 'rb') as fp:
        return fp.read()





# Recognize the table in the image and save the result as an Excel file
# (the function name is kept as in the original source)
def get_license_plate(path):
    request_url = "https://aip.baidubce.com/rest/2.0/solution/v1/form_ocr/request"

    f = get_file_content(path)
    access_token = get_token()
    print(access_token)
    img = base64.b64encode(f)
    # For JSON output instead, use:
    # params = {"image": img, "is_sync": 'true', "request_type": 'json'}
    params = {"image": img, "is_sync": 'true', "request_type": 'excel'}
    params = urllib.parse.urlencode(params).encode('utf-8')
    request_url = request_url + "?access_token=" + access_token
    # time.clock() was removed in Python 3.8; perf_counter() measures elapsed time
    tic = time.perf_counter()
    request = urllib.request.Request(url=request_url, data=params)
    request.add_header('Content-Type', 'application/x-www-form-urlencoded')
    response = urllib.request.urlopen(request)
    content = response.read()
    toc = time.perf_counter()
    print('Processing time: %.2f s' % (toc - tic))
    if content:
        print(content)
        license_plates = json.loads(content.decode("utf-8"))
        # result_data is used here as the download URL of the Excel file
        excel_url = license_plates['result']['result_data']
        excel = urllib.request.urlopen(excel_url)
        with open("sbg.xls", "wb") as code:
            code.write(excel.read())
        return content
    else:
        return ''



# Raw string so the backslashes in the Windows path are not treated as escapes
image_path = r'F:\paddle\sbg\s6.jpg'
get_license_plate(image_path)
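
The code above requests a fresh access_token on every call. As an optional refinement, here is a sketch of caching the token, assuming the token response carries an `expires_in` field in seconds, as OAuth 2.0 token responses normally do. `TokenCache` and `fetch_token_info` are hypothetical names: in practice `fetch_token_info` would wrap the HTTP call made in `get_token` and return the decoded JSON dictionary rather than only the token string.

```python
import time


class TokenCache:
    """Reuse an access_token until shortly before it expires."""

    def __init__(self, fetch_token_info, margin=60, clock=time.monotonic):
        self._fetch = fetch_token_info  # returns {"access_token": ..., "expires_in": ...}
        self._margin = margin           # refresh this many seconds early
        self._clock = clock
        self._token = None
        self._expires_at = 0.0

    def get(self):
        now = self._clock()
        if self._token is None or now >= self._expires_at:
            info = self._fetch()
            self._token = info["access_token"]
            self._expires_at = now + info.get("expires_in", 0) - self._margin
        return self._token


# Demonstration with a stand-in fetcher instead of a real HTTP call.
calls = []


def fake_fetch():
    calls.append(1)
    return {"access_token": "HYPOTHETICAL_TOKEN", "expires_in": 3600}


cache = TokenCache(fake_fetch)
cache.get()
cache.get()
print(len(calls))  # the stand-in fetcher ran only once
```

When many images are processed in a batch, this saves one round trip per image.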

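V. Feedback and suggestions

1. Overall recognition is good, but accuracy still has room to improve and details could be handled better. For example, in complex tables text sometimes ends up in the wrong row, and individual characters are dropped or misrecognized.

2. Handwritten text in tables is recognized poorly. I suggest adding support for handwritten input.

Author: wangwei8638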