Java OCR with Tesseract: Recognition Code Implementation

Date: 2020-01-30


Introduction to Tesseract OCR

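HP Labs began developing the Tesseract OCR engine in 1985, and by 1995 it was one of the three most accurate recognition engines in the OCR industry. Shortly afterwards, however, HP decided to drop its OCR business, and Tesseract was shelved.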

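Some years later HP realised that, rather than leaving Tesseract on the shelf, it would be better to contribute it to the open-source community and give it a new lease of life. In 2005 Tesseract was taken over by the Information Science Research Institute at the University of Nevada, Las Vegas, which turned to Google to improve it, fix bugs, and optimise it.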

After fixing the most serious bugs, Google considered Tesseract OCR stable enough to be re-released as open-source software.

Tesseract GitHub repository

1. Installing and testing on Windows

Install Microsoft Visual C++ 2015, choosing the 32-bit or 64-bit download to match your operating system.
Set the environment variable TESSDATA_PREFIX to your Tesseract path, for example D:\abc\def\Tesseract-OCR

If VC++ 2015 is not installed, you will see errors such as gs not being found.

If the environment variable is not set, running Tesseract fails with the error: Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory

If installing VC++ 2015 fails with 0x80240017 (unspecified error), refer to the linked fix.
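The TESSDATA_PREFIX requirement above can be checked from Java before any OCR is attempted. The following is a minimal sketch, assuming the layout described by the error message (the variable points at the parent of a "tessdata" directory, and each language is stored as `<language>.traineddata`). The class and method names are hypothetical and not part of the original code; newer Tesseract releases may interpret the variable differently.

```java
import java.io.File;

class TessdataCheck {

    /**
     * Returns true when prefix is a directory containing
     * tessdata/<language>.traineddata (assumed layout).
     */
    static boolean hasLanguage(String prefix, String language) {
        if (prefix == null || prefix.isEmpty()) {
            return false;
        }
        File tessdata = new File(prefix, "tessdata");
        File trained = new File(tessdata, language + ".traineddata");
        return trained.isFile();
    }

    public static void main(String[] args) {
        // Read the same variable that the tesseract binary reads
        String prefix = System.getenv("TESSDATA_PREFIX");
        System.out.println("TESSDATA_PREFIX=" + prefix
                + ", chi_sim available: " + hasLanguage(prefix, "chi_sim"));
    }
}
```

Running the check once at startup gives a clearer message than waiting for the tesseract process to fail.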

Test from the command line

:: cd to the directory that contains the image

tesseract sourcefile.jpg savename -l chi_sim

2. Installing and testing on macOS

Installing tesseract with Homebrew is enough; brew installs the dependencies automatically.

brew install tesseract

Test from the command line

tesseract sourcefile.jpg savename -l chi_sim
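The same kind of command-line check can also be driven from Java. The sketch below is a hypothetical helper (`CommandRunner` is not part of the original code) that starts a command, merges stderr into stdout, reads all output before waiting for the exit code, and returns both. Reading the output first matters because a child process can block once its output pipe fills up.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.util.List;

class CommandRunner {

    /** Exit code plus everything the child wrote to stdout and stderr. */
    static final class Result {
        final int exitCode;
        final String output;

        Result(int exitCode, String output) {
            this.exitCode = exitCode;
            this.output = output;
        }
    }

    static Result run(List<String> command) {
        ProcessBuilder pb = new ProcessBuilder(command);
        pb.redirectErrorStream(true); // merge stderr into stdout
        try {
            Process process = pb.start();
            StringBuilder out = new StringBuilder();
            // Drain the output first so the child never blocks on a full pipe
            try (BufferedReader reader = new BufferedReader(
                    new InputStreamReader(process.getInputStream()))) {
                String line;
                while ((line = reader.readLine()) != null) {
                    out.append(line).append('\n');
                }
            }
            int exitCode = process.waitFor();
            return new Result(exitCode, out.toString());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // keep the interrupt visible to callers
            throw new IllegalStateException("interrupted while waiting for " + command, e);
        }
    }
}
```

For example, `CommandRunner.run(Arrays.asList("tesseract", "--version"))` should return exit code 0 when tesseract is on the PATH, and throws `UncheckedIOException` when the binary cannot be started.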

3. Java environment

3.1 Development environment

JDK 1.8
IDEA 2017
JUnit 4.12
log4j2
Maven 3.x

3.2 pom.xml

If your project is not a Maven project, skip this step.

<properties>
      <java.version>1.8</java.version>
      <log4j2.version>2.1</log4j2.version>
   </properties>

   <dependencies>
      <dependency>
         <groupId>junit</groupId>
         <artifactId>junit</artifactId>
         <version>4.12</version>
         <scope>test</scope>
      </dependency>

      <!-- Logging -->
      <dependency>
         <groupId>org.apache.logging.log4j</groupId>
         <artifactId>log4j-core</artifactId>
         <version>${log4j2.version}</version>
      </dependency>
      <dependency>
         <groupId>org.apache.logging.log4j</groupId>
         <artifactId>log4j-api</artifactId>
         <version>${log4j2.version}</version>
      </dependency>
      <dependency>
         <groupId>org.apache.logging.log4j</groupId>
         <artifactId>log4j-web</artifactId>
         <version>${log4j2.version}</version>
      </dependency>
      <dependency>
         <groupId>org.apache.logging.log4j</groupId>
         <artifactId>log4j-slf4j-impl</artifactId>
         <version>${log4j2.version}</version>
      </dependency>
      <dependency>
         <groupId>org.apache.logging.log4j</groupId>
         <artifactId>log4j-jcl</artifactId>
         <version>${log4j2.version}</version>
      </dependency>

      <dependency>
         <groupId>commons-lang</groupId>
         <artifactId>commons-lang</artifactId>
         <version>2.6</version>
      </dependency>

   </dependencies>

3.3 log4j2.xml

For log4j2 configuration details, see: "Upgrading from Log4j 1 to Log4j 2 in Practice"

<?xml version="1.0" encoding="UTF-8"?>

<Configuration status="warn" monitorInterval="600">

    <Properties>
        
        <property name="BASE_LOG_PATTERN">%5p %d{yyyy-MM-dd HH:mm:ss.SSS} [%t] (%class{36}:%L) %M - %m%n</property>
        <property name="LOG_DIR_HOME">logs</property>
        <property name="BASE_LOG_FILENAME">nuna-ocr</property>
    </Properties>

    <Appenders>
        <Console name="Console" target="SYSTEM_OUT">
            <ThresholdFilter level="trace" onMatch="ACCEPT" onMismatch="DENY"/>
            <PatternLayout pattern="${BASE_LOG_PATTERN}" />
        </Console>

        <RollingRandomAccessFile name="stdout_appender"
                                 immediateFlush="true" fileName="${LOG_DIR_HOME}/${BASE_LOG_FILENAME}-stdout.log"
                                 filePattern="${LOG_DIR_HOME}/${BASE_LOG_FILENAME}-stdout-%d{yyyy-MM-dd}_%i.log.gz">
            <PatternLayout>
                <pattern>${BASE_LOG_PATTERN}</pattern>
            </PatternLayout>
            <Policies>
                <TimeBasedTriggeringPolicy modulate="true" interval="1"/>
                <SizeBasedTriggeringPolicy size="5120 KB"/>
            </Policies>

            <DefaultRolloverStrategy max="3"/>

        </RollingRandomAccessFile>

        <RollingRandomAccessFile name="error_appender"
                                 immediateFlush="true" fileName="${LOG_DIR_HOME}/${BASE_LOG_FILENAME}-error.log"
                                 filePattern="${LOG_DIR_HOME}/${BASE_LOG_FILENAME}-error-%d{yyyy-MM-dd}_%i.log.gz">
            <PatternLayout>
                <pattern>${BASE_LOG_PATTERN}</pattern>
            </PatternLayout>
            <Policies>
                <TimeBasedTriggeringPolicy modulate="true" interval="1"/>
                <SizeBasedTriggeringPolicy size="5120 KB"/>
            </Policies>
            <Filters>
                <ThresholdFilter level="warn" onMatch="ACCEPT" onMismatch="DENY"/>
            </Filters>
            <DefaultRolloverStrategy max="3"/>
        </RollingRandomAccessFile>
    </Appenders>

    <Loggers>
        <logger name="com.liu.app.ocr" level="debug" additivity="false" >
            <appender-ref ref="Console" />
            <appender-ref ref="stdout_appender" />
            <appender-ref ref="error_appender" />
        </logger>

        <root level="info" includeLocation="true">
            <appender-ref ref="Console" />
            <appender-ref ref="stdout_appender" />
            <appender-ref ref="error_appender"/>
        </root>
    </Loggers>

</Configuration>

4. Java implementation

import java.io.*;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

import org.apache.commons.lang.StringUtils;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class OcrProcess {

    private static Logger logger = LogManager.getLogger(OcrProcess.class);

    private String tessPath = "";

    private String ocr_psm_num = "3";

    private String ocr_language = "eng";

    private String ocr_result_path = "";

    private String ocr_result_file_path = "";

    private final String OS = System.getProperty("os.name");

    private final String ocr_lang_option = "-l";

    // Tesseract 3.x takes "-psm"; Tesseract 4.0 and later expect "--psm" instead
    public final String ocr_psm_option = "-psm";

    private long procID = 0;

    /**
     * Line separator used when writing text
     */
    private final String EOL = System.getProperty("line.separator");

    /**
     * Path separator of the current operating system
     */
    private final String FIS = System.getProperties().getProperty("file.separator");

    /**
     * Constructor
     * @param tessPath local tesseract path
     */
    public OcrProcess(String tessPath) {
        this.tessPath = tessPath;

        if(logger.isDebugEnabled()){
            logger.debug("OcrProcess Created .");
            logger.debug("OcrProcess Current OS >>> {} ",this.OS);
            logger.debug("OcrProcess User tessPath >>> {} ",this.tessPath);
        }else if(logger.isInfoEnabled()){
            logger.info("OcrProcess Created .");
        }
    }

    /**
     * Sets the page segmentation mode (0-10)
     * 0 = Orientation and script detection (OSD) only.
     * 1 = Automatic page segmentation with OSD.
     * 2 = Automatic page segmentation, but no OSD, or OCR.
     * 3 = Fully automatic page segmentation, but no OSD. (Default)
     * 4 = Assume a single column of text of variable sizes.
     * 5 = Assume a single uniform block of vertically aligned text.
     * 6 = Assume a single uniform block of text.
     * 7 = Treat the image as a single text line.  // for one horizontal line of text; can improve recognition of single-line content
     * 8 = Treat the image as a single word.
     * 9 = Treat the image as a single word in a circle.
     * 10 = Treat the image as a single character.
     * @param psm see the Tesseract documentation
     */
    public void setPageSegMode(Integer psm){

        if(logger.isDebugEnabled()){
            logger.debug("OcrProcess User set OCR Process PageSegMode >>> {}",psm);
        }else if(logger.isInfoEnabled()){
            logger.info("OcrProcess User set OCR Process PageSegMode >>> {}",psm);
        }

        if(psm == null){
            throw new IllegalArgumentException("param psm must not be null");
        }
        if(psm > 10 || psm < 0){
            throw new IllegalArgumentException("param psm must be between 0 and 10");
        }

        this.ocr_psm_num = String.valueOf(psm);
    }

    /**
     * Sets the output directory
     * @param savePath existing directory where the result .txt file is written
     */
    public void setSaveDir(String savePath) throws FileNotFoundException {

        if(logger.isDebugEnabled()){
            logger.debug("OcrProcess User set OCR Process Result TXT SavePath >>> {}",savePath);
        }else if(logger.isInfoEnabled()){
            logger.info("OcrProcess User set OCR Process Result TXT SavePath >>> {}",savePath);
        }

        File saveDir = new File(savePath);

        if(!saveDir.exists()){
            throw new FileNotFoundException("the savePath is not found!");
        }

        this.ocr_result_path = savePath;

    }

    /**
     *
     * @author alexliu
     * @date 2017-11-22 14:39:42
     * @Description runs OCR on a file
     * @param file the file to recognise
     * @param language language code, e.g. chi_sim, eng
     * @return path of the recognised text file
     * @throws NoSupportFileTypeException custom exception
     */
    public String doOCR(File file , String language) throws NoSupportFileTypeException {
        // Create an OCR run id so that logs and records are easier to trace
        //this.procID = OcrTools.createProcessId();
        this.procID = new Random().nextInt(1000);

        if(logger.isDebugEnabled()){
            logger.debug("OcrProcess Begin. The file >>> [{}], the procID is [{}] .",file.getName(),this.procID);
        }else if(logger.isInfoEnabled()){
            logger.info("OcrProcess Begin. The file >>> [{}], the procID is [{}] . See More info PLZ set logger Debug or Trace.",file.getName(),this.procID);
        }

        String textSavePath = "";

        if(logger.isDebugEnabled()){
            logger.debug("[{}] OCR process info : FilePath >>> {}",this.procID,file.getAbsolutePath());
            logger.debug("[{}] OCR process info : FileName >>> {}",this.procID,file.getName());
            logger.debug("[{}] OCR process info : FileSize >>> {} KB",this.procID,(file.length() / 1024f));
            logger.debug("[{}] OCR process info : OCR Language >>> {}",this.procID,language);
        }

        this.ocr_language = language;

        // Determine the file type (Integer, not int: getFileType may return null and unboxing would throw)
        Integer fileType = this.getFileType(file.getName());

        if(logger.isDebugEnabled()){
            logger.debug("[{}] OCR process info : FileType >>> {}",this.procID,fileType);
        }
        if(logger.isDebugEnabled()){
            logger.debug("[{}] OCR process info : The FileType is Image .",this.procID);
        }
        textSavePath = callTesseractCommand(file);

        return textSavePath;
    }

    /**
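     * Returns the output directory.
     * If the caller has invoked `setSaveDir`, OCR output goes to that directory.
     * Otherwise, OCR output goes to the same directory as the source file.
     * @param sourceFile the source file
     * @return the output directory path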
     */
    private String getOutPutDir(File sourceFile){

        if(StringUtils.isEmpty(this.ocr_result_path)){
            return sourceFile.getAbsolutePath().substring(0,sourceFile.getAbsolutePath().lastIndexOf(this.FIS));
        }

        return this.ocr_result_path;
    }

    /**
     * Determines the file type
     * @param fileName name of the file
     * @return the file type code
     * @throws NoSupportFileTypeException custom exception
     */
    private Integer getFileType(String fileName) throws NoSupportFileTypeException {

        // Decide from the file extension whether the file can be processed by OCR
        //Integer type = OcrTools.getFileType(fileName);

        Integer type = null;

        if(type == null){
            // NoSupportFileTypeException is a custom exception; put your own handling here
            //throw new NoSupportFileTypeException("["+fileName+"] , unrecognised file type.");
        }

        return type;
    }

    /**
     * Creates the ProcessBuilder
     * @param sourceFile the source file
     * @return a ProcessBuilder with its working directory set
     */
    private ProcessBuilder createProcessBuilder(File sourceFile){

        ProcessBuilder pb = new ProcessBuilder();

        if(this.OS.startsWith("Mac OS") || this.OS.startsWith("Linux")){

            // Set the working directory: on Linux and Mac OS use the tesseract directory,
            // because tesseract is not installed system-wide and is not in the environment variables
            pb.directory(new File(this.tessPath));
            if(logger.isDebugEnabled()){
                logger.debug("[{}] OCR process info : Set ProcessBuilder working dir >>> {}",this.procID,this.tessPath);
            }

        }else if(this.OS.startsWith("Windows")){
            // Set the working directory: on Windows use the directory of the file being processed
            pb.directory(sourceFile.getParentFile());
            if(logger.isDebugEnabled()){
                logger.debug("[{}] OCR process info : Set ProcessBuilder working dir >>> {}",this.procID,sourceFile.getParentFile());
            }
        }

        // Merge the error stream into standard output
        pb.redirectErrorStream(true);

        return pb;
    }

    /**
     * Builds the command line
     * @param sourceFile the source file
     * @return the command as a list of arguments
     */
    private List<String> createCommand(File sourceFile){

        List<String> cmd = new ArrayList<String>();

        // OCR output directory
        String result_out_dir = getOutPutDir(sourceFile);

        // Output file name passed to the OCR command, without extension
        String ocr_result_filename = sourceFile.getName().substring(0 , sourceFile.getName().lastIndexOf("."));

        // OCR output file path, with extension
        String result_out_filePath = result_out_dir + this.FIS + ocr_result_filename + ".txt";

        this.ocr_result_file_path = result_out_filePath;

        if(logger.isDebugEnabled()) {
            logger.debug("[{}] OCR process info : Result SaveDir >>> {}",this.procID,result_out_dir);
            logger.debug("[{}] OCR process info : Result SaveFilePath >>> {}",this.procID,result_out_filePath);
        }else if(logger.isInfoEnabled()){
            logger.info("[{}] OCR process info : Result SaveFilePath >>> {}",this.procID,result_out_filePath);
        }

        if(this.OS.startsWith("Mac OS") || this.OS.startsWith("Linux")){
            cmd.add("tesseract");
            // On Linux or Mac the working directory is the tesseract directory, so the source file must include its full path
            cmd.add(sourceFile.getAbsolutePath());

        }else if(this.OS.startsWith("Windows")){
            cmd.add(this.tessPath + this.FIS + "tesseract");
            // On Windows the working directory is the file's directory, so the source file is passed by name only
            cmd.add(sourceFile.getName());
        }

        cmd.add(result_out_dir + this.FIS + ocr_result_filename);
        cmd.add(this.ocr_psm_option);
        cmd.add(this.ocr_psm_num);
        cmd.add(this.ocr_lang_option);
        cmd.add(this.ocr_language);

        if(logger.isDebugEnabled()){
            logger.debug("[{}] OCR process info : The Command >>> {}",this.procID,cmd.toString());
        }else if(logger.isInfoEnabled()){
            logger.info("[{}] OCR process info : The Command >>> {}",this.procID,cmd.toString());
        }

        return cmd;
    }

    /**
     * Removes space characters from the recognised text
     * @param txtFilePath path of the OCR result file
     * @throws FileNotFoundException if the result file does not exist
     */
    private void processSpace(String txtFilePath) throws FileNotFoundException {

        File txtFile = new File(txtFilePath);

        if(!txtFile.exists()){
            throw new FileNotFoundException("OCR process result file does not exist!");
        }

        StringBuilder sb = new StringBuilder();

        // Read the file; try-with-resources closes the reader even on failure
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream(txtFile),"UTF-8"))) {
            String str;
            while ((str = br.readLine()) != null) {
                sb.append(str).append(this.EOL);
            }
        } catch (IOException e){
            logger.error("[{}] OCR process space read file failed .",this.procID,e);
            // Nothing was read, so leave the result file untouched
            return;
        }

        // Strip the spaces and write the file back
        try (BufferedWriter bw = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(txtFile),"UTF-8"))) {
            bw.write(sb.toString().replaceAll(" ", ""));
        } catch (IOException e){
            logger.error("[{}] OCR process space write file failed .",this.procID,e);
        }
    }

    /**
     * Logs the output of a failed command
     * @param process the finished process
     */
    private void printCommandError(Process process){
        // stderr was merged into stdout, so the input stream carries both
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(process.getInputStream()))) {
            String line;
            // Read until the stream is exhausted
            while ((line = br.readLine()) != null) {
                logger.warn("[{}] OCR process warning : {} ",this.procID,line);
            }
        } catch (IOException e){
            logger.warn("[{}] OCR process print command error failed!",this.procID,e);
        }
    }

    /**
     *
     * @author alexliu
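     * @date 2017-11-22 14:47:25
     * @Description runs OCR through the command line
     * @param imageFile the image to recognise
     * @return path of the recognised text file
     * @throws RuntimeException if tesseract exits with a non-zero code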
     */
    private String callTesseractCommand(File imageFile) {

        String txt_path = "";

        List<String> command = this.createCommand(imageFile);

        ProcessBuilder pb = this.createProcessBuilder(imageFile);

        // Attach the command line
        pb.command(command);

        if(logger.isDebugEnabled()){
            logger.debug("[{}] OCR process info : Add command to ProcessBuilder",this.procID);
        }

        Process process = null;
        try {

            if(logger.isDebugEnabled()){
                logger.debug("[{}] OCR process info : Execute OCR Command.",this.procID);
            }
            process = pb.start();

            int w = process.waitFor();
            if(logger.isDebugEnabled()){
                logger.debug("[{}] OCR process info : Command execute result >>> {}",this.procID,w);
            }else if(logger.isInfoEnabled()){
                logger.info("[{}] OCR process info : Command execute result >>> {}",this.procID,w);
            }

            if (w == 0) {
                txt_path = this.ocr_result_file_path;
                // Strip spaces from the result
                this.processSpace(txt_path);
            } else {
                // Log the command's error output
                this.printCommandError(process);
                String msg = "[%s] execute command failed ! Result %d , Reason : %s !";
                switch (w) {
                    case 1:
                        msg = String.format(msg,this.procID,w,"cannot access the file; the file name may contain spaces or other special characters");
                        break;
                    case 29:
                        msg = String.format(msg,this.procID,w,"cannot recognise the image or its selected region");
                        break;
                    case 31:
                        msg = String.format(msg,this.procID,w,"unsupported image format");
                        break;
                    default:
                        msg = String.format(msg,this.procID,w,"unknown error");
                }
                throw new RuntimeException(msg);
            }

        } catch (IOException e) {
            logger.error("[{}] OCR process info : Command execute [pb.start()] failed !",this.procID,e);
        } catch (InterruptedException e) {
            // Restore the interrupt flag so callers can still see the interruption
            Thread.currentThread().interrupt();
            logger.error("[{}] OCR process info : Command execute [process.waitFor()] failed !",this.procID,e);
        }

        if(logger.isDebugEnabled()){
            logger.debug("[{}] OCR process info : OCR Process End.",this.procID);
        }else if(logger.isInfoEnabled()){
            logger.info("[{}] OCR process info : OCR Process End.",this.procID);
        }

        return txt_path;
    }

}
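The class above refers to a custom `NoSupportFileTypeException` and to a commented-out `OcrTools.getFileType(fileName)` call, neither of which is shown. The following is a minimal sketch of what they could look like. Everything here is an assumption for illustration: the `TYPE_IMAGE` code, the extension list, and the use of a primitive `int` return are hypothetical choices, not the author's implementation. The classes are package-private so that they fit in one file; in a real project each would normally be public in its own file.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

/** Checked exception for files that cannot be sent to OCR (hypothetical definition). */
class NoSupportFileTypeException extends Exception {
    NoSupportFileTypeException(String message) {
        super(message);
    }
}

class OcrTools {

    /** Hypothetical type code for image files. */
    static final int TYPE_IMAGE = 1;

    // Assumed list of image extensions; adjust it to what your tesseract build accepts
    private static final Set<String> IMAGE_EXTENSIONS = new HashSet<>(
            Arrays.asList("jpg", "jpeg", "png", "tif", "tiff", "bmp", "gif"));

    /** Maps a file name to a type code based on its extension. */
    static int getFileType(String fileName) throws NoSupportFileTypeException {
        int dot = fileName == null ? -1 : fileName.lastIndexOf('.');
        String ext = dot < 0 ? "" : fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
        if (IMAGE_EXTENSIONS.contains(ext)) {
            return TYPE_IMAGE;
        }
        throw new NoSupportFileTypeException("[" + fileName + "] , unrecognised file type.");
    }
}
```

With helpers along these lines, `getFileType` in `OcrProcess` could delegate to `OcrTools.getFileType(fileName)` instead of always returning null.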

4.1 Test

import org.apache.logging.log4j.core.config.ConfigurationSource;
import org.apache.logging.log4j.core.config.Configurator;
import org.junit.Before;
import org.junit.Test;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;

public class TestFileOCR {

    private static String log4jxml = "/your/log/config/path/log4j2.xml";

    @Before
    public void before() throws FileNotFoundException {
        File config=new File(log4jxml);
        ConfigurationSource source = new ConfigurationSource(new FileInputStream(config),config);
        Configurator.initialize(null, source);
    }

    @Test
    public void test_JPG_Linux(){
        String tessPath = "/your/tesseract/path/Tesseract-OCR";
        String img = "/your/test/file/path/test.jpg";

        OcrProcess ocr = new OcrProcess(tessPath);

        // Test saving the result to a different directory
//        try {
//            ocr.setSaveDir("/others/save/path");
//        } catch (FileNotFoundException e) {
//            e.printStackTrace();
//        }
        File file = new File(img);
        try {
            String path = ocr.doOCR(file, "chi_sim");
            System.out.println(path);
        }catch (Exception e){
            e.printStackTrace();
        }
    }

    @Test
    public void test_JPG_Windows(){

        String tessPath = "D:\\your\\tesseract\\path\\Tesseract-OCR";
        String img = "C:\\your\\test\\file\\path\\test.jpg";

        OcrProcess ocr = new OcrProcess(tessPath);
        // Test saving the result to a different directory
        try {
            ocr.setSaveDir("C:\\others\\save\\path");
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
        File file = new File(img);
        try {
            String path = ocr.doOCR(file, "chi_sim");
            System.out.println(path);
        }catch (Exception e){
            e.printStackTrace();
        }
    }
}
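One note on `processSpace` in section 4: it calls `replaceAll(" ", "")`, which removes every space. That suits pure Chinese text, but it also glues English words together in mixed-language pages. A possible alternative, sketched below under the assumption that only gaps between two Chinese characters are unwanted, removes spaces and tabs only when both neighbours are Han characters. The class and method names are hypothetical.

```java
import java.util.regex.Pattern;

class SpaceCleaner {

    // Spaces or tabs that sit directly between two Han characters
    private static final Pattern HAN_GAP =
            Pattern.compile("(?<=\\p{IsHan})[ \\t]+(?=\\p{IsHan})");

    /** Removes whitespace between Han characters and leaves all other spacing alone. */
    static String stripSpacesBetweenHan(String text) {
        return HAN_GAP.matcher(text).replaceAll("");
    }
}
```

Punctuation is not part of the Han script, so a space next to a full-width comma is kept by this pattern; widen the character classes if that matters for your documents.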

5. Third-party JAR

If the command-line approach is inconvenient, you can use Tess4j. The Maven dependency is as follows.

<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>3.4.3</version>
</dependency>

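Tess4j adds image processing on top of tesseract, such as enlarging an image before recognition, along with some learning features. In my tests, however, Tess4j used somewhat more memory than calling the command line directly.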

6. Real-world use

OCR is particularly memory-hungry, so it is best deployed as a standalone component that your services call through an API.

Below are some figures from my own application, where OCR is deployed separately. They are for reference only:

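Memory usage
1. With 5 threads running concurrently, memory stays between 8 and 12 GB.
2. With a single thread, memory stays between 3 and 4 GB.
Recognition throughput
1. 100 A4 pages at about 300 KB each, single thread: 20-24 minutes in total, or 12-16 seconds per page.
2. 10,000 A4 pages at about 300 KB each, 5 threads: 7-10 hours in total; the failure rate rises along the way and more retries are needed.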
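Given these figures, it seems reasonable to cap the number of concurrent OCR jobs and to retry the ones that fail. The following is a minimal sketch of that idea using a fixed thread pool. `OcrBatch`, `withRetry`, and `runAll` are hypothetical names, and the retry policy (immediate retry, fixed attempt count) is an assumption rather than what the original deployment used.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

class OcrBatch {

    /** Calls task up to maxAttempts times and returns the first successful result. */
    static <T> T withRetry(Callable<T> task, int maxAttempts) {
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = new RuntimeException("attempt " + attempt + " failed", e);
            }
        }
        throw last != null ? last : new IllegalArgumentException("maxAttempts must be >= 1");
    }

    /** Runs every task on a pool of at most `threads` workers, retrying each one. */
    static <T> List<T> runAll(List<Callable<T>> tasks, int threads, int maxAttempts) {
        ExecutorService pool = Executors.newFixedThreadPool(threads);
        try {
            List<Future<T>> futures = new ArrayList<>();
            for (Callable<T> task : tasks) {
                Callable<T> retried = () -> withRetry(task, maxAttempts);
                futures.add(pool.submit(retried));
            }
            // Collect results in submission order
            List<T> results = new ArrayList<>();
            for (Future<T> future : futures) {
                results.add(future.get());
            }
            return results;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        } catch (ExecutionException e) {
            throw new IllegalStateException(e.getCause());
        } finally {
            pool.shutdownNow();
        }
    }
}
```

Each task would wrap one `doOCR` call; with `threads` set to 5 this mirrors the five-thread setup measured above, while the pool size keeps memory use bounded.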