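Development of the Tesseract OCR engine began at HP Labs in 1985, and by 1995 it had become one of the three most accurate OCR engines in the industry. However, HP soon decided to abandon its OCR business, and Tesseract was shelved.
Several years later, HP realized that rather than leaving Tesseract on the shelf, it would be better to contribute it to the open-source community and give it a new life. In 2005, Tesseract was obtained by the Information Technology Research Institute in Nevada, USA, which asked Google to improve Tesseract, fix bugs, and optimize it.
After fixing several of the most important bugs, Google considered Tesseract OCR stable enough to be re-released as open-source software.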
Tesseract GitHub repository
Set the TESSDATA_PREFIX environment variable. Its value is your Tesseract path, for example D:\abc\def\Tesseract-OCR
If VC 2015 is not installed, you will get errors such as gs not being found.
If the environment variable is not configured, the following error appears when you run Tesseract:
Please make sure the TESSDATA_PREFIX environment variable is set to the parent directory of your "tessdata" directory
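A quick way to catch this misconfiguration before calling Tesseract is to check the variable from Java. The sketch below is only an illustration: the class name `TessdataCheck` and the method `diagnose` are hypothetical, and it assumes the layout named in the error message above (the variable points at the parent directory of `tessdata`). Other Tesseract versions may expect a different layout.

```java
import java.io.File;

public class TessdataCheck {

    /**
     * Returns a human-readable diagnosis for a TESSDATA_PREFIX value.
     * Assumption: the variable should point at the parent directory
     * of "tessdata", as the error message in the article states.
     */
    public static String diagnose(String prefix) {
        if (prefix == null || prefix.trim().isEmpty()) {
            return "TESSDATA_PREFIX is not set";
        }
        File tessdata = new File(prefix, "tessdata");
        if (!tessdata.isDirectory()) {
            return "No tessdata directory under " + prefix;
        }
        return "OK: " + tessdata.getPath();
    }

    public static void main(String[] args) {
        // Reads the real environment variable of the current process.
        System.out.println(diagnose(System.getenv("TESSDATA_PREFIX")));
    }
}
```

Running `main` prints one of the three messages, which can help tell a missing variable apart from a wrong path.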
If the VC 2015 installation fails with error 0x80240017 (Unspecified error), see here for a fix.
:: cd to the corresponding directory
tesseract sourcefile.jpg savename -l chi_sim
Install tesseract with Homebrew; brew installs the dependencies automatically.
brew install tesseract
tesseract sourcefile.jpg savename -l chi_sim
If your project is not a Maven project, skip this step.
<properties>
    <java.version>1.8</java.version>
    <log4j2.version>2.1</log4j2.version>
</properties>

<dependencies>
    <dependency>
        <groupId>junit</groupId>
        <artifactId>junit</artifactId>
        <version>4.12</version>
        <scope>test</scope>
    </dependency>
    <!-- logging -->
    <dependency>
        <groupId>org.apache.logging.log4j</groupId>
        <artifactId>log4j-core</artifactId>
        <version>${log4j2.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.logging.log4j</groupId>
        <artifactId>log4j-api</artifactId>
        <version>${log4j2.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.logging.log4j</groupId>
        <artifactId>log4j-web</artifactId>
        <version>${log4j2.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.logging.log4j</groupId>
        <artifactId>log4j-slf4j-impl</artifactId>
        <version>${log4j2.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.logging.log4j</groupId>
        <artifactId>log4j-jcl</artifactId>
        <version>${log4j2.version}</version>
    </dependency>
    <dependency>
        <groupId>org.apache.directory.studio</groupId>
        <artifactId>org.apache.commons.lang</artifactId>
        <version>2.6</version>
    </dependency>
</dependencies>
For the detailed log4j2 configuration, see: Upgrading Log4j1 to Log4j2 in Practice
<?xml version="1.0" encoding="UTF-8"?>
<Configuration status="warn" monitorInterval="600">
    <Properties>
        <property name="BASE_LOG_PATTERN">%5p %d{yyyy-MM-dd HH:mm:ss.SSS} [%t] (%class{36}:%L)%M - %m%n</property>
        <property name="LOG_DIR_HOME">logs</property>
        <property name="BASE_LOG_FILENAME">nuna-ocr</property>
    </Properties>
    <Appenders>
        <Console name="Console" target="SYSTEM_OUT">
            <ThresholdFilter level="trace" onMatch="ACCEPT" onMismatch="DENY" immediateFlush="true"/>
            <PatternLayout pattern="${BASE_LOG_PATTERN}" />
        </Console>
        <RollingRandomAccessFile name="stdout_appender" immediateFlush="true"
                fileName="${LOG_DIR_HOME}/${BASE_LOG_FILENAME}-stdout.log"
                filePattern="${LOG_DIR_HOME}/${BASE_LOG_FILENAME}-stdout-%d{yyyy-MM-dd}_%i.log.gz">
            <PatternLayout>
                <pattern>${BASE_LOG_PATTERN}</pattern>
            </PatternLayout>
            <Policies>
                <TimeBasedTriggeringPolicy modulate="true" interval="1"/>
                <SizeBasedTriggeringPolicy size="5120 KB"/>
            </Policies>
            <DefaultRolloverStrategy max="3"/>
        </RollingRandomAccessFile>
        <RollingRandomAccessFile name="error_appender" immediateFlush="true"
                fileName="${LOG_DIR_HOME}/${BASE_LOG_FILENAME}-error.log"
                filePattern="${LOG_DIR_HOME}/${BASE_LOG_FILENAME}-error-%d{yyyy-MM-dd}_%i.log.gz">
            <PatternLayout>
                <pattern>${BASE_LOG_PATTERN}</pattern>
            </PatternLayout>
            <Policies>
                <TimeBasedTriggeringPolicy modulate="true" interval="1"/>
                <SizeBasedTriggeringPolicy size="5120 KB"/>
            </Policies>
            <Filters>
                <ThresholdFilter level="warn" onMatch="ACCEPT" onMismatch="DENY"/>
            </Filters>
            <DefaultRolloverStrategy max="3"/>
        </RollingRandomAccessFile>
    </Appenders>
    <Loggers>
        <logger name="com.liu.app.ocr" level="debug" additivity="false" >
            <appender-ref ref="Console" />
            <appender-ref ref="stdout_appender" />
            <appender-ref ref="error_appender" />
        </logger>
        <root level="info" includeLocation="true">
            <appender-ref ref="Console" />
            <appender-ref ref="stdout_appender" />
            <appender-ref ref="error_appender"/>
        </root>
    </Loggers>
</Configuration>
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

import org.apache.commons.lang.StringUtils;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

public class OcrProcess {

    private static final Logger logger = LogManager.getLogger(OcrProcess.class);

    private String tessPath = "";
    private String ocr_psm_num = "3";
    private String ocr_language = "eng";
    private String ocr_result_path = "";
    private String ocr_result_file_path = "";
    private final String OS = System.getProperty("os.name");
    private final String ocr_lang_option = "-l";
    // Tesseract 3.x spells this option "-psm"; Tesseract 4 and later expect "--psm".
    public final String ocr_psm_option = "-psm";
    private long procID = 0;

    /** Line separator for text files. */
    private final String EOL = System.getProperty("line.separator");

    /** Path separator of the current operating system. */
    private final String FIS = File.separator;

    /**
     * Constructor.
     * @param tessPath local tesseract path
     */
    public OcrProcess(String tessPath) {
        this.tessPath = tessPath;
        if (logger.isDebugEnabled()) {
            logger.debug("OcrProcess Created .");
            logger.debug("OcrProcess Current OS >>> {} ", this.OS);
            logger.debug("OcrProcess User tessPath >>> {} ", this.tessPath);
        } else if (logger.isInfoEnabled()) {
            logger.info("OcrProcess Created .");
        }
    }

    /**
     * Sets the page segmentation mode (0-10).
     *  0 = Orientation and script detection (OSD) only.
     *  1 = Automatic page segmentation with OSD.
     *  2 = Automatic page segmentation, but no OSD, or OCR.
     *  3 = Fully automatic page segmentation, but no OSD. (Default)
     *  4 = Assume a single column of text of variable sizes.
     *  5 = Assume a single uniform block of vertically aligned text.
     *  6 = Assume a single uniform block of text.
     *  7 = Treat the image as a single text line. (content on one horizontal line;
     *      can improve recognition efficiency for single-line text)
     *  8 = Treat the image as a single word.
     *  9 = Treat the image as a single word in a circle.
     * 10 = Treat the image as a single character.
     * @param psm see the tesseract documentation
     */
    public void setPageSegMode(Integer psm) {
        if (logger.isDebugEnabled()) {
            logger.debug("OcrProcess User set OCR Process PageSegMode >>> {}", psm);
        } else if (logger.isInfoEnabled()) {
            logger.info("OcrProcess User set OCR Process PageSegMode >>> {}", psm);
        }
        if (psm == null) {
            throw new IllegalArgumentException("param psm is null");
        }
        if (psm > 10 || psm < 0) {
            throw new IllegalArgumentException("param psm must be between 0 and 10");
        }
        this.ocr_psm_num = String.valueOf(psm);
    }

    /**
     * Sets the directory where results are saved.
     * @param savePath an existing directory
     */
    public void setSaveDir(String savePath) throws FileNotFoundException {
        if (logger.isDebugEnabled()) {
            logger.debug("OcrProcess User set OCR Process Result TXT SavePath >>> {}", savePath);
        } else if (logger.isInfoEnabled()) {
            logger.info("OcrProcess User set OCR Process Result TXT SavePath >>> {}", savePath);
        }
        File saveDir = new File(savePath);
        if (!saveDir.exists()) {
            throw new FileNotFoundException("the savePath is not found!");
        }
        this.ocr_result_path = savePath;
    }

    /**
     * Runs OCR on a file.
     * @author alexliu
     * @date 2017-11-22 14:39:42
     * @param file the file to recognize
     * @param language language, e.g. chi_sim, eng
     * @return path of the recognized text file
     * @throws NoSupportFileTypeException custom exception (not shown in this article)
     */
    public String doOCR(File file, String language) throws NoSupportFileTypeException {
        // Create an id for this OCR run, useful for logs and data records
        //this.procID = OcrTools.createProcessId();
        this.procID = new Random().nextInt(1000);

        if (logger.isDebugEnabled()) {
            logger.debug("OcrProcess Begin. The file >>> [{}], the procID is [{}] .", file.getName(), this.procID);
        } else if (logger.isInfoEnabled()) {
            logger.info("OcrProcess Begin. The file >>> [{}], the procID is [{}] . See More info PLZ set logger Debug or Trace.", file.getName(), this.procID);
        }

        String textSavePath = "";
        if (logger.isDebugEnabled()) {
            logger.debug("[{}] OCR process info : FilePath >>> {}", this.procID, file.getAbsolutePath());
            logger.debug("[{}] OCR process info : FileName >>> {}", this.procID, file.getName());
            logger.debug("[{}] OCR process info : FileSize >>> {} KB", this.procID, (file.length() / 1024f));
            logger.debug("[{}] OCR process info : OCR Language >>> {}", this.procID, language);
        }
        this.ocr_language = language;

        // Get the file type. Kept as Integer: getFileType may return null,
        // and unboxing null into an int would throw a NullPointerException.
        Integer fileType = this.getFileType(file.getName());
        if (logger.isDebugEnabled()) {
            logger.debug("[{}] OCR process info : FileType >>> {}", this.procID, fileType);
        }
        if (logger.isDebugEnabled()) {
            logger.debug("[{}] OCR process info : The FileType is Image .", this.procID);
        }
        textSavePath = callTesseractCommand(file);
        return textSavePath;
    }

    /**
     * Gets the output directory.
     * If the user called `setSaveDir`, OCR output goes to that directory.
     * Otherwise OCR output goes to the same directory as the source file.
     * @param sourceFile the source file
     * @return the output directory
     */
    private String getOutPutDir(File sourceFile) {
        if (StringUtils.isEmpty(this.ocr_result_path)) {
            return sourceFile.getAbsoluteFile().getParent();
        }
        return this.ocr_result_path;
    }

    /**
     * Gets the file type.
     * @param fileName file name
     * @return the file type, or null when it cannot be determined
     * @throws NoSupportFileTypeException custom exception
     */
    private Integer getFileType(String fileName) throws NoSupportFileTypeException {
        // Decide from the file extension whether OCR can be run on this file type
        //Integer type = OcrTools.getFileType(fileName);
        Integer type = null;
        if (type == null) {
            // NoSupportFileTypeException is a custom exception; put your own handling here
            //throw new NoSupportFileTypeException("[" + fileName + "] , unrecognized file type.");
        }
        return type;
    }

    /**
     * Creates the ProcessBuilder.
     * @param sourceFile the source file
     * @return the configured ProcessBuilder
     */
    private ProcessBuilder createProcessBuilder(File sourceFile) {
        ProcessBuilder pb = new ProcessBuilder();
        if (this.OS.startsWith("Mac OS") || this.OS.startsWith("Linux")) {
            // Working directory: on Linux and Mac OS, use the tesseract directory,
            // because tesseract is not installed system-wide and has not been added
            // to the environment variables.
            // Note: a bare "tesseract" command is still looked up through PATH.
            pb.directory(new File(this.tessPath));
            if (logger.isDebugEnabled()) {
                logger.debug("[{}] OCR process info : Set ProcessBuilder working dir >>> {}", this.procID, this.tessPath);
            }
        } else if (this.OS.startsWith("Windows")) {
            // Working directory: on Windows, use the directory of the file to parse
            pb.directory(sourceFile.getParentFile());
            if (logger.isDebugEnabled()) {
                logger.debug("[{}] OCR process info : Set ProcessBuilder working dir >>> {}", this.procID, sourceFile.getParentFile());
            }
        }
        // Merge the error stream into standard output
        pb.redirectErrorStream(true);
        return pb;
    }

    /**
     * Creates the command line.
     * @param sourceFile the source file
     * @return the command and its arguments
     */
    private List<String> createCommand(File sourceFile) {
        List<String> cmd = new ArrayList<String>();
        // OCR output directory
        String result_out_dir = getOutPutDir(sourceFile);
        // Output file name passed to the OCR command, without extension
        String ocr_result_filename = sourceFile.getName().substring(0, sourceFile.getName().lastIndexOf("."));
        // OCR output file path, with extension
        String result_out_filePath = result_out_dir + this.FIS + ocr_result_filename + ".txt";
        this.ocr_result_file_path = result_out_filePath;

        if (logger.isDebugEnabled()) {
            logger.debug("[{}] OCR process info : Result SaveDir >>> {}", this.procID, result_out_dir);
            logger.debug("[{}] OCR process info : Result SaveFilePath >>> {}", this.procID, result_out_filePath);
        } else if (logger.isInfoEnabled()) {
            logger.info("[{}] OCR process info : Result SaveFilePath >>> {}", this.procID, result_out_filePath);
        }

        if (this.OS.startsWith("Mac OS") || this.OS.startsWith("Linux")) {
            cmd.add("tesseract");
            // On Linux or Mac the working directory is the tesseract directory,
            // so the source file name must include its path
            cmd.add(sourceFile.getAbsolutePath());
        } else if (this.OS.startsWith("Windows")) {
            cmd.add(this.tessPath + this.FIS + "tesseract");
            // On Windows the working directory is the file's directory,
            // so the source file name has no path, only the file name
            cmd.add(sourceFile.getName());
        }
        cmd.add(result_out_dir + this.FIS + ocr_result_filename);
        cmd.add(this.ocr_psm_option);
        cmd.add(this.ocr_psm_num);
        cmd.add(this.ocr_lang_option);
        cmd.add(this.ocr_language);

        if (logger.isDebugEnabled()) {
            logger.debug("[{}] OCR process info : The Command >>> {}", this.procID, cmd.toString());
        } else if (logger.isInfoEnabled()) {
            logger.info("[{}] OCR process info : The Command >>> {}", this.procID, cmd.toString());
        }
        return cmd;
    }

    /**
     * Removes the space characters from the recognized text.
     * @param txtFilePath path of the OCR result file
     * @throws FileNotFoundException if the result file does not exist
     */
    private void processSpace(String txtFilePath) throws FileNotFoundException {
        File txtFile = new File(txtFilePath);
        if (!txtFile.exists()) {
            throw new FileNotFoundException("OCR process Result file does not exist!");
        }
        StringBuilder sb = new StringBuilder();
        // Read the file
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(new FileInputStream(txtFile), StandardCharsets.UTF_8))) {
            String str;
            while ((str = br.readLine()) != null) {
                sb.append(str).append(this.EOL);
            }
        } catch (Exception e) {
            logger.error("[{}] OCR process space read file failed .", this.procID, e);
            return; // do not overwrite the file when reading failed
        }
        // Strip the spaces and write the file back
        try (BufferedWriter bw = new BufferedWriter(
                new OutputStreamWriter(new FileOutputStream(txtFile), StandardCharsets.UTF_8))) {
            bw.write(sb.toString().replace(" ", ""));
        } catch (Exception e) {
            logger.error("[{}] OCR process space write file failed .", this.procID, e);
        }
    }

    /**
     * Logs the output of a failed command.
     * @param process the finished process
     */
    private void printCommandError(Process process) {
        // Read the command output line by line until it is exhausted
        try (BufferedReader br = new BufferedReader(new InputStreamReader(process.getInputStream()))) {
            String line;
            while ((line = br.readLine()) != null) {
                logger.warn("[{}] OCR process warning : {} ", this.procID, line);
            }
        } catch (Exception e) {
            logger.warn("[{}] OCR process print command error failed!", this.procID, e);
        }
    }

    /**
     * Runs OCR through the command line.
     * @author alexliu
     * @date 2017-11-22 14:47:25
     * @param imageFile the image to recognize
     * @return path of the recognized text file
     */
    private String callTesseractCommand(File imageFile) {
        String txt_path = "";
        List<String> command = this.createCommand(imageFile);
        ProcessBuilder pb = this.createProcessBuilder(imageFile);
        // Add the command line
        pb.command(command);
        if (logger.isDebugEnabled()) {
            logger.debug("[{}] OCR process info : Add command to ProcessBuilder", this.procID);
        }

        Process process = null;
        try {
            if (logger.isDebugEnabled()) {
                logger.debug("[{}] OCR process info : Execute OCR Command.", this.procID);
            }
            process = pb.start();
            int w = process.waitFor();
            if (logger.isDebugEnabled()) {
                logger.debug("[{}] OCR process info : Command execute result >>> {}", this.procID, w);
            } else if (logger.isInfoEnabled()) {
                logger.info("[{}] OCR process info : Command execute result >>> {}", this.procID, w);
            }

            if (w == 0) {
                txt_path = this.ocr_result_file_path;
                // Strip spaces
                this.processSpace(txt_path);
            } else {
                // Log the command output
                this.printCommandError(process);
                String msg = "[%s] execute command failed ! Result %d , Reason : %s !";
                switch (w) {
                    case 1:
                        msg = String.format(msg, this.procID, w, "Cannot access the file; the file name may contain spaces or other special characters");
                        break;
                    case 29:
                        msg = String.format(msg, this.procID, w, "Cannot recognize the image or its selected region");
                        break;
                    case 31:
                        msg = String.format(msg, this.procID, w, "Unsupported image format");
                        break;
                    default:
                        msg = String.format(msg, this.procID, w, "Unknown error");
                }
                throw new RuntimeException(msg);
            }
        } catch (IOException e) {
            logger.error("[{}] OCR process info : Command execute [pb.start()] failed !", this.procID, e);
        } catch (InterruptedException e) {
            // Restore the interrupt flag so callers can still observe it
            Thread.currentThread().interrupt();
            logger.error("[{}] OCR process info : Command execute [process.waitFor()] failed !", this.procID, e);
        }

        if (logger.isDebugEnabled()) {
            logger.debug("[{}] OCR process info : OCR Process End.", this.procID);
        } else if (logger.isInfoEnabled()) {
            logger.info("[{}] OCR process info : OCR Process End.", this.procID);
        }
        return txt_path;
    }
}
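The class above leaves the file-type check (`OcrTools.getFileType`) commented out, and that helper is not shown in this article. As a rough idea of what an extension-based check could look like, here is a hypothetical stand-in. The class name `OcrFileTypes`, its methods, and the list of extensions are all assumptions for illustration, not the author's implementation; adjust the list to what your Tesseract build can actually read.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

public class OcrFileTypes {

    // Assumed set of image extensions, chosen for illustration only.
    private static final Set<String> IMAGE_EXTENSIONS =
            new HashSet<String>(Arrays.asList("jpg", "jpeg", "png", "bmp", "tif", "tiff"));

    /** Returns the lower-cased extension without the dot, or "" if there is none. */
    public static String extensionOf(String fileName) {
        if (fileName == null) {
            return "";
        }
        int dot = fileName.lastIndexOf('.');
        if (dot < 0 || dot == fileName.length() - 1) {
            return "";
        }
        return fileName.substring(dot + 1).toLowerCase(Locale.ROOT);
    }

    /** True when the extension is in the assumed image list. */
    public static boolean isSupportedImage(String fileName) {
        return IMAGE_EXTENSIONS.contains(extensionOf(fileName));
    }
}
```

A caller could run `OcrFileTypes.isSupportedImage(file.getName())` before starting the OCR process and raise its own exception for unsupported files. Checking only the extension is cheap but does not inspect the file content.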
import org.apache.logging.log4j.core.config.ConfigurationSource;
import org.apache.logging.log4j.core.config.Configurator;
import org.junit.Before;
import org.junit.Test;

import java.io.File;
import java.io.FileInputStream;
import java.io.FileNotFoundException;

public class TestFileOCR {

    private static String log4jxml = "/your/log/config/path/log4j2.xml";

    @Before
    public void before() throws FileNotFoundException {
        File config = new File(log4jxml);
        ConfigurationSource source = new ConfigurationSource(new FileInputStream(config), config);
        Configurator.initialize(null, source);
    }

    @Test
    public void test_JPG_Linux() {
        String tessPath = "/your/tesseract/path/Tesseract-OCR";
        String img = "/your/test/file/path/test.jpg";
        OcrProcess ocr = new OcrProcess(tessPath);
        // Test saving to another directory
        // try {
        //     ocr.setSaveDir("/others/save/path");
        // } catch (FileNotFoundException e) {
        //     e.printStackTrace();
        // }
        File file = new File(img);
        try {
            String path = ocr.doOCR(file, "chi_sim");
            System.out.println(path);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    @Test
    public void test_JPG_Windows() {
        String tessPath = "D:\\your\\tesseract\\path\\Tesseract-OCR";
        String img = "C:\\your\\test\\file\\path\\test.jpg";
        OcrProcess ocr = new OcrProcess(tessPath);
        // Test saving to another directory
        try {
            ocr.setSaveDir("C:\\others\\save\\path");
        } catch (FileNotFoundException e) {
            e.printStackTrace();
        }
        File file = new File(img);
        try {
            String path = ocr.doOCR(file, "chi_sim");
            System.out.println(path);
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}
If you find the command-line approach inconvenient, you can use Tess4j. Add it through Maven as follows.
<dependency>
    <groupId>net.sourceforge.tess4j</groupId>
    <artifactId>tess4j</artifactId>
    <version>3.4.3</version>
</dependency>
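Tess4j adds richer image processing on top of tesseract: it can enlarge an image before recognition, and it also has some learning features. In my tests, however, Tess4j used somewhat more memory than calling the command line directly.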
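OCR is a particularly memory-intensive operation. I recommend building it as a separately deployed component that your services call through an API.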
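As one possible shape for such a standalone component, the sketch below wraps an OCR function in a tiny HTTP endpoint using the JDK's built-in `com.sun.net.httpserver`. Everything here is an assumption for illustration: the class name `OcrHttpService`, the `/ocr` path, the `file` and `lang` query parameters, and the fixed-size thread pool used to cap concurrent jobs (and therefore memory). It is not how the author's service is built.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UnsupportedEncodingException;
import java.net.InetSocketAddress;
import java.net.URLDecoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.concurrent.Executors;
import java.util.function.BiFunction;

public class OcrHttpService {

    /** Parses "a=1&b=2" into a map, URL-decoding keys and values. */
    public static Map<String, String> parseQuery(String rawQuery) {
        Map<String, String> params = new LinkedHashMap<String, String>();
        if (rawQuery == null || rawQuery.isEmpty()) {
            return params;
        }
        for (String pair : rawQuery.split("&")) {
            int eq = pair.indexOf('=');
            String key = eq < 0 ? pair : pair.substring(0, eq);
            String value = eq < 0 ? "" : pair.substring(eq + 1);
            params.put(decode(key), decode(value));
        }
        return params;
    }

    private static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException(e); // UTF-8 is always available
        }
    }

    /**
     * Starts a server that answers GET /ocr?file=...&lang=... by calling
     * the supplied function (file path, language) -> result text.
     * maxConcurrentJobs bounds how many OCR calls run at the same time.
     */
    public static HttpServer start(int port, int maxConcurrentJobs,
                                   BiFunction<String, String, String> ocr) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress(port), 0);
        server.createContext("/ocr", exchange -> {
            Map<String, String> query = parseQuery(exchange.getRequestURI().getRawQuery());
            int status = 200;
            String body;
            try {
                body = ocr.apply(query.get("file"), query.getOrDefault("lang", "eng"));
            } catch (RuntimeException e) {
                status = 500;
                body = e.getMessage();
            }
            byte[] bytes = (body == null ? "" : body).getBytes(StandardCharsets.UTF_8);
            exchange.sendResponseHeaders(status, bytes.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(bytes);
            }
        });
        server.setExecutor(Executors.newFixedThreadPool(maxConcurrentJobs));
        server.start();
        return server;
    }
}
```

The function passed to `start` could delegate to the `OcrProcess` class from this article, wrapping its checked exception in a `RuntimeException`. Accepting raw file paths over HTTP is only acceptable inside a trusted network; a real service would validate the input or accept uploads instead.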
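Below are some figures from my application, with OCR deployed independently. For reference only:
Memory consumption
Recognition efficiency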