Chinese Text Recognition with Tesseract

Date: 2019-11-24

Tags: Chinese text recognition, Tesseract


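Installing Tesseract on Ubuntu
sudo apt-get install tesseract-ocr
Chinese language data is not installed by default, so to recognize Chinese you need to install the chi_sim data (the method shown here is the simple one).

Installing the Chinese data chi_sim
sudo apt-get install tesseract-ocr-chi-sim # the package name is chi-sim with a hyphen, not an underscore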
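To confirm that the chi_sim data is visible to Tesseract after installation, one option is to parse the output of `tesseract --list-langs`. The following is a minimal sketch rather than a definitive check: the helper names `parse_langs` and `installed_langs` are made up for illustration, and the "List of available languages" header line is assumed from typical Tesseract output.

```python
import subprocess


def parse_langs(output):
    """Extract language codes from `tesseract --list-langs` output.

    Assumption: the listing starts with a header line such as
    "List of available languages (2):" followed by one code per line.
    """
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return [line for line in lines if not line.lower().startswith("list of")]


def installed_langs():
    """Run the tesseract CLI and return the installed language codes."""
    result = subprocess.run(
        ["tesseract", "--list-langs"],
        stdout=subprocess.PIPE,
        stderr=subprocess.STDOUT,  # some releases print the list to stderr
        universal_newlines=True,
        check=True,
    )
    return parse_langs(result.stdout)
```

With Tesseract and the language package installed, `"chi_sim" in installed_langs()` should evaluate to True.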


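Installing on Ubuntu is simple, although the download can be slow.

$ sudo apt-get install tesseract-ocr

The default install directory is /usr/share/tesseract-ocr/. Chinese language data installed later goes in the tessdata/ folder under that directory (on some releases this folder sits inside a version subdirectory).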
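If it is unclear which tessdata folder a given machine uses, a small search over candidate directories can help. This is a sketch under assumptions: `find_traineddata` is a hypothetical helper, and the candidate list only contains the two paths mentioned in this post, so adjust it for your system.

```python
import os


def find_traineddata(lang, search_dirs):
    """Return the first existing <dir>/<lang>.traineddata path, or None."""
    for directory in search_dirs:
        candidate = os.path.join(directory, lang + ".traineddata")
        if os.path.isfile(candidate):
            return candidate
    return None


# Assumed candidates: the package-manager path and the source-build path
# from this post. Other systems may use different locations.
CANDIDATE_DIRS = [
    "/usr/share/tesseract-ocr/tessdata",
    "/usr/local/share/tessdata",
]

# Prints the path of chi_sim.traineddata if found, otherwise None.
print(find_traineddata("chi_sim", CANDIDATE_DIRS))
```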

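Building from source on CentOS:

You can refer to the following article:

http://blog.csdn.net/diandianxiyu_geek/article/details/50522582

All of my dependencies were already installed. If you find one missing, refer to the dependency list below. These commands use apt-get, so they apply to Debian/Ubuntu as written; on CentOS, install the equivalent packages with yum: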

sudo apt-get install g++ 
sudo apt-get install autoconf automake libtool
sudo apt-get install autoconf-archive
sudo apt-get install pkg-config
sudo apt-get install libpng12-dev
sudo apt-get install libjpeg8-dev
sudo apt-get install libtiff5-dev
sudo apt-get install zlib1g-dev

### The training tools need the following dependencies

sudo apt-get install libicu-dev
sudo apt-get install libpango1.0-dev
sudo apt-get install libcairo2-dev

Besides the dependencies above, you also need to build and install Leptonica:

$ wget http://www.leptonica.org/source/leptonica-1.72.tar.gz
$ tar xvzf leptonica-1.72.tar.gz
$ cd leptonica-1.72/
$ ./configure
$ make && sudo make install

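Once Leptonica is installed, download the tesseract source and enter the tesseract directory (a git checkout needs ./autogen.sh run first to generate the configure script):

$ ./configure && make && sudo make install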

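Next, install the tessdata language data:

Taking Chinese as the example, download the language data (link below).

$ wget https://github.com/tesseract-ocr/langdata/tree/master/chi_sim

Note that this URL is a GitHub directory page of training source data, so wget saves an HTML page rather than a model. The file Tesseract actually loads is chi_sim.traineddata, which is published in the tesseract-ocr/tessdata repository.

Put the chi_sim.traineddata file in the /usr/local/share/tessdata/ folder.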
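If the language data ends up somewhere other than the default folder, pytesseract can be pointed at it through Tesseract's `--tessdata-dir` option passed in the `config` argument. A minimal sketch, assuming pytesseract and Pillow are installed; the helper names `tessdata_config` and `ocr_with_tessdata` are illustrative and not part of any library.

```python
def tessdata_config(tessdata_dir):
    """Build a pytesseract config string that selects a tessdata directory."""
    return '--tessdata-dir "{}"'.format(tessdata_dir)


def ocr_with_tessdata(image_path, tessdata_dir, lang="chi_sim"):
    """OCR an image using language data from an explicit directory."""
    # Imported lazily so tessdata_config works without these packages.
    import pytesseract
    from PIL import Image

    with Image.open(image_path) as image:
        return pytesseract.image_to_string(
            image, lang=lang, config=tessdata_config(tessdata_dir)
        )
```

For the source build described above, a call might look like `ocr_with_tessdata('test.png', '/usr/local/share/tessdata')`.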

Usage

import pytesseract
from PIL import Image

# open image
image = Image.open('test.png')
code = pytesseract.image_to_string(image, lang='chi_sim')
print(code)
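Depending on the Tesseract version and model, the recognized Chinese text may come back with spaces between characters. If that happens, a small post-processing step can remove them. This is only a sketch: `strip_cjk_spaces` is a hypothetical helper, and it only covers the basic CJK Unified Ideographs range (U+4E00 to U+9FFF).

```python
import re

# Runs of spaces or tabs that sit directly between two CJK ideographs.
# Newlines are left alone so that line structure is preserved.
_CJK_GAP = re.compile(r"(?<=[\u4e00-\u9fff])[ \t]+(?=[\u4e00-\u9fff])")


def strip_cjk_spaces(text):
    """Remove spaces between adjacent Chinese characters, keep all others."""
    return _CJK_GAP.sub("", text)
```

It would be applied to the OCR result, for example `print(strip_cjk_spaces(code))`. Spaces next to Latin letters or digits are kept, so mixed Chinese and English text stays readable.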
