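This post records the installation and configuration needed to recognize text in images with Python.
Tesseract-OCR is an open-source OCR engine whose development was sponsored by Google for many years.
Download: https://github.com/tesseract-ocr/tesseract/wiki/Downloads
After downloading and installing it, add the Tesseract-OCR install directory to the system PATH.
During installation, tick Simplified Chinese and leave everything else at the defaults. Once it finishes, run the following commands to check that the install works and which languages are supported: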
tesseract
tesseract -v
tesseract --list-langs  # list the languages Tesseract-OCR supports
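The same check can be done from Python. The following is a minimal sketch, not part of the original setup: the helper names `parse_langs` and `installed_langs` are made up for illustration, and it assumes that language codes such as chi_sim contain no spaces while the header line (and any warning lines) do.

```python
import subprocess


def parse_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`."""
    langs = []
    for line in output.splitlines():
        line = line.strip()
        # Assumption: real language codes have no whitespace, whereas the
        # "List of available languages ..." header and warnings do.
        if not line or " " in line:
            continue
        langs.append(line)
    return langs


def installed_langs(cmd="tesseract"):
    """Run `tesseract --list-langs` and return the reported language codes."""
    result = subprocess.run(
        [cmd, "--list-langs"], capture_output=True, text=True, check=True
    )
    # Some builds print the list to stderr, so both streams are scanned.
    return parse_langs(result.stdout + "\n" + result.stderr)
```

Calling `"chi_sim" in installed_langs()` then tells you whether the Simplified Chinese data was picked up. If tesseract is not on PATH, `subprocess.run` raises `FileNotFoundError`, which is itself a useful signal that the PATH step was missed.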
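Chinese language data: chi_sim.traineddata
Download: https://github.com/tesseract-ocr/tesseract/wiki/Data-Files
Put the Chinese language data file in the \Tesseract-OCR\tessdata folder.
Edit this file:
C:\Python3\Lib\site-packages\pytesseract\pytesseract.py (adjust to your actual path) and find these two lines: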
# CHANGE THIS IF TESSERACT IS NOT IN YOUR PATH, OR IS NAMED DIFFERENTLY
tesseract_cmd = 'tesseract'
Change them to:
# CHANGE THIS IF TESSERACT IS NOT IN YOUR PATH, OR IS NAMED DIFFERENTLY
#tesseract_cmd = 'tesseract'
tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract.exe'
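A change made inside site-packages is lost whenever pytesseract is reinstalled or upgraded. As an alternative, pytesseract lets you set the same variable from your own script through `pytesseract.pytesseract.tesseract_cmd`. The sketch below shows one way to do that. The helper names `resolve_tesseract_cmd` and `configure_pytesseract` are hypothetical, and the fallback path is simply the install location used in this post, so adjust it to your machine.

```python
import shutil
from pathlib import Path


def resolve_tesseract_cmd(fallback, name="tesseract"):
    """Return the executable found on PATH, else the fallback path if it exists."""
    found = shutil.which(name)
    if found:
        return found
    if Path(fallback).is_file():
        return str(fallback)
    raise FileNotFoundError(f"{name} is not on PATH and {fallback} does not exist")


def configure_pytesseract(
    fallback=r"C:/Program Files (x86)/Tesseract-OCR/tesseract.exe",
):
    """Point pytesseract at the executable without editing the library file."""
    import pytesseract  # imported here so the helper above works without it

    pytesseract.pytesseract.tesseract_cmd = resolve_tesseract_cmd(fallback)
    return pytesseract.pytesseract.tesseract_cmd
```

Call `configure_pytesseract()` once at the top of your script, before any `image_to_string` call. Preferring the PATH lookup means the script keeps working on machines where Tesseract is installed somewhere else.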
Code:
(Write a few characters, take a screenshot, and save it as 1.png)
import pytesseract
from PIL import Image

text = pytesseract.image_to_string(Image.open('1.png'), lang='chi_sim')
print(text)