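I had heard about this Google recognition (OCR) project a long time ago, and after trying it I found it really is impressive. Here are my notes from a first, simple use.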
Google's open-source OCR project (Tesseract)
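I downloaded these two files; the chi one is the add-on needed to recognize Chinese. Only the .exe has to be installed. After that, configure the environment variables so that `tesseract` can be run from the command line: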
C:\Users\27569>tesseract
Usage:
  tesseract --help | --help-extra | --version
  tesseract --list-langs
  tesseract imagename outputbase [options...] [configfile...]

OCR options:
  -l LANG[+LANG]        Specify language(s) used for OCR.
NOTE: These options must occur before any configfile.

Single options:
  --help                Show this help message.
  --help-extra          Show extra help for advanced users.
  --version             Show version information.
  --list-langs          List available languages for tesseract engine.
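To confirm from Python that the Chinese data was picked up, the `--list-langs` option shown in the help text above can be called through `subprocess`. This is only a minimal sketch: the helper names `parse_langs` and `list_tesseract_langs` are made up for illustration, and the parser assumes the output is a "List of available languages ..." header followed by one language code per line.

```python
import shutil
import subprocess


def parse_langs(text):
    """Extract language codes from `tesseract --list-langs` output."""
    langs = []
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines and the "List of available languages ..." header.
        if not line or line.startswith("List of"):
            continue
        langs.append(line)
    return langs


def list_tesseract_langs(cmd="tesseract"):
    """Return installed language codes, or [] if `cmd` is not on PATH."""
    exe = shutil.which(cmd)
    if exe is None:
        return []
    result = subprocess.run(
        [exe, "--list-langs"],
        capture_output=True, text=True, check=False,
    )
    # Prefer stdout; fall back to stderr in case a build prints the list there.
    return parse_langs(result.stdout or result.stderr)


if __name__ == "__main__":
    langs = list_tesseract_langs()
    print("installed languages:", langs)
    print("Chinese available:", "chi_sim" in langs)
```

If `chi_sim` is missing from the list, the Chinese trained data was probably not installed where tesseract looks for it.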
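Next, a test that calls it from Python on Windows. As I recall, my program did not work the first time; it only ran successfully after I changed a path in the source code of the tesseract file.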
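Editing library source is not the only way around a path problem. pytesseract exposes the executable path as the module attribute `pytesseract.pytesseract.tesseract_cmd`, which can be set at runtime. The sketch below is an assumption about what the failure was (the executable not being found). The helper `resolve_tesseract_cmd` and the install path in `guesses` are hypothetical and should be adjusted to the actual install location.

```python
import os
import shutil


def resolve_tesseract_cmd(name="tesseract", fallbacks=()):
    """Return a path to the tesseract executable, or None if not found."""
    found = shutil.which(name)  # search PATH first
    if found:
        return found
    for path in fallbacks:  # then try explicit candidate locations
        if os.path.isfile(path):
            return path
    return None


if __name__ == "__main__":
    # Hypothetical install location; change it to where the .exe was installed.
    guesses = [r"C:\Program Files\Tesseract-OCR\tesseract.exe"]
    cmd = resolve_tesseract_cmd(fallbacks=guesses)
    if cmd is None:
        print("tesseract executable not found")
    else:
        try:
            import pytesseract
            # pytesseract reads the executable path from this attribute.
            pytesseract.pytesseract.tesseract_cmd = cmd
            print("using", cmd)
        except ImportError:
            print("pytesseract is not installed")
```

Setting the attribute once at program start keeps the fix in your own code, so it survives a reinstall or upgrade of the package.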
requirements.txt
pillow
pytesseract
run.py
import io
import re

import pytesseract
from PIL import Image


class Ocr:
    def __init__(self):
        # Patterns for dates and times; raw strings keep \d intact
        self.day_re = re.compile(r'(\d{4}-\d{2}-\d{2})')
        self.daytime_re1 = re.compile(r'(\d{2}:\d{2})')
        self.daytime_re2 = re.compile(r'(\d{2}:\d{2}-\d{2}:\d{2})')

    def prepare_img(self, img):
        """Preprocess the image to improve the recognition rate."""
        img = img.convert('L')  # convert to grayscale
        threshold = 200  # tune per image; 127 is another common value
        # Lookup table: values below the threshold map to 0 (black), the rest to 1 (white)
        table = [0 if i < threshold else 1 for i in range(256)]
        return img.point(table, '1')

    def ocr(self, img):
        """Run recognition."""
        img = self.prepare_img(img)
        # lang: eng for English, chi_sim for Chinese (requires the trained data)
        # --psm 7 treats the image as a single line of text
        return pytesseract.image_to_string(img, lang='eng', config='--psm 7')


if __name__ == '__main__':
    c = Ocr()

    # First way to open an image with Image.open(): from bytes held in memory
    with open('0.jpg', 'rb') as f:
        image_binary = f.read()
    byte_arr = io.BytesIO(image_binary)
    img = Image.open(byte_arr)
    print(c.ocr(img))

    # Second way to open an image with Image.open(): directly from a file path
    img = Image.open('0.jpg')
    print(c.ocr(img))
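The `Ocr` class compiles three patterns (a date, a single time, and a time range) but run.py never applies them. One plausible use, and this is only a guess at the intent, is pulling those fields out of the recognized text. The function name `extract_schedule` and the sample strings are made up for illustration.

```python
import re

# Same patterns as in run.py, written as raw strings.
DAY_RE = re.compile(r'(\d{4}-\d{2}-\d{2})')
TIME_RANGE_RE = re.compile(r'(\d{2}:\d{2}-\d{2}:\d{2})')
TIME_RE = re.compile(r'(\d{2}:\d{2})')


def extract_schedule(text):
    """Return (date, time-or-range) found in OCR text; None where absent."""
    day = DAY_RE.search(text)
    # Try the range first: the single-time pattern would also match its start.
    time = TIME_RANGE_RE.search(text) or TIME_RE.search(text)
    return (day.group(1) if day else None,
            time.group(1) if time else None)


if __name__ == "__main__":
    print(extract_schedule("2019-03-05 09:30-11:00"))
    print(extract_schedule("starts 14:00"))
```

OCR output often has stray spaces or misread characters (for example `O` for `0`), so real input may need cleaning before these patterns match.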