Using the C# Version of the Tesseract Library

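The previous post, "A First Look at the OCR Library Tesseract" (OCR庫Tesseract初探), introduced the Tesseract library and mentioned at the end that, although Tesseract itself is written in C/C++, an open-source C# version also exists. This post shows how to use that C# version.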

Source code for the C# version: https://github.com/charlesw/tesseract

In practice, you can get it directly through NuGet in Visual Studio:

Open the NuGet package manager, search for "tesseract", and click Install.

 

The source was built with VS2015, so you need VS2015 or later to open it.

Open the solution in Visual Studio.

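Next, add a WinForms project with a button for choosing an image, a PictureBox for displaying it, and a text box for the recognition result.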

Clicking "Select an image to recognize" opens an image, runs the OCR, and displays the result. It is fairly simple; the source is as follows:

using System;
using System.Collections.Generic;
using System.ComponentModel;
using System.Data;
using System.Drawing;
using System.Linq;
using System.Text;
using System.Threading.Tasks;
using System.Windows.Forms;
using Tesseract;

namespace TesseractDemo
{
    public partial class Form1 : Form
    {
        public Form1()
        {
            InitializeComponent();
        }
        // Pick an image and run OCR on it
        private void btnRec_Click(object sender, EventArgs e)
        {
            //openFileDialog1.Filter = "";
            if (openFileDialog1.ShowDialog() == DialogResult.OK)
            {
                var imgPath = openFileDialog1.FileName;
                pictureBox1.Image=Image.FromFile(imgPath);
                string strResult = ImageToText(imgPath);
                if (string.IsNullOrEmpty(strResult))
                {
                    txtResult.Text = "Unable to recognize any text";
                }
                else
                {
                    txtResult.Text = strResult;
                }
            }
        }
        // Run OCR with Tesseract
        public string ImageToText(string imgPath)
        {
            using (var engine = new TesseractEngine("tessdata", "eng", EngineMode.Default))
            {
                using (var img = Pix.LoadFromFile(imgPath))
                {
                    using (var page = engine.Process(img))
                    {
                        return page.GetText();
                    }
                }
            }
        }
    }
}
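When tuning input images it can help to see how sure the engine is about its output. The following is a minimal sketch of a variant of `ImageToText` that also returns the mean confidence. The class and method names are illustrative (not from the original project), and it assumes the installed wrapper version exposes `Page.GetMeanConfidence()`, which reports a value between 0 and 1.

```csharp
using Tesseract;

public static class OcrHelper
{
    // Hypothetical helper: returns the recognized text and, through the out
    // parameter, the engine's mean confidence for the page (0.0 to 1.0).
    public static string ImageToTextWithConfidence(string imgPath, out float confidence)
    {
        using (var engine = new TesseractEngine("tessdata", "eng", EngineMode.Default))
        using (var img = Pix.LoadFromFile(imgPath))
        using (var page = engine.Process(img))
        {
            confidence = page.GetMeanConfidence();
            return page.GetText();
        }
    }
}
```

A caller could, for example, show "Unable to recognize any text" when the confidence falls below a threshold chosen by experiment, rather than only when the result is empty.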

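One thing to note: the Tesseract language data must be downloaded separately. Either include it in the project and set "Copy to Output Directory" to "Copy always", or put the tessdata folder directly into the program's output directory (bin\debug).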

`eng` stands for English. To recognize text in other languages, you need to download the corresponding language data yourself:

Tesseract has unicode (UTF-8) support, and can recognize more than 100 languages "out of the box".

In other words, the library can recognize more than 100 languages.

The language data can be downloaded from: https://github.com/tesseract-ocr/tessdata
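A missing language file usually shows up only as an engine initialization failure, so a small check before constructing the engine can make the cause clearer. The sketch below uses only `System.IO`. The class and method names are hypothetical. It assumes that language files are named `<lang>.traineddata` and that several languages are combined with `+` (as in `eng+chi_sim`), which is Tesseract's own convention.

```csharp
using System;
using System.IO;
using System.Linq;

public static class TessdataCheck
{
    // Returns the languages from a "+"-separated language string
    // (e.g. "eng+chi_sim") whose <lang>.traineddata file is not
    // present in the given tessdata directory.
    public static string[] MissingLanguages(string tessdataDir, string languages)
    {
        return languages
            .Split(new[] { '+' }, StringSplitOptions.RemoveEmptyEntries)
            .Where(lang => !File.Exists(Path.Combine(tessdataDir, lang + ".traineddata")))
            .ToArray();
    }
}
```

Calling `TessdataCheck.MissingLanguages("tessdata", "eng+chi_sim")` before `new TesseractEngine("tessdata", "eng+chi_sim", EngineMode.Default)` lets the program report exactly which files still need to be downloaded.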

Denoising with OpenCvSharp first and then running OCR:

        // Additional usings required: System.IO, System.Net, OpenCvSharp

        // Denoise with OpenCV, then run OCR
        private void button3_Click(object sender, EventArgs e)
        {
            // Download an image from the web
            string imgUrl = "https://service.cheshi.com/user/validate/validatev3.php";
            MemoryStream ms = ReadImgFromWeb(imgUrl);
            Image img = Image.FromStream(ms);
            pictureBox1.Image = img;

            // Denoise: decode as grayscale, then binarize
            ms.Position = 0; // rewind, since Image.FromStream has already read from the stream
            Mat simg = Mat.FromStream(ms, ImreadModes.Grayscale);
            Cv2.ImShow("Input Image", simg);
            // Thresholding; a suitable threshold value can be found with a visual tuning tool
            Mat ThresholdImg = simg.Threshold(29, 255, ThresholdTypes.Binary);
            Cv2.ImShow("Threshold", ThresholdImg);
            Cv2.ImWrite("d:\\img.png", ThresholdImg);

            textBox1.Text = ImageToText("d:\\img.png");
        }

        /// <summary>
        /// Downloads an image from the web into a memory stream.
        /// </summary>
        /// <param name="Url">Address of the image</param>
        public MemoryStream ReadImgFromWeb(string Url)
        {
            HttpWebRequest request = (HttpWebRequest)WebRequest.Create(Url);
            request.Credentials = CredentialCache.DefaultCredentials; // use the default credentials
            request.UserAgent = "Microsoft Internet Explorer";
            MemoryStream ms = new MemoryStream();
            // Dispose the response and its stream once the data has been copied
            using (WebResponse response = request.GetResponse())
            using (Stream s = response.GetResponseStream())
            {
                s.CopyTo(ms);
            }
            ms.Seek(0, SeekOrigin.Begin);
            return ms;
        }
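The listing above passes the thresholded image to Tesseract through a file at a fixed path on drive D. As a possible alternative, the image can be encoded in memory and handed to the engine directly. This is only a sketch: the class and method names are hypothetical, and it assumes the installed package versions provide `Mat.ToBytes` (OpenCvSharp) and `Pix.LoadFromMemory` (the Tesseract wrapper).

```csharp
using OpenCvSharp;
using Tesseract;

public static class InMemoryOcr
{
    // Binarizes a grayscale Mat and runs OCR on the result without
    // writing an intermediate file to disk.
    public static string ThresholdAndRecognize(Mat gray, double thresh)
    {
        using (Mat binary = gray.Threshold(thresh, 255, ThresholdTypes.Binary))
        {
            // Encode the binarized image as PNG bytes in memory
            byte[] png = binary.ToBytes(".png");

            using (var engine = new TesseractEngine("tessdata", "eng", EngineMode.Default))
            using (var pix = Pix.LoadFromMemory(png))
            using (var page = engine.Process(pix))
            {
                return page.GetText();
            }
        }
    }
}
```

In the click handler this would replace the `Cv2.ImWrite` and `ImageToText` calls with `textBox1.Text = InMemoryOcr.ThresholdAndRecognize(simg, 29);`.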

Install the OpenCvSharp 3.0 library yourself through NuGet; for reference, see http://www.javashuo.com/article/p-bjpqrfoj-dc.html
