Forty Years of Research in Character and Document Recognition
--- An Industrial Perspective
Source: http://www.sciencedirect.com/science/article/pii/S0031320308000964
Article history:
Received 15 February 2008
Received in revised form 10 March 2008
Accepted 11 March 2008
Abstract: This paper briefly reviews the technical advances in character and document recognition over the past 40 years, touching on representative developments of each decade. It then describes the development of the key techniques for Kanji (Chinese character) recognition. A large part of the paper discusses robustness design principles, which have proven to be an effective way to solve complex problems such as postal address recognition; they include the hypothesis-driven principle, the deferred decision/multiple-hypotheses principle, the information integration principle, the alternative solutions principle, and the perturbation principle. Finally, future prospects, the long-tail phenomenon, and new applications are discussed.
© 2008 Elsevier Ltd. All rights reserved.
Keywords: OCR; Character recognition; Handwriting recognition; Kanji recognition; Postal address recognition; Robustness; Robustness design; Information integration; Hypothesis-driven approach; Digital pen
1. Introduction
This paper presents an industrial view of character and document recognition technology, based on material presented at ICDAR [1]. Commercial optical character readers (OCRs) emerged in the 1950s, and since then advances in character and document recognition have provided products and systems that meet industrial and commercial needs. At the same time, the profits from businesses built on this technology have been reinvested in research and development of more advanced technology. We can observe a virtuous cycle here: new technologies have enabled new applications, and the new applications have supported the development of better technology. Character and document recognition has been a very successful area of pattern recognition.
In the last forty years, the main business and industrial applications of character and document recognition have been form reading, bank check reading, and postal address reading. By supporting these applications, recognition capability has expanded along multiple dimensions: mode of writing, scripts, types of documents, and so on. The recognizable modes of writing are machine printing, handprinting, and script handwriting. Recognizable scripts started with Arabic numerals and expanded to the Latin alphabet, Japanese Katakana syllabic characters, Kanji (the Japanese version of Chinese characters), Chinese characters, and Hangul characters.
Work is now being done to make Indian and Arabic scripts readable. Today's OCRs can read many kinds of paper forms, including bank checks, postcards, envelopes, book pages, and business cards. Typeface standards such as the OCR-A and OCR-B fonts made OCRs reliable enough even in the early stages. In the same vein, specially designed OCR forms simplified the segmentation problem and made handprinted characters readable even with immature recognition technology. Today's OCRs are successfully used to read almost any type of font as well as freely handwritten characters.
The field of character and document recognition has not always had smooth sailing; it has twice been shaken by waves of new digital technologies that threatened to diminish the role of OCR. The first such wave was office automation in the early 1980s. From then on, most information seemed destined to be "born digital", potentially reducing the demand for OCRs, and some researchers were pessimistic about the future. However, it turned out that sales of OCR products in Japan, for example, peaked in the 1980s. Ironically, this was due to the promoted introduction of office computers, and it is well known that the use of paper has kept increasing.
We are now facing the second wave. IT and Web technologies may have a different impact on OCR. Many kinds of applications can now be completed on the Web, and information can be shared around the world in an instant. However, it is still unknown whether the demand for character and document recognition will decrease, or whether new applications requiring more advanced technology will be created. Search engines have become ubiquitous and are expanding their reach into image documents, photographs, and videos. People are re-evaluating the importance of handwriting and trying to integrate it into the digital world, so paper is still not going to disappear. Mobile devices with micro cameras now have CPUs capable of real-time recognition. The future prospects of these developments are discussed here.
2. Brief historical review
2.1. Overview
The first practical OCRs appeared in the United States in the 1950s, the same decade as the first commercial computer, UNIVAC. Since then, each decade has seen advances in OCR technology. In the early 1960s, IBM produced its first optical readers, the IBM 1418 (1960) and the IBM 1428 (1962), which could read printed numerals and handprinted numerals, respectively. One of the models of that era could read 200 printed fonts and was used as an input device for IBM 1401 computers. Also in the 1960s, postal operations were automated using letter-sorting machines equipped with OCRs, which for the first time automatically read postal codes to determine destinations. The United States Postal Service first introduced address-reading OCRs in 1965, which began reading the city/state/ZIP line of printed envelopes [2]. In Japan, Toshiba and NEC developed handprinted-numeral OCRs for postal code recognition and put them into use in 1968 [3]. In Germany, a postal code system was introduced in 1961, the first in the world [4]; however, the first postal-code-reading letter sorter in Europe was introduced in Italy in 1973, and the first letter sorter with an automatic address reader was introduced in Germany in 1978 [5].
Japan began introducing commercial OCRs in the late 1960s. Hitachi produced its first OCR for printed alphanumerics in 1968 and its first handprinted-numeral OCR for business use in 1972. NEC developed the first OCR that could additionally read handprinted Katakana in 1976. The Japanese Ministry of International Trade and Industry (since renamed the Ministry of Economy, Trade and Industry) conducted a 10-year, 20-billion-yen national project on pattern information processing starting in 1971. Among other research topics, Toshiba worked on printed Kanji recognition and Fujitsu on handwritten character recognition. The ETL character databases, including Kanji characters, were created as part of this project and contributed to the research and development of Kanji OCRs [6]. As a by-product, the project attracted many students and researchers into the pattern recognition field. In the United States, IBM introduced a deposit processing system (IBM 3895) in 1977 that could recognize unconstrained handwritten check amounts. The author had a chance to observe it in operation at Mellon Bank in Pittsburgh in 1981; it could reportedly read about 50% of handwritten checks, with the remaining half being hand coded. The state of the art in character recognition in the 1960s and 1970s is well documented in the literature [7,8].
The 1980s saw significant advances in semiconductor technology, such as CCD image sensors, microprocessors, dynamic random access memories (DRAMs), and custom-designed LSIs. OCRs became smaller than ever, fitting on desktops (Fig. 1), and cheaper megabyte-size memories and CCD image sensors made it possible to scan whole-page images into memory for further processing, in turn enabling more advanced recognition and wider applications. For example, handwritten-numeral OCRs that could recognize touching characters were introduced for the first time in 1983, making it possible to relax physical form constraints and writing constraints. In the late 1980s, Japanese OCR vendors introduced into their product lines new OCRs that could recognize about 2400 printed and handprinted Kanji characters; these were used to read names and addresses for data entry. More detailed technology reviews are available in the literature [9,10].
The office automation boom of the 1980s, which was influential in Japan, had two features. One was Japanese-language processing by computers and Japanese word processors; the emergence of Kanji OCRs was a natural consequence of this development. The other was the optical disk used as computer storage, developed and put into use in the early 1980s. A typical application was the patent automation systems in the United States and Japan that stored images of patent specification documents. The Japanese patent office system stored approximately 50 million documents, or 200 million digitized pages, on 12-inch optical disks; each disk could store 7 GB of data, the equivalent of 200,000 digitized pages, and the system used 80 Hitachi optical disk units and 80 optical library units. These systems can be considered among the first digital libraries. This kind of new computer application directly and indirectly encouraged studies on document understanding and document layout analysis in Japan. More importantly, it was in this decade that documents became the focus of computer processing for the first time.
The changes in the 1990s were driven by the improved performance of UNIX workstations and then of personal computers. Although scanning and image preprocessing were still done in hardware, a major part of recognition came to be implemented in software on general-purpose computers. This meant that programming languages such as C and C++ could be used to code recognition algorithms, allowing more engineers to develop more sophisticated algorithms and expanding the research community to include academia. During this decade, commercial OCR software packages running on PCs also appeared on the market. Techniques for recognizing freely handwritten characters were studied extensively and successfully applied to bank check readers and postal address readers, and advanced layout analysis techniques enabled recognition of a wider variety of business forms. Research institutions specializing in this field, such as CENPARMI, led by Prof. Suen, and CEDAR, led by Prof. Srihari and Prof. Govindaraju, contributed to these advances. New high-tech vendors appeared, including A2iA, started in France by the late Prof. Simon [11], and Parascript, started in Russia to do business in the United States. In Japan, the Japanese Postal Ministry conducted the third-generation postal automation project between 1994 and 1996, in which Toshiba, NEC, and Hitachi joined to develop postal address recognition systems capable of carrier sequence sorting. This project brought significant advances in Japanese address reading.
The International Association for Pattern Recognition began holding conferences such as ICDAR, IWFHR, and DAS in the early 1990s. Many intensively studied methods have been reported at these conferences, for example artificial neural networks, hidden Markov models (HMMs), polynomial function classifiers, modified quadratic discriminant function (MQDF) classifiers [12], support vector machines (SVMs), classifier combination [13-15], information integration, and lexicon-directed character string recognition [16-19], some of which build on original ideas from the 1960s [20,21]. Most of these play key roles in today's systems. In contrast with previous decades, in which industry mostly used proprietary in-house technology, the 1990s saw important interaction between academia and industry: academics studied real technical problems and developed sophisticated theory-based methods, enabling industry to benefit from their research. Readers will find the state of the art of character recognition systems, including image preprocessing, feature extraction, pattern classification, and word recognition, well described in the literature [22].
The following subsections describe the major pre-1990s technical achievements in Kanji character classifiers, character segmentation algorithms, and linguistic processing.
2.2. Kanji character classifiers
In the 1970s there were two competing approaches to character recognition: structural analysis and template matching (the statistical approach). Commercial OCRs of the day used structural methods to read handprinted numerals, letters, and Katakana, and template matching to read printed alphanumerics. Template matching had been experimentally proven applicable to printed Kanji recognition by the late 1970s [23-26], but its applicability to handwritten (or handprinted) Kanji was in question. The problem of recognizing handwritten Kanji looked like a steep, unexplored mountain: clearly neither the structural approach nor simple template matching could conquer it alone. The former had difficulty with the huge number of topological variations caused by complex stroke structures, while the latter had difficulty with nonlinear shape variations. However, in light of previous work on handwritten numeral recognition using template matching, the latter approach seemed to have the greater chance of success [27].
The key was the concept of blurring as feature extraction, which was applied to directional features and found to be effective for recognizing handwritten Kanji [27,28]. Treating feature extraction as continuous spatial filtering showed that the optimum amount of blurring is surprisingly large. The first Hitachi OCR for reading handprinted Kanji used simple template matching based on blurred directional features, where the feature templates were four sets of 16x16 arrays of gray values. The directional feature, patented in Japan in 1979, was computed using a two-dimensional gradient to determine stroke direction (Fig. 2) and was applicable even to grayscale images [29]. Although only indirectly relevant, Hubel and Wiesel's work encouraged our view that directional features were promising [30]. Nonlinear shape normalization [31-33] and statistical classifier methods [12,34] further boosted recognition accuracy. We learned that blurring should be regarded as a means of obtaining latent dimensions (a subspace) rather than merely a way of reducing computational cost, though the effects may look similar. For example, the 8x8 mesh size used in the statistical approaches was determined by the optimum blurring parameter in light of the Shannon sampling theorem, and larger mesh sizes with the same blurring parameter did not give better recognition performance.
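To make the idea concrete, the following sketch computes gradient-based, blurred directional features. It is only a minimal illustration, not the production implementation described above: it assumes NumPy and SciPy are available, and the mesh size, number of directions, and blurring parameter are illustrative defaults. Stroke gradients are quantized into four orientation planes, each plane is Gaussian-blurred, and the planes are subsampled to an 8x8 mesh, giving an 8x8x4 = 256-dimensional feature vector.

```python
import numpy as np
from scipy.ndimage import gaussian_filter, sobel

def directional_features(img, mesh=8, n_dirs=4, sigma=3.0):
    """Blurred directional features for a gray-scale character image.

    img : 2-D float array (e.g. 64x64), dark strokes on a light background.
    Returns a (mesh * mesh * n_dirs,) feature vector.
    """
    img = img.astype(float)
    gx = sobel(img, axis=1)                  # horizontal gradient
    gy = sobel(img, axis=0)                  # vertical gradient
    mag = np.hypot(gx, gy)
    # Quantize gradient orientation into n_dirs bins (mod pi: stroke direction)
    ang = np.mod(np.arctan2(gy, gx), np.pi)
    bins = np.floor(ang / (np.pi / n_dirs)).astype(int) % n_dirs

    planes = np.zeros((n_dirs,) + img.shape)
    for d in range(n_dirs):
        planes[d][bins == d] = mag[bins == d]        # one plane per direction

    feats = []
    h, w = img.shape
    ys = np.linspace(0, h - 1, mesh).astype(int)     # subsampling grid
    xs = np.linspace(0, w - 1, mesh).astype(int)
    for d in range(n_dirs):
        blurred = gaussian_filter(planes[d], sigma)  # blurring as feature extraction
        feats.append(blurred[np.ix_(ys, xs)])
    return np.concatenate([f.ravel() for f in feats])
```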
Thorough studies by the research group led by Prof. Kimura advanced statistical quadratic classifiers [12], which were successfully applied to handwritten Kanji recognition. The basic theory had in fact long been known, but the computers of the 1970s lacked the computational power needed to apply such statistical methods. Today, the four-directional feature vector for a Kanji pattern consists of 8x8x4 elements, and the subspace obtained by statistical covariance analysis has from 100 to 140 dimensions. The 8x8 array is surprisingly (counter-intuitively) small in light of the many complex Kanji characters. Recognition accuracy for individual freely handwritten Kanji is still not high enough, however, so linguistic context such as names and addresses is used to raise overall recognition accuracy. To reduce computational cost, cluster-based two-stage classification is used to cut down the number of templates that must be matched. One of the recent advances in Kanji (and Chinese character) recognition is the reduced size of recognition engines designed especially for mobile phone applications: a compact recognition engine reported in Refs. [35,36] needs only 613 kB of memory to store the parameters for recognizing 4344 classes of printed Chinese characters.
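For reference, the modified quadratic discriminant function of Kimura et al. [12] is commonly written in the following form (reproduced here from the standard formulation for the reader's convenience; the exact variant used in the systems described above may differ):

\[
g_k(x) \;=\; \sum_{i=1}^{p} \frac{\bigl[(x-\mu_k)^{\top}\varphi_{ki}\bigr]^2}{\lambda_{ki}}
\;+\; \frac{1}{\delta_k}\Bigl(\|x-\mu_k\|^2 - \sum_{i=1}^{p} \bigl[(x-\mu_k)^{\top}\varphi_{ki}\bigr]^2\Bigr)
\;+\; \sum_{i=1}^{p} \log\lambda_{ki} \;+\; (d-p)\log\delta_k
\]

where \(\mu_k\) is the class mean, \(\lambda_{ki}\) and \(\varphi_{ki}\) are the i-th largest eigenvalue and the corresponding eigenvector of the class covariance matrix, \(p\) is the number of retained principal dimensions (100-140 here), \(d\) is the feature dimension, and \(\delta_k\) is a constant replacing the minor eigenvalues; the class with the smallest \(g_k(x)\) is selected.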
2.3. Character segmentation algorithms
In the 1960s and 1970s, a flying-spot scanner or a laser scanner with a rotating mirror was used together with a photomultiplier to convert optical signals into electrical signals. Character segmentation was usually carried out with the help of these scanning mechanisms. For example, forms for handprint reading carried marks on an edge that signaled the presence of a character line to be scanned. In addition, the locations of the writing boxes on the forms were registered beforehand, and the colors of the boxes were transparent to the scanner sensor. OCRs could therefore easily extract images containing exactly one handprinted character.
Then, in the 1980s, semiconductor sensors and memories appeared, enabling OCRs to scan and store images of whole pages. This was an epoch-making change that mattered to users because it relaxed the strict conditions on OCR form specifications, for example by allowing smaller, non-separated writing boxes. However, it required a solution to the problem of touching numerals and a change in how images are represented in memory [37]. Before this change, scanned images had been arrays of binary pixels and segmentation was pixel-based; from this time on, the binary image in memory was represented by run-length codes. The run-length representation was well suited to connected component analysis and contour following, and the connected components were processed as black objects rather than as pixels. In 1983, Hitachi produced one of the first OCRs that could segment and recognize touching handwritten numerals, based on a multiple-hypothesis segmentation-recognition method (Fig. 3). Contour shape analysis identified candidate touching points, and multiple pairs of forcedly separated patterns were fed into the classifier; by consulting the classifier's confidence values, the recognizer was able to choose the right hypothesis. This direction of change has led to forms processing whose ultimate goal is to read unknown forms, or at least forms that are not specifically designed for OCRs. It also means that users may become less careful in their writing, so OCRs have to be even more accurate for freely handwritten characters.
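A minimal sketch of the multiple-hypothesis segmentation-recognition idea described above is shown below. The names are hypothetical: the split-point detector (contour analysis) and the character classifier are assumed to exist and are passed in as functions.

```python
def recognize_touching_pair(img, find_split_candidates, classify):
    """Multiple-hypothesis segmentation-recognition for a possibly touching pair.

    find_split_candidates(img) -> list of x positions found by contour analysis
    classify(sub_img)          -> (label, confidence) from the character classifier
    Returns the best (labels, confidence) over all split hypotheses,
    including the hypothesis that the image is a single character.
    """
    hypotheses = []

    # Hypothesis 0: the pattern is a single character.
    label, conf = classify(img)
    hypotheses.append(((label,), conf))

    # One hypothesis per candidate touching point: force a split, classify both halves.
    for x in find_split_candidates(img):
        left, right = img[:, :x], img[:, x:]
        (l_lab, l_conf), (r_lab, r_conf) = classify(left), classify(right)
        hypotheses.append(((l_lab, r_lab), min(l_conf, r_conf)))

    # Defer the decision until all hypotheses are scored, then pick the most confident.
    return max(hypotheses, key=lambda h: h[1])
```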
The segmentation problem is far tougher in postal address recognition. Fig. 4 shows horizontally handwritten addresses: the width of a character varies by as much as a factor of two, and some radicals and components are themselves valid characters. As shown in Fig. 4, it is difficult to group the right components into the right character patterns when some characters are quite wide and others narrow. To resolve the grouping problem, linguistic information (address knowledge) is required in addition to geometric and similarity information. This issue is discussed in more detail in Section 3.
2.4. Integration of linguistic information
The major business use of handprinted Kanji OCRs has been reading names and addresses on application forms. In such applications, forms have separate preprinted boxes to avoid the segmentation problem, but how to achieve highly accurate word/phrase recognition is still a question.
We can utilize a priori linguistic knowledge to choose the right options from the candidate lattice and so recognize words and phrases accurately. Here, the lattice is a table in which each column carries candidate classes and each row corresponds to a character position on the sheet. If a string consists of N Kanji characters and there are K candidates for each, there are K^N possible interpretations (word recognition results), and linguistic processing consists of choosing one of them. To do this, we developed a method based on a finite state automaton as the key technique [38]. The basic idea is to throw L lexical terms at the automaton and see which terms it accepts, where the automaton model is dynamically generated from the lattice (Fig. 5). L is usually as large as several tens of thousands, but only terms whose first character appears in the first column of the lattice can be accepted; to improve accuracy, we may also restrict attention to terms whose second character appears in the second column. Such terms are fed into the automaton one by one, and the state transitions determine a path (a series of edges); the corresponding penalties are summed and associated with the input term. Passing the first edge gives a penalty of zero, and passing the last gives a penalty of 15 when K = 16. In this way, the term with the smallest penalty is determined to be the recognized word. The number of candidates for each character is adaptively controlled to be equal to or less than K, to exclude extremely unlikely candidates. This algorithm has been used successfully for address phrases, provided that the characters are reliably segmented. Marukawa et al.'s experiments showed that character recognition accuracy was raised from 90.2% to 99.7% for a lexicon of 10,828 terms, resulting in an address phrase recognition accuracy of 99.1%. Note that error occurrences are not statistically independent. Linguistic processing that also solves difficult segmentation problems (cf. Fig. 4) is discussed in Section 3.
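The following sketch is an illustrative reconstruction of this lattice-matching idea, not the system's code. In the actual method the automaton is generated dynamically from the lattice and penalties are assigned per edge; here, for simplicity, the rank of a candidate within its column serves as the per-position penalty, and the lexicon term with the smallest total penalty wins.

```python
def match_lexicon(lattice, lexicon):
    """Choose the lexical term that best matches a candidate lattice.

    lattice : list of columns; column i is an ordered list of candidate
              characters for position i (best candidate first, at most K entries).
    lexicon : iterable of terms (strings) of the same length as the lattice.
    Returns (best_term, penalty), or (None, None) if no term is accepted.
    """
    best_term, best_penalty = None, None
    for term in lexicon:
        if len(term) != len(lattice):
            continue                        # terms of the wrong length are rejected
        penalty = 0
        for ch, column in zip(term, lattice):
            if ch not in column:
                penalty = None              # a character never appears among the candidates
                break
            penalty += column.index(ch)     # rank of the candidate = per-position penalty
        if penalty is not None and (best_penalty is None or penalty < best_penalty):
            best_term, best_penalty = term, penalty
    return best_term, best_penalty

# Example: a 3-character lattice with K = 3 candidates per position.
lattice = [["東", "束", "車"], ["京", "原", "呆"], ["都", "部", "郡"]]
print(match_lexicon(lattice, ["東京都", "京都府", "東京部"]))   # -> ('東京都', 0)
```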
3. Robustness design to deal with uncertainty and variability
Postal address recognition was an ideal application for researchers in the sense that it presented many technical challenges; at the same time, the innovation was an expected one for post office automation, and the investments really paid off. In the 1990s, R&D projects were conducted in the United States, Europe, and Japan to develop address readers that could recognize freely handwritten and printed full addresses. These were intended to automate carrier sequence sorting, a tedious task for postal workers. The recognition task was to identify an exact delivery point by recognizing the full destination address, including street and apartment numbers; in Japan the problem is to identify one of 40,000,000 address points. In this section, the main issues of robustness design for dealing with uncertainty and variability are discussed based on the experience of the author's team [39,40].
Japanese address recognition is a difficult task, as shown in Fig. 6. The read rates for printed and handwritten mail are higher than 90% and 70%, respectively. Images of rejected mail pieces are sent to video coding stations where human operators enter the address information. The results of automatic recognition and human coding are transformed into address codes, which are sprayed onto the corresponding mail pieces as they run through the sorting machine. After the address codes are mapped to numbers representing a carrier sequence, the mail pieces can be sorted into sequence using the two-pass radix sort method.
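As a small illustration of the two-pass radix sort mentioned above (a generic sketch, not the sorter's control code), mail pieces tagged with a carrier-sequence number are distributed first by the low-order digit group and then by the high-order digit group; because each pass is stable, the second pass leaves the pieces in full sequence order.

```python
def two_pass_radix_sort(mail_pieces, bins=1000):
    """Sort mail pieces by carrier-sequence number in two stable passes.

    mail_pieces : list of (sequence_number, piece_id); sequence_number < bins * bins.
    """
    def one_pass(items, key):
        buckets = [[] for _ in range(bins)]
        for item in items:
            buckets[key(item)].append(item)      # stable: order within a bucket is preserved
        return [item for bucket in buckets for item in bucket]

    by_low = one_pass(mail_pieces, key=lambda p: p[0] % bins)     # first pass: low digits
    return one_pass(by_low, key=lambda p: p[0] // bins)           # second pass: high digits

pieces = [(402513, "A"), (17, "B"), (402001, "C"), (17, "D")]
print(two_pass_radix_sort(pieces))   # sorted by sequence number, ties keep input order
```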
The recognition system consists of a high-speed scanner, image preprocessing hardware, and computer software that carries out layout analysis for address block location, character line segmentation, character string recognition (i.e., address phrase interpretation), character classification, and post-processing (Fig. 7). As the block diagram shows, many modules make imperfect decisions; that is, uncertainty is always involved. Algorithms for solving specific problems are susceptible to variations in the images, so the most basic questions are how to deal with uncertainty and variability and how to build robustness into the system. A more appropriate question may be how to compose such a recognition system out of small recognition modules, or how to connect those modules.
In answering these questions, it should be recognized that there are design principles that can guide researchers and engineers. We may call them robustness design principles. Table 1 lists them and gives brief explanations. In the following subsections, five such principles are discussed.
3.1. Hypothesis-driven principle
Variability means that no single solution fits all situations. A problem must therefore often be divided into a certain number of cases, with a different solution (problem solver) for each case. However, the case to which a given input belongs is unknown; the hypothesis-driven principle can be applied in such situations. The problem of Japanese address block identification is one example: there are basically six layout types, but in real life there are actually twelve, because envelopes are sometimes used upside-down. The approach we take is to choose salient features that distinguish between such cases and to evaluate the likelihood of each case based on the observed values of those salient features.
As a general framework for the hypothesis-driven approach, we call each case a hypothesis and the observed salient features evidence, and a statistical hypothesis test can be used to evaluate the likelihood. The a posteriori probability of the k-th hypothesis after observing its evidence can be computed as in Eq. (1), where H_k denotes the k-th hypothesis and e_k the feature vector observed for it. In Eq. (1), L is the likelihood ratio of hypothesis H_k against its null (complement) hypothesis and is computed as in Eq. (2), assuming statistical independence of the features. The conditional probability functions of the individual features under H_k and under its complement can be learned from training samples.
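Eqs. (1) and (2) are not reproduced in this text; a plausible reconstruction from the description above (Bayes' rule with a likelihood ratio over independent features) is:

\[
P(H_k \mid e_k) \;=\; \frac{L(e_k \mid H_k)\,P(H_k)}{L(e_k \mid H_k)\,P(H_k) + P(\bar{H}_k)} \tag{1}
\]
\[
L(e_k \mid H_k) \;=\; \prod_{i} \frac{P(e_{ki} \mid H_k)}{P(e_{ki} \mid \bar{H}_k)} \tag{2}
\]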
Therefore, observing the evidence {e_k | k = 1, ..., K} for all hypotheses makes it possible to compute L(e_k|H_k) and P(H_k|e_k) accordingly and to find the most probable hypothesis [41].
In the hypothesis-driven approach, after the candidate hypotheses have been identified, the problem solvers applicable to that kind of input are called to process it.
3.2. Deferred decision/multiple-hypotheses principle
In a complex pattern recognition system, many decisions must be made to obtain the final result. As always, no single decision is 100% accurate, so the decision-making modules cannot simply be cascaded. Each module should not commit to a decision; it should defer the decision and forward multiple hypotheses to the next module. The idea itself is simple. In postal address recognition, there can be as many functional modules as shown below:
• Line orientation detection
• Character size (large/small) determination
• Character line formation and extraction
• Address block identification
• Character type (machine-printed/handwritten) identification
• Script (Kanji/Kana) identification
• Character orientation identification
• Character segmentation
• Character classification
• Word recognition
• Phrase interpretation
• Address number recognition
• Building/room number recognition
• Recipient name recognition
• Final decision making (accept/reject/retry)
Table 1. Robustness design principles

Code | Principle | Explanation
P1 | Hypothesis-driven principle | When the type of a problem is uncertain, build hypotheses about it and test them
P2 | Deferred decision/multiple-hypotheses principle | Do not commit to a decision; forward multiple hypotheses and leave the decision to later experts
P3a | Process integration | Solve the problem as a team of experts from different fields
P3b | Combination-based integration | Combine multiple experts of the same kind into a team
P3c | Corroboration-based integration | Seek additional evidence from other input information
P4 | Alternative solutions principle | Solve the problem with multiple alternative methods
P5 | Perturbation principle | Modify the problem slightly and try again
These functional modules each generate multiple hypotheses, which are forwarded to the next module, which in turn generates multiple hypotheses. This process creates the kind of hierarchical tree of hypotheses shown in Fig. 8. The question is how to find the optimum branches to follow so as to reach the best possible answer in the shortest possible time. Among the well-known search methods, we basically use hill-climbing search with backtracking, by which we can reach the optimum solution in the shortest time. When a branch that looked optimal is rejected at a later stage because its confidence value is smaller than a preset threshold, other branches are processed. Using beam search at the later stages effectively boosts recognition accuracy, while using it in the earlier stages is too costly. Controlling the number of hypotheses generated is an important trade-off between time and accuracy, because in our case the computational time is limited to 3.7 s per piece; of course, shorter is better because it requires less computing power.
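A minimal sketch of hill-climbing search with backtracking over such a hypothesis tree is shown below. It is illustrative only: node expansion and confidence scoring are assumed to be provided by the individual modules, and the pruning threshold is a stand-in for the preset confidence threshold mentioned above.

```python
def hill_climb_with_backtracking(root, expand, is_goal, threshold):
    """Depth-first search that always follows the most confident hypothesis first.

    expand(node)  -> list of (child_node, confidence) produced by the next module
    is_goal(node) -> True when a complete interpretation has been reached
    threshold     -> children below this confidence are pruned (rejected)
    Returns the first accepted complete interpretation, or None.
    """
    stack = [root]
    while stack:
        node = stack.pop()            # the best child of the most recent expansion
        if is_goal(node):
            return node
        children = [(c, conf) for c, conf in expand(node) if conf >= threshold]
        # Push lower-confidence children first so the best child is tried next;
        # the others remain on the stack as backtracking points.
        for child, _ in sorted(children, key=lambda x: x[1]):
            stack.append(child)
    return None
```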
3.3. Information integration principle
We recognize three kinds of information integration in the character and document recognition field for attacking the uncertainty issue: (1) process integration, (2) combination-based integration, and (3) corroboration-based integration. The first, process integration, integrates two or three processes into a single problem solver; examples are segmentation-recognition methods and segmentation-recognition-interpretation methods. This approach started in the area of speech understanding back in the 1970s. The second, combination-based integration, is the approach taken in character classification and known as classifier combination or classifier ensemble [13-15]. Different classifiers, such as statistical and structural classifiers and neural networks, are combined (integrated) to deduce a single result, in the expectation that they will behave complementarily; methods such as majority voting and Dempster-Shafer approaches can be used to implement the combination. Finally, corroboration-based integration is the approach of finding additional evidence that supports a result, or of looking for multiple input sources of the same information. A good example is reading bank check amounts by recognizing both the courtesy amount (numerals) and the legal amount (the amount in words). In postal address recognition, both the postal code and the address phrase are read to obtain more accurate results. Recipient name recognition is another example of corroboration; this approach is taken when the street number cannot be recognized.
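A small sketch of combination-based integration by majority voting follows. It is a generic illustration, not tied to the systems described here: each classifier votes for a label, and total confidence breaks ties.

```python
from collections import defaultdict

def majority_vote(results):
    """Combine classifier outputs by majority voting.

    results : list of (label, confidence) pairs, one per classifier.
    Returns the label with the most votes; ties are broken by total confidence.
    """
    votes, confidence = defaultdict(int), defaultdict(float)
    for label, conf in results:
        votes[label] += 1
        confidence[label] += conf
    return max(votes, key=lambda lab: (votes[lab], confidence[lab]))

# Example: statistical, structural, and neural classifiers disagree.
print(majority_vote([("5", 0.9), ("6", 0.7), ("5", 0.6)]))   # -> '5'
```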
In postal address recognition, the most important consideration is to integrate the three processes of character segmentation, character classification, and phrase interpretation (linguistic processing). As described in the previous sections, address knowledge is required to resolve the ambiguities of segmentation, in combination with geometric information [42] and character similarity, so a simple application of the multiple-hypotheses principle is not sufficient. An approach known as the lexicon-directed or lexicon-driven approach has been developed and can be considered a hypothesis-driven approach. The approach is illustrated in Fig. 9: an input pattern is interpreted by searching for the path in the presegmentation network (Fig. 10) that best matches a path in the network representing linguistic knowledge (Fig. 11). We can say this is equivalent to searching the linguistic network for the path that best matches a path in the presegmented network [18,19]. This view of the knowledge-directed recognition process is in line with an explanation given by Simon [43]:
"When it is solving problems in semantically rich domains, a large part of the problem-solving search takes place in long-term memory and is guided by information discovered in that memory."
In our case, the long-term memory refers to the linguistic knowledge, and the short-term memory refers to the presegmented network.
We have developed several versions of such algorithms, one of which (Fig. 12) was presented by Liu et al. [19]. The recognition rate of the lexicon-driven handwritten address recognition algorithm was 83.7% with a 1.1% error rate in an experiment using 3589 actual mail pieces and a lexicon containing 111,349 address phrases. The linguistic model was represented in a TRIE structure, and the search was controlled by the beam search method. Recognition time was about 100 ms on a Pentium III/600 MHz machine.
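The sketch below is not Liu et al.'s algorithm [19] itself but a schematic reconstruction of the lexicon-driven idea, under the following assumptions: presegmentation has produced a sequence of primitive segments, a character classifier scores merged groups of segments, the lexicon is held in a trie, and the search is pruned by a beam.

```python
def build_trie(lexicon):
    """Build a character trie; the key '$' marks the end of a lexicon phrase."""
    root = {}
    for phrase in lexicon:
        node = root
        for ch in phrase:
            node = node.setdefault(ch, {})
        node["$"] = phrase
    return root

def lexicon_driven_read(segments, classify, lexicon, max_merge=3, beam=10):
    """Jointly segment and recognize a character string under lexicon guidance.

    segments : list of primitive image pieces from presegmentation
    classify(piece_group) -> dict {character: score}, higher is better
    Returns the best-scoring lexicon phrase that explains all segments, or None.
    """
    trie = build_trie(lexicon)
    # Each state: (index of next unused segment, current trie node, accumulated score)
    states = [(0, trie, 0.0)]
    best = None
    while states:
        new_states = []
        for pos, node, score in states:
            if pos == len(segments) and "$" in node:
                if best is None or score > best[1]:
                    best = (node["$"], score)
                continue
            for m in range(1, max_merge + 1):          # merge 1..max_merge segments
                if pos + m > len(segments):
                    break
                for ch, s in classify(segments[pos:pos + m]).items():
                    if ch in node:                      # expand only lexicon-consistent paths
                        new_states.append((pos + m, node[ch], score + s))
        # Beam search: keep only the most promising partial interpretations.
        states = sorted(new_states, key=lambda st: st[2], reverse=True)[:beam]
    return best
```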
3.4. Alternative solutions principle
There are many image-level problems, including touching characters, touching underlines, window shadow noise, cancellation stamps covering or touching address characters, and so on. The alternative solutions approach is to provide more than one solution to the same problem; it effectively provides solutions that complement each other. For example, the problem of touching characters may be solved with a holistic approach or with a forced-separation (dichotomizing) approach. Especially when dealing with numerals, a pair of touching numerals can be treated as one character out of 100 classes; training such holistic classifiers allows the results of the holistic and dichotomizing classifiers to be merged, producing more reliable recognition results. Another example of the alternative solutions approach is used for the window noise problem. When window noise is suspected, two problem solvers are needed: one attempts to eliminate the noise by an erosion (thinning) operation, assuming the shadow is thin or faint; the other attempts to extract the line segments that form a frame, assuming the shadow is fairly solid. The two problem solvers are used in the hope that one of them will succeed.
3.5. Perturbation principle
The perturbation principle is to modify a problem slightly when it is difficult to solve, and then try to solve it again. If pattern recognition were a continuous process, the perturbation principle would not work; in reality, however, it is often a discontinuous process, and very small modifications may change the final recognition result. The hope is that the change is from rejection to correct recognition, or from error to correct recognition. This approach was used in the 1980s for recognizing handwritten numerals with a structural method: because slight topological variations caused rejections, perturbation of parameters or of the input image improved the recognition rate. In recent years, more systematic studies have again shown the effectiveness of the approach. Input images are perturbed by various transformations, such as morphological operations (dilation/erosion) and geometric transformations (rotation, slanting, perspective, shrinking, and expanding). In Ha and Bunke's work [44], handwritten numerals were transformed in twelve ways and recognized using the framework of classifier combination; their approach recognized difficult, eccentric handwriting better than classical classifiers such as k-NN and neural networks. Incidentally, blurring is one of the image transformations but has not been applied in the context of perturbation; the blurring used in character feature extraction is not that kind of "slight transformation".
The perturbation approach has also been applied successfully to Japanese postal address recognition. Our tests of the approach achieved improvements of about 10-15 percentage points in recognition rate on average. When we set no limit on recognition time and repeated more perturbation operations in sequence, including rotational transformation, re-binarization, and some other parametric modifications, we found that 53% of the rejected images were correctly recognized, with a 12% error rate. Although the result is attractive, reducing the additional errors is a necessary step before this approach can be used. One possible way is to apply a combination scheme as Ha and Bunke did [44]: instead of taking the first recognition result after a series of rejections, multiple perturbations may be applied simultaneously, yielding one result by voting, for example. In light of ever-increasing computing power, this approach seems very promising. It should be noted that perturbation is effective not only for character classification but also for layout analysis, line extraction, character segmentation, and other intermediate decisions.
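A minimal sketch of perturbation with voting follows. It is illustrative only: it assumes a recognize(image) function that returns a label or None on rejection, uses SciPy for a few simple geometric and morphological perturbations, and the particular transforms and their parameters are arbitrary examples rather than the set used in the systems described above.

```python
from collections import Counter
from scipy.ndimage import rotate, grey_dilation, grey_erosion

def recognize_with_perturbation(img, recognize):
    """Re-recognize a rejected or doubtful image under slight perturbations and vote.

    recognize(image) -> label string, or None if the image is rejected.
    """
    perturbations = [
        lambda x: x,                                   # original image
        lambda x: rotate(x, 2, reshape=False),         # slight rotations
        lambda x: rotate(x, -2, reshape=False),
        lambda x: grey_dilation(x, size=(2, 2)),       # morphological perturbations
        lambda x: grey_erosion(x, size=(2, 2)),
    ]
    votes = Counter(label for p in perturbations
                    if (label := recognize(p(img))) is not None)
    return votes.most_common(1)[0][0] if votes else None
```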
3.6. Robustness implementation
The design principles described in the previous subsections concern the structure and algorithms of a recognition system, but the classifiers and various parameters also have to be carefully and simultaneously trained and adjusted [40]. The same is true even for specific problem-solving modules. Though individually minor, many problems emerge during the development phases, so robustness implementation is a difficult task for researchers and engineers. The following are important keys to an efficient and effective development process:
• Live samples from users' sites
• Robustness measurement using many 'bags' of test samples
• Acceleration data sets
• Sample-by-sample cause analysis
If possible, it is highly desirable to gather samples from users' sites; we call these real samples live samples. However, live samples collected in multiple sessions should not be mixed into a single sample set. It is important to choose the right occasions to capture samples, because sample characteristics vary depending on operational modes and seasonal tendencies. Without mixing the collections, we keep samples in many different 'bags'. Recognition rates (or recognition accuracy) can then be measured for each bag (data set), as shown in Fig. 13. A trick in the graph is that the data set numbers are rearranged so that the recognition rates appear in decreasing order; arranging the graph this way lets us observe the profile of recognition rates, where a steeper slope means the recognition system is less robust. In addition, if the recognition performance for a data set is very low, we can re-examine that data set in detail, since it is small, to identify the cause of the problem (the low recognition rate).
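A tiny sketch of this per-bag robustness profile is shown below. It is generic: 'bags' are simply named, non-empty lists of (image, ground truth) pairs, and recognize() is the system under test.

```python
def robustness_profile(bags, recognize):
    """Recognition rate per 'bag' of live samples, sorted in decreasing order.

    bags : dict mapping bag name -> list of (image, true_label) pairs.
    Returns a list of (bag_name, rate); a steep drop-off indicates poor robustness.
    """
    rates = []
    for name, samples in bags.items():
        correct = sum(1 for img, truth in samples if recognize(img) == truth)
        rates.append((name, correct / len(samples)))
    return sorted(rates, key=lambda r: r[1], reverse=True)
```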
Acceleration data sets are collections of samples that have been rejected or erroneously recognized by a given version of the recognizer. Every sample in these data sets can be given a unique identifier, by which the samples can be subjected to sample-by-sample cause analysis and, more importantly, by which improvements can be traced throughout the development process. If names and problem codes can be assigned to problematic situations, the non-straightforward progress resulting from the remedying processes can be managed more appropriately.
4. Future prospects
A 40-50 year overview of OCR history and of the current market may give rise to the view that the technology is almost mature. However, it is clear that the technology is still in the midst of development and remains far inferior to human cognition. From the viewpoint that the technology is mature, the current state corresponds to the long-tail part of the market (or of applications). According to this view, the "head" of the market consists of a small number of applications with huge volumes of documents to read: business form reading, bank check reading, and postal address reading. These have been investment-effective because of sufficiently heavy demand; the return on investment has almost always been assured. Of course, technological advances have elongated the head toward the tail, but the remaining tail is very long, and even the three head application areas have tail parts of their own: there are many business forms, checks, and mail pieces that are very difficult to read, and more advanced recognition techniques are undoubtedly needed. For example, small and medium-sized enterprises (SMEs) in Japan still use paper forms for bank transactions and paper income forms for reporting to local government. The number of transactions carried out by each such company is not very large, so there is little incentive for them to innovate; banks that receive the various forms from such companies therefore want more intelligent, versatile OCRs. The long-tail phenomenon applies to postal address recognition as well. The questions are whether the demand side can foresee the return on investment in proposed new products and systems, and whether scientists and engineers can convince them of that return, while the technical problems remain piecewise and diverse. These are typical long-tail questions.
Talking about the future from a different angle raises the chicken-and-egg question of need and seed, which is difficult to answer in general. From the industry's viewpoint, it seems more important to think of needs, or at least latent needs, and future needs seem subjective at least for now. The well-recognized unfilled needs of today include: (1) office document archives for e-Government, (2) handwriting as a human interface for mobile devices, (3) text in videos for video search, and (4) books and historical documents for global search. There are also two other applications: (5) text in the scene for information capture, and (6) handwritten document management for knowledge workers.
Unknown scripts and unknown languages are a big handicap for travelers in foreign countries making quick decisions on the road, in shops, at the airport, and so on. A mobile device with a digital camera, i.e., an Information Capturing Camera [45], can be an aid in such situations (Fig. 14). With a higher-performance microprocessor, text in the scene can be recognized. The technical challenges include color image processing, geometric perspective normalization, text segmentation, adaptive thresholding, unknown script recognition, language translation, and so on. Every mobile phone in Japan is equipped with a digital camera, and their microprocessors are becoming more powerful; some digital cameras now have intelligent functions that locate faces in the image about to be taken, so the question is why text recognition is still so difficult. Some mobile phones in Japan can now recognize over 4000 Kanji characters [36]. An interesting challenge is a dynamic recognition capability that ensures high recognition performance by repeatedly recognizing multiple shots of camera images without the user's conscious operation; users may try various angles and positions while aiming at the target of recognition. It can be considered interactive perturbation.
Another attractive area is the digital pen and handwritten document management tools. The act of handwriting is being reconsidered on the basis of its importance in education and knowledge work. Writing with today's digital pens can help people read, write, and remember; handwritten annotations and memos can be captured in a very natural way and incorporated into information systems. The Anoto technology is one of the advanced technologies that can digitally capture handwritten stroke data together with related data (Fig. 15). Several research groups are using this kind of digital pen to create more intelligent information management systems [46-49], aiming to manage digital-ink documents precisely. A group of researchers advocating "just-in-time" information (iJIT) developed a trial system that supports their notebooks and mixed document management [49]: their handwritten research notebooks are always kept consistent with their digital counterparts in the computer, so that members of a group can easily share information even when they are far apart. Another feature of the system is that users can write with the digital pen on printed versions of digital documents, since any document can be printed in such a form (Fig. 16); in other words, the content of the digital document is printed overlaid with the Anoto dot pattern. Handwritten strokes can therefore be captured, and users can annotate the printouts and synchronize those annotations with the corresponding documents already in the computer. The value of such a system is that the digital document in the computer carries the same annotations, which means the paper documents can be discarded without any loss of information. This concept lets users work equally well in the digital world and in the real world; it is an attempt to go beyond the myth of the paperless office [50]. When the use of such digital pens becomes widespread, handwritten character recognition, handwritten query processing, and more intelligent knowledge management capabilities will naturally be demanded. We are focusing on building information systems, and recognition technology is one means to that end; we hope to see more advanced information systems that call for more advanced recognition technology.
5. Conclusion
Forward-looking vision and underlying technologies are both keys to the future development of our technical community. Foresight indicates the value and positioning of applications; for investment in a new technology to happen, the technology must be attractive enough to draw in many people, including innovators. This is a top-down path to innovation. Underlying technologies, by contrast, drive innovation from the bottom up. The technologies discussed here are of two kinds: those that support our community from below, and our own, namely character and document recognition technology. For the first kind, we have seen how advanced semiconductor devices, high-performance computers, and better software development tools have supported recognition technology and affected our lives. They have not only enabled more advanced OCR systems but have also invited and pushed academia into this field, further contributing to advances in recognition technology. We would like to see this virtuous cycle continue.
Acknowledgments:
The author thanks the members of the postal address recognition system development team at Hitachi: H. Sako, K. Marukawa, M. Koga, H. Ogata, H. Shinjo, K. Nakashima, H. Ikeda, T. Kagehiro, R. Mine, N. Furukawa, and T. Takahashi. Thanks are also due to Dr. C.-L. Liu for the extensive work he did at the Institute of Automation in Beijing, China, and to Dr. Y. Shima of Meisei University, Tokyo, for his work in our laboratory. The author also thanks Dr. G. for his very valuable comments and Dr. U. Miletzki of Siemens ElectroCom for information on the developments there.
References:
[1] H. Fujisawa, A view on the past and future of character and document recognition, in: Proceedings of the Seventh ICDAR, Curitiba, Brazil, September 2007, pp. 3--7.
[2] The United States Postal Service: An American History 1775--2002, Government Relations, United States Postal Service, 2003.
[3] H. Genchi, K. Mori, S. Watanabe, S. Katsuragi, Recognition of handwritten numeral characters for automatic letter sorting, Proc. IEEE 56 (1968) 1292--1301.
[4] W. Schaaf, G. Ohling, et al., Recognizing the Essentials, Siemens ElectroCom, Konstanz, 1997.
[5] http://www.industry.siemens.com/postal-automation/usa.
[6] K. Yamamoto, H. Yamada, T. Saito, I. Sakaga, Recognition of handprinted characters in the first level of JIS Chinese characters, in: Proceedings of the Eighth ICPR, 1986, pp. 570--572.
[7] J.R. Ullmann, Pattern Recognition Techniques, Butterworths, London, 1973.
[8] C.Y. Suen, M. Berthod, S. Mori, Automatic recognition of handprinted characters---The state of art, Proc. IEEE 68 (4) (1980) 469--487.
[9] S. Mori, C.Y. Suen, K. Yamamoto, Historical review of OCR research and development, Proc. IEEE 80 (7) (1992) 1029--1058.
[10] G. Nagy, At the frontiers of OCR, Proc. IEEE 80 (7) (1992) 1093--1100.
[11] J.C. Simon, Off-line cursive word recognition, Proc. IEEE 80 (7) (1992) 1150--1161.
[12] F. Kimura, K. Takashina, S. Tsuruoka, Y. Miyake, Modified quadratic discriminant functions and the application to Chinese character recognition, IEEE Trans. PAMI 9 (1) (1987) 149--153.
[13] C.Y. Suen, C. Nadal, T.A. Mai, R. Legault, L. Lam, Recognition of totally unconstrained handwritten numerals based on the concept of multiple experts, in: Proceedings of the First IWFHR, Montreal, Canada, 1990, pp. 131--143.
[14] L. Xu, A. Krzyzak, C.Y. Suen, Methods of combining multiple classifiers and their applications to handwriting recognition, IEEE Trans. SMC 22 (3) (1992) 418--435.
[15] T.K. Ho, J.J. Hull, S.N. Srihari, Decision combination in multiple classifier systems, IEEE Trans. PAMI 16 (1) (1994) 66--75.
[16] F. Kimura, M. Sridhar, Z. Chen, Improvements of lexicon-directed algorithm for recognition of unconstrained handwritten words, in: Proceedings of the Second ICDAR, Tsukuba, Japan, October 1993, pp. 18--22.
[17] C.H. Chen, Lexicon-driven word recognition, in: Proceedings of the Third ICDAR, Montreal, Canada, August 1995, pp. 919--922.
[18] M. Koga, R. Mine, H. Sako, H. Fujisawa, Lexical search approach for character-string recognition, in: Proceedings of the Third DAS, Nagano, Japan, November 1998, pp. 237--251.
[19] C.-L. Liu, M. Koga, H. Fujisawa, Lexicon-driven segmentation and recognition of handwritten character strings for Japanese address reading, IEEE Trans. PAMI 24 (11) (2002) 1425--1437.
[20] M. Aizermann, E. Braverman, L. Rozonoer, Theoretical foundations of the potential function method in pattern recognition learning, Automat. Remote Control 25 (1964) 821--837.
[21] U. Miletzki, Schürmann polynomials---roots and offsprings, in: Proceedings of the Eighth IWFHR, 2002, pp. 3--10.
[22] M. Cheriet, N. Kharma, C.-L. Liu, C.Y. Suen, Character Recognition Systems---A Guide for Students and Practitioners, John Wiley & Sons, Inc., Hoboken, NJ, 2007.
[23] R. Casey, G. Nagy, Recognition of printed Chinese characters, IEEE Trans. Electron. Comput. EC-15 (1) (1966) 91--101.
[24] S. Yamamoto, A. Nakajima, K. Nakata, Chinese character recognition by hierarchical pattern matching, in: Proceedings of the First IJCPR, Washington, DC, 1973, pp. 183--194.
[25] H. Fujisawa, Y. Nakano, Y. Kitazume, M. Yasuda, Development of a Kanji OCR: an optical Chinese character reader, in: Proceedings of the Fourth IJCPR, Kyoto, November 1978, pp. 815--820.
[26] G. Nagy, Chinese character recognition: a twenty-five-year retrospective, in: Proceedings of the Ninth ICPR, 1988, pp. 163--167.
[27] M. Yasuda, H. Fujisawa, An Improvement of Correlation Method for Character Recognition, vol. 10 (2), Systems, Computers, Controls, Scripta Publishing Co., 1979, pp. 29--38.
[28] H. Fujisawa, C.-L. Liu, Directional pattern matching for character recognition revisited, in: Proceedings of the Seventh ICDAR, Edinburgh, August 2003, pp. 794--798.
[29] H. Fujisawa, O. Kunisaki, Method of pattern recognition, Japanese Patent 1,520,768, granted in 1989, filed in 1979.
[30] D.H. Hubel, T.N. Wiesel, Functional architecture of macaque monkey visual cortex, Proc. R. Soc. London Ser. B 198 (1977) 1--59.
[31] J. Tsukumo, H. Tanaka, Classification of handprinted Chinese characters using non-linear normalization and correlation methods, in: Proceedings of the Ninth ICPR, Rome, Italy, 1988, pp. 168--171.
[32] C.-L. Liu, Normalization-cooperated gradient feature extraction for handwritten character recognition, IEEE Trans. PAMI 29 (6) (2007) 1465--1469.
[33] C.-L. Liu, Handwritten Chinese character recognition: effects of shape normalization and feature extraction, in: Proceedings of the Summit on Arabic and Chinese Handwriting, College Park, September 2006, pp. 23--27.
[34] K. Jain, R.P.W. Duin, J. Mao, Statistical pattern recognition: a review, IEEE Trans. PAMI 22 (1) (2000) 4--37.
[35] C.-L. Liu, R. Mine, M. Koga, Building compact classifier for large character set recognition using discriminative feature extraction, in: Proceedings of the Eighth ICDAR, Seoul, Korea, 2005, pp. 846--850.
[36] M. Koga, R. Mine, T. Kameyama, T. Takahashi, M. Yamazaki, T. Yamaguchi, Camera-based Kanji OCR for mobile-phones: practical issues, in: Proceedings of the Eighth ICDAR, Seoul, Korea, 2005, pp. 635--639.
[37] H. Fujisawa, Y. Nakano, K. Kurino, Segmentation methods for character recognition: from segmentation to document structure analysis, Proc. IEEE 80 (7) (1992) 1079--1092.
[38] K. Marukawa, M. Koga, Y. Shima, H. Fujisawa, An error correction algorithm for handwritten Chinese character address recognition, in: Proceedings of the First ICDAR, Saint-Malo, September 1991, pp. 916--924.
[39] H. Fujisawa, How to deal with uncertainty and variability: experience and solutions, in: Proceedings of the Summit on Arabic and Chinese Handwriting, College Park, September 2006, pp. 29--39.
[40] H. Fujisawa, Robustness design of industrial strength recognition systems, in: B.B. Chaudhuri (Ed.), Digital Document Processing: Major Directions and Recent Advances, Springer, London, 2007, pp. 185--212.
[41] T. Kagehiro, H. Fujisawa, Multiple hypotheses document analysis, in: S. Marinai, H. Fujisawa (Eds.), Studies in Computational Intelligence, vol. 90, Springer, Berlin, Heidelberg, 2008, pp. 277--303.
[42] T. Kagehiro, M. Koga, H. Sako, H. Fujisawa, Segmentation of handwritten Kanji numerals integrating peripheral information by Bayesian rule, in: Proceedings of the IAPR MVA'98, Chiba, Japan, November 1998, pp. 439--442.
[43] H.A. Simon, The Sciences of the Artificial, third ed., The MIT Press, Cambridge, MA, 1998, pp. 87--88.
[44] T.M. Ha, H. Bunke, Off-line, handwritten numeral recognition by perturbation method, IEEE Trans. PAMI 19 (5) (1997) 535--539.
[45] H. Fujisawa, H. Sako, Y. Okada, S.-W. Lee, Information capturing camera and developmental issues, in: Proceedings of the Fifth ICDAR'99, Bangalore, September 1999, pp. 205--208.
[46] F. Guimbretière, Paper augmented digital documents, in: Proceedings of the ACM Symposium on User Interface Software and Technology, UIST2003, Vancouver, Canada, 2003, pp. 51--60.
[47] C. Liao, F. Guimbretière, PapierCraft: a command system for interactive paper, in: Proceedings of the ACM Symposium on User Interface Software and Technology, UIST2005, Seattle, USA, 2005, pp. 241--244.
[48] R. Yeh, C. Liao, S. Klemmer, F. Guimbretière, B. Lee, B. Kakaradov, J. Stamberger, A. Paepcke, ButterflyNet: a mobile capture and access system for field biology research, in: Proceedings of the International Conference on Computer-Human Interaction, CHI2006, Montreal, Canada, 2006, pp. 571--580.
[49] H. Ikeda, K. Konishi, N. Furukawa, iJITinOffice: desktop environment enabling integration of paper and electronic documents, in: Proceedings of the ACM Symposium on User Interface Software and Technology, UIST2006, Montreux, Switzerland, October 2006.
[50] A.J. Sellen, R.H. Harper, The Myth of the Paperless Office, The MIT Press, Cambridge, MA, 2002.
附錄:英文原文:
1.Introduction
Presentedis an industrial view on the character and document recognitiontechnology, based on some material presented at ICDAR [1]. Commercialoptical character readers (OCRs) emerged in the 1950s, and sincethen, the character and document recognition technology has advancedsignificantly providing products and systems to meet industrial andcommercial needs throughout the development process. At the sametime, the profits from businesses based on this technology have beeninvested in research and development of more advanced technology. Wecan observe here a virtuous cycle. New technologies have enabled newapplications, and the new applications have supported the developmentof better technology. Character and document recognition has been avery successful area of pattern recognition. The main business andindustrial applications of character and document recognition in thelast forty years have been in form reading, bank check reading andpostal address reading. By supporting these applications, recognitioncapability has expanded in multiple dimensions: mode of writing,scripts, types of documents, and so on. The recognizable modes ofwriting are machine-printing, handprint-ing, and script handwriting.Recognizable scripts started with Arabic numerals and expanded to theLatin alphabets, Japanese Katakana syllabic characters, Kanji(Japanese version of Chinese) characters, Chinese characters, andHangul characters. Work is now being done
tomake Indian and Arabic scripts readable. Many different kinds ofpaper forms can be read by today's OCRs, including bank checks, postcards, envelopes, book pages, and business cards. Typeface standardssuch as OCR-A and OCR-B fonts have contributed to making OCRsreliable enough even in the early stages. In the same context,specially designed OCR forms have simplified the segmentation problemand made handprinted character OCRs readable even by immaturerecognition technology. Today's OCRs are successfully used to readany type of fonts and freely handwritten characters. The field ofcharacter and document recognition has not always been peaceful. Ithas twice been disturbed by waves of new digital technologies thatthreatened to diminish the role of OCR technology. The first suchwave was that of office automation in the early 1980s. Starting then,most of information seemed to be going to be 'born digital',potentially diminishing demand for OCRs, and some researchers werepessimistic about the future. However, it turned out that the salesof OCRs in Japan, for example, peaked in the 1980s. This wasironically due to the promoted introduction of office computers. Itis well known that the use of paper has kept increasing. We are nowfacing the second wave. IT and Web technologies might have adifferent impact. Many kinds of applications can now be completed onthe Web. Information can flow around the world in an instant.However, it is still not known whether the demand for character anddocument recognition will decrease or whether new applicationsrequiring more advanced technology will be created. Search engineshave become ubiquitous and are expanding their reach into the areasof image documents, photographs, and videos. People are re-evaluatingthe importance of handwriting and trying to integrate it into thedigital world. It seems that paper is still not going to disappear.Mobile devices with micro cameras now have CPUs capable of real-timerecognition. The future prospects of these developments are discussedhere.
2.Brief historical view
2.1.Overview
Thefirst practical OCR appeared in the United States in the 1950s, inthe same decade as the first commercial computer UNIVAC. Since then,each decade has seen advances in OCR technology. In the early 1960s,IBM produced their first models of optical readers, the IBM 1418(1960) and IBM 1428 (1962), which were, respectively, capable ofreading printed numerals and handprinted numerals. One of the modelsof those days could read 200 printed document fonts and were used asinput apparatus for IBM 1401 computers. Also in the 1960s, postaloperations were automated using mechanical letter sorters with OCRs,which for the first time automatically read postal codes to determinedestinations. The United States Postal Service first introducedaddress-reading OCRs, which in 1965 began reading the city/state/ZIPline of printed envelopes [2]. In Japan, Toshiba and NEC developedhandprinted numeral OCRs for postal code recognition,
andput them into use in 1968 [3]. In Germany, a postal code system wasintroduced for the first time in the world in 1961 [4]. However, thefirst postal code reading letter sorter in Europe was introduced inItaly in 1973, and the first letter sorter with an automatic addressreader was introduced in Germany in 1978 [5].
Japanstarted to introduce commercial OCRs in the late 1960s.Hitachiproduced their first OCR for printed alpha numerics in 1968and thefirst handprinted numeral OCR for business use in 1972. NEC developedthe first OCR that could read handprinted Katakana in addition in1976. The Japanese Ministry of International Trade and Industry(since renamed the Ministry of Economy, Trade and Industry) conducteda 10-year 20 billion-yen national project on pattern in-formationprocessing starting in 1971. Among other research topics, Toshibaworked on printed Kanji recognition, and Fujitsu worked onhandwritten character recognition. The ETL character databasesincluding Kanji characters were created as part of this project,which contributed to research and development of Kanji OCRs [6].Asaby product, the project attracted many students and researchersinto the pattern recognition area. In the United States, IBMintroduced a deposit processing system (IBM 3895) in 1977, which wasable to recognize unconstrained handwritten check amounts. The authorhad a chance to observe it in operation at Mellon Bank in Pittsburghin 1981, and it could reportedly read about 50% of handwritten checkswith the remaining half being hand coded. The state of the art incharacter recognition in the 1960s and 1970s is well documented inthe literature [7,8].
The1980s witnessed significant technological advances in semi-conductordevices such as CCD image sensors, microprocessors, dynamic randomaccess memories (DRAMs), and custom-designed LSIs. For example, OCRsbecame smaller than ever fitting on desktops (Fig. 1). Then cheapermegabyte-size memories and CCD image sensors enabled whole-pageimages to be scanned into memory for further processing, in turnenabling more advanced recognition and
widerapplications. For example, handwritten numeral OCRs that couldrecognize touching characters were introduced for the first time in1983, making it possible to relax physical form constraints andwriting constraints. In the late 1980s, Japanese vendors of OCRsintroduced into their product lines new OCRs that could recognizeabout 2400 printed and handprinted Kanji characters. These were usedto read names and addresses for data entry. More detailed tech-nologyreviews are available in the literature [9,10].
Theoffice automation boom of the 1980s, which was influential in Japan,had two features. One was Japanese language processing by computersand Japanese word processors. Emergence of Kanji OCRs was a naturalconsequence of this development. The other feature was optical disksused as computer storage systems, which were developed and put intouse in the early 1980s. A typical application was patent automationsystems in the United States and Japan that stored images of patentspecification documents. The Japanese patent office system thenstored approximately 50 million documents or 200 million digitizedpages on 12-in optical disks. Each disk could store 7GB of data, theequivalent of 200000 digitized pages. The sys-tem used 80 Hitachioptical disk units and 80 optical library units. These systems can beconsidered one of the first digital libraries. This kind of newcomputer applications directly and indirectly encouraged studies ondocument understanding and document lay-out analysis in Japan. Moreimportantly, it was in this decade that documents became the focus ofcomputer processing for the first time.
Thechanges in the 1990s were due to the upgraded performances of UNIXworkstations and then personal computers. Though scanning and imagepreprocessing were still done by the hardware, a major part ofrecognition was implemented by the software on general-purposecomputers. The implication of this was that programming languageslike c and c ++ could be used to code recognition algorithms,allowing more engineers to develop more complicated algorithms andexpanding the research community to include academia. During thisdecade, commercial software OCR packages running on PCs also appearedon the market. Techniques for recognizing freely handwrittencharacters were extensively studied, and successfully applied to bankcheck readers and postal address readers. Advanced layout analysistechniques enabled recognition of wider varieties of business forms.Research institutions specializing in this field such as CENPARMI,led by Prof. Suen and CEDAR, led by Prof. Srihari and Prof.Govindaraju contributed to these advances. New high-tech vendorsappeared, including A2iA, which was started by the late Prof. Simonin France [11] , and Parascript, which was started in Russia to do
businessin the United States. In Japan, the Japanese Postal Ministryconducted the third generation postal automation project between 1994and 1996, in which Toshiba, NEC, and Hitachi joined to develop postaladdress recognition systems that could sort sequences. This projectenabled significant advances in Japanese address reading.
TheInternational Association for Pattern Recognition began holdingconferences such as ICDAR, IWFHR, and DAS in the early 1990s. Manyintensively studied methods have been reported in these conferences.Examples are artificial neural networks, hidden Markov models (HMMs),polynomial function classifiers, modified quadratic discriminantfunction (MQDF) classifiers [12] , support vector machines (SVMs),classifier combination [13--15] , information integration, andlexicon-directed character string recognition [16--19] , some ofwhich are based on original ideas from the 1960s [20,21]. Most ofthese play key roles in today's systems. In contrast with previousdecades, in which industry mostly used proprietary in-housetechnology, the 1990s witnessed important interactions betweenacademia and industry. Academics studied real technical problems anddeveloped sophisticated theory-based methods, enabling industry tobenefit from their research. Readers may find the state of the art ofcharacter recognition systems, including image preprocessing, featureextraction, pattern classification, and word recognition, welldescribed in the literature[22] .
Inthe following subsections, major pre-1990s technical achievements inthe area of Kanji character classifiers, character segmentationalgorithms, and linguistic processing are described.
2.2.Kanji character classifiers
Inthe 1970s, there were two competing approaches to characterrecognition, structural analysis and template matching (or thestatistical approach). Contemporary commercial OCRs were usingstructural methods to read handprinted alphanumerics and Katakana,and template matching methods to read printed alphanumerics.Tem-plate matching methods had been experimentally proven to beapplicable to printed Kanji recognition by the late 1970s[23--26] ,but their applicability to handwritten (or handprinted) Kanji was inquestion. The problem of recognizing handwritten Kanji seemed like asteep, unexplored mountain. It was clear that neither the structuralnor the simple template matching approaches could conquer it alone.The former had difficulty with the huge number of topologicalvariations due to complex stroke structures, while the latter haddifficulty with nonlinear shape variations. However, in light ofprevious work on handwritten numeral recognition using a templatematch-ing approach, the latter approach seemed to have a greaterchance of success [27] .
Thekey was the concept of blurring as feature extraction, which wasapplied to directional features and found to be effective inrecognizing handwritten Kanji [27,28]. The introduction of continuousspatial feature extraction made the optimum amount of blurringsurprisingly large. The first Hitachi OCR for reading handprintedKanji used simple template matching based on blurred directionalfeatures where the feature templates were four sets of 16× 16 arraysof gray values. The directional feature, which was patented in Japanin 1979, was computed using a two-dimensional gradient to determinestroke direction ( Fig. 2) and was even applicable to grayscaleimages [29] . Although it was only indirectly relevant, Hubel andWiesel's work encouraged our view that the directional feature waspromising [30] . Nonlinear shape normalization [31--33] andstatistical classifier methods [12,34] boosted recognition accuracy.We learned that blurring should be considered as a means of obtaininglatent dimensions (subspace) rather than as a means of reducingcomputational cost, though the effects might seem similar. Forexample, the mesh size of 8× 8 used in statistical approaches wasdetermined by the optimum blurring parameter in light of the Shannonsampling theorem, and bigger mesh sizes with the same blurringparameter did not give better recognition performances.
Thethorough studies of the research group led by Prof. Kimuracontributed to advancing statistical quadratic classifiers [12] ,which were successfully applied to handwritten Kanji recognition.Actually, the basic theory had been known, but computers of the1970s did not have sufficient computational power to be applied tostudies of such statistical approaches. Today, the four-directionalfeature vector for Kanji patterns consists of 8 × 8 × 4 elements,and the subspace obtained by statistical covariant analysis is offrom 100 to 140 dimensions. However, the size of the 8× 8 array issurprisingly (counter-intuitively) small in light of many complexKanji characters. Recognition accuracy for individual freelyhandwritten Kanji is not yet high enough, however. Therefore,linguistic context such as name and address is used to enhance totalrecognition accuracy. To reduce computational cost, cluster-basedtwo-stage classification is used
toreduce the number of templates that must be matched. One of therecent advances in Kanji (and Chinese character) recognition is thereduced size of recognition engines designed especially for mobilephone applications. A compact recognition engine reported in Refs.[35,36] requires only 613kB of memory to store parameters torecognize 4344 classes of printed Chinese characters.
2.3.Character segmentation algorithms
Inthe 1960s and 1970s, a flying-spot scanner or a laser scanner with arotated mirror was used together with a photo-multiplier to convertoptical signals into electrical signals. Character segmentation wasusually carried out with the help of these kinds of scanningmechanisms. For example, forms for handprint reading used marks on anedge that signaled the presence of a character line to be scanned. Inaddition, the locations of writing boxes on the forms were registeredbeforehand, and the colors of the boxes were trans-parent to thescanner sensor. Therefore, OCRs could easily extract images thatcontained exactly one single handprinted character.
Then,in the 1980s, semiconductor sensors and memories appeared, enablingOCRs to scan and store images of whole pages. This was an epochmaking change that was significant to users because it relaxed strictconditions on OCR form specifications, for example, by enabling themto use smaller non-separated writing boxes. However, it required asolution of the problem of touching numerals and change in how imagesare represented in memory[37] . Be-fore this change, scanned imageshad been arrays of binary pixels, and segmentation was pixel-based,but from this time on, the bi-nary image in the memory wasrepresented by run-length codes. The
runlengthrepresentation was suited to conducting connected component analysisand contour following. The connected components were processed asblack objects rather than as pixels. In 1983, Hitachi produced one ofthe first OCRs that could segment and recognize touching handwrittennumerals based on a multiple-hypothesis segmentation--recognitionmethod (Fig. 3 ). Contour shape analysis was able to identifycandidates of touching points, and multiple pairs
offorcedly separated patterns were fed into the classifier. Byconsulting the confidence values from the classifier, the recognizerwas able to choose the right hypothesis. This direction of changeshas led us to forms processing whose ultimate goal is to read unknownforms, or at least those forms that are not specifically designed forOCRs. However, this means that users might become less careful intheir writing, so OCRs have to be more accurate for freelyhandwritten characters as well.
Thesegmentation problem was far tougher in postal address recognition.Fig. 4 shows horizontally handwritten addresses. The width of acharacter varies by as much as a factor of two, and some of theradicals and components are also valid characters. As shown in Fig. 4, it is difficult to group the right components to form the rightcharacter patterns, where some characters are quite wide and othersnarrow. To resolve the grouping problem, linguistic information (oraddress knowledge) is required in addition to geometric andsimilarity information. This issue will be discussed in more detailin Section 3.
2.4.Integration of linguistic information
Majorbusiness uses of handprinted Kanji OCRs have been the reading ofnames and addresses in application forms. In such applications, toavoid the segmentation problem, forms have separate preprinted fixedboxes, but how to achieve highly accurate word/phrase recognition isstill a question. We can utilize a priori linguistic knowledge tochoose the right options from the candidate lattice to accuratelyrecognize words and phrases. Here, the lattice is a table in whicheach column carries candidate classes, and each row corresponds tocharacters on the sheet. If a string consists of N Kanji charactersand there are K candidates for each, there are KN possibleinterpretations (or word recognition results). The linguisticprocessing consists of choosing one of the many possibleinterpretations. To do this, we developed a method based on a finitestate automaton as a key technique [38] . The basic idea is to throwL lexical terms at the automaton, and see which terms the automatonaccepts, where the model of the automaton is dynamically generatedfrom the lattice (Fig. 5). L is usually a number as big as severaltens of thousands, but only the terms whose first character appearsin the first column of the lattice are to be accepted. To improveaccuracy, we may consider the terms whose second character appears inthe second column of the lattice as well. Such terms are fed into theautomaton one by one, and the state transitions determine a path (aseries of edges). Then the corresponding penalties are summed up andassociated with the input term. Passing the first edge gives apenalty of zero, and passing the last gives a penalty of 15 when K =16. In this way, a term with the smallest penalty is deter-mined tobe the recognized word. The number of candidates for each characteris adaptively controlled to be equal to or less than K ,toexcludeextremely unlikely word candidates. This algorithm has been usedsuccessfully for address phrases, provided that the characters arereliably segmented. Marukawa et al.'s experiments showed thatcharacter recognition accuracy was raised to 99.7% from 90.2% for alexicon with 10828 terms, resulting in address phrase recognitionaccuracy of 99.1%. Here, we can note that error occurrences are notstatistically independent. Linguistic processing that solvesdifficult segmentation problems (cf. Fig. 4) is discussed in Section3.
3.Robustness design to deal with uncertainty and variability
Postaladdress recognition was an ideal application for re-searchers in thesense that it presented many technical challenges, but, at the sametime, the innovation was an expected one for post office automationand the investments really paid off. In the 1990s, R&D projectswere conducted in the United States, Europe, and Japan to developaddress readers that could recognize freely handwritten and printedfull addresses. These were intended to automate carrier
sequencesorting, a tedious task for postal workers. The recognition task wasto identify an exact delivery point by recognizing the fulldestination address including street and apartment numbers. Theproblem in Japan is to identify one of 40000000 address points. Inthis section, the main issues of robustness design intended to dealwith uncertainty and variability are discussed based on theexperience of the author's team [39,40].
Japaneseaddress recognition is a difficult task as shown in Fig. 6. The readrates for printed and handwritten mail are higher than 90% and 70%,respectively. Images of the rejected mail pieces are sent tovideo-coding stations where human operators enter addressinformation. The results of automatic recognition and human codingare transformed to address codes, which are then sprayed on thecorresponding mail pieces as they run through the sorting machine.After the address codes are mapped to numbers that show a carriersequence, the mail pieces can be sorted in sequence by using thetwo-pass radix sort method.
Therecognition system consists of a high-speed scanner, imagepreprocessing hardware, and the computer software that carries outlayout analysis for address block location, character linesegmentation, character string recognition (i.e., address phraseinterpretation), character classification and post processing (Fig. 7). As can be seen in the block diagram, there are many modules thatmake imperfect decisions; i.e., uncertainty is always involved.Algorithms to solve
specificproblems are susceptible to variations in the images, so the mostbasic questions are how to deal with uncertainty and variability andhow to implant robustness into the system. A more appropriatequestion may be how to compose such a recognition system from smallpieces of recognition modules, or how to connect those modules.
Inanswering these questions, it should be recognized that there aredesign principles that can guide researchers and engineers. We maycall them robustness design principles. Table 1 lists them and givessimple explanations. In the following subsections, five suchprinciples are discussed.
3.1.Hypothesis-driven principle
Variabilitymeans that no one solution can fit all situations. There-fore,problems must often be divided into a certain number of cases with adifferent solution (problem-solver) to each case. However, the caseto which an input in question belongs is unknown. Thehypothesis-driven principle can be applied in such cases, and theproblem of Japanese address block identification is one such case.There are six layout types basically, but in real life, there areactually twelve types because envelopes are sometimes usedupside-down. The approach we take is to choose salient features todistinguish between such cases and to evaluate the likelihood of eachcase based on the observed value of such salient features. As ageneral framework of the hypothesis-driven approach, we call the casea hypothesis and the observed salient features evidence, and astatistical hypothesis test method may be used to evaluatelikelihood. The a posteriori probability of the k -th hypothesisafter observing evidence for this hypothesis can be computed as inEq. (1), where Hk represents the k -th hypothesis, and e k thefeature vector for the kth hypothesis. In Eq. (1), L is a likelihoodratio of hypothesis H k to null hypothesis¯Hk and is computed as inEq. (2) assuming the statistical independence of the features.Functions, P( ki |H k) andP(eki|¯Hk), can be learned from thetraining samples.
Therefore,observing evidence {ek|k = 1,...,K } for all hypotheses makes itpossible to compute L( e k|Hk) and P(Hk|ek) accordingly, to find themost probable hypothesis [41] .
Inthe hypothesis-driven approach, after identifying candidates ofhypotheses, the corresponding problem-solvers applicable only to thatkind of input are called to process the input.
3.2.Deferred decision/multiple-hypotheses principle
Ina complex pattern recognition system, many decisions must be made toobtain the final result. As always, each decision is not 100%accurate, so the decision-making modules cannot be simply cascaded.Each module should not make a decision but should defer the decisionand forward multiple hypotheses to the next module. The idea itselfis a simple one. In the case of postal address recognition, there canbe as many functional modules as shown below:
• Lineorientation detection
• Charactersize (large/small) determination
• Characterline formation and extraction
• Addressblock identification
• Charactertype (machine-printed/handwritten) identification
• Script(Kanji/Kana) identification
• Characterorientation identification
• Charactersegmentation
• Characterclassification
• Wordrecognition
• Phraseinterpretation
• Addressnumber recognition
• Building/roomnumber recognition
• Recipientname recognition
• Finaldecision making (accept/reject/retry)
Thesefunctional modules generate multiple hypotheses each of which is thenforwarded to the next module, which again generates multiplehypotheses. This process therefore creates the kind of hierarchicaltree of hypotheses shown in Fig. 8 . The question here is how to findwhich optimum branches to follow to reach the best possible answer inthe shortest possible time. Among the well known search methods, webasically use the Hill Climbing Search with backtracking, by which wecan reach the optimum solution in the shortest time. When an optimumbranch is rejected at a later stage because it has a confidence valuesmaller than a preset threshold, other branches are processed. Theuse of the Beam Search at the later stages effectively boosts therecognition accuracy, while its use in earlier stages is too costly.Search control on the number of hypotheses to generate is importanttrade-off between time and accuracy because computational time islimited to 3.7s in our case. Of course, shorter is better because itrequires less computational power.
3.3.Information integration principle
Werecognize three kinds of information integration known in thecharacter and document recognition field to attack the uncertaintyissue: (1) process integration, (2) combination-based integration,and (3) corroboration-based integration. The first approach, processintegration, integrates two or three processes to form a singleproblem-solver. Examples are segmentation—recognition methods andsegmentation--recognition--interpretation methods.
Thisapproach started in the area of speech understanding back in the1970s. The second combination-based integration approach is the onetaken in character classification and known as classifier combinationor classifier ensemble [13--15] . Different classifiers such asstatistical and structural classifiers and neural networks arecombined (integrated) to deduce a single result, in the expectationthat the classifiers will behave complementarily. Methods known asmajority voting and Dempster Shafer approaches can be used toimplement the algorithm. Finally, corroboration-based integration isthe approach of finding additional evidence that supports the resultor looking for multiple input information sources for the sameinformation. A good example is reading bank check amounts byrecognizing both the courtesy amount (numerals) and the legal amount(numbers in words).
Inpostal address recognition, both the postal code and the addressphrase in words are read to obtain more accurate results. Recipientname recognition is another example of corroboration. This approachis taken when street numbers are not recognized. In postal addressrecognition, the most important consideration is to integrate thethree processes of character segmentation, character classification,and interpretation of the phrases (or linguistic processing). Asdescribed in previous sections, address knowledge is required toresolve the ambiguities in segmentation incorporation withgeometrical information[42] and character similarity, so simple application of the multiple-hypotheses principle was not sufficient.An approach known as the lexicon-directed or lexicon-driven approachhas been developed and can be considered a hypothesis-drivenapproach, as explained below. The approach is illustrated in Fig. 9,where an input pattern is interpreted by searching for the path inthe presegmentation network ( Fig. 10) that best matches the path inthe network that represents linguistic knowledge ( Fig. 11 ). We can
saythat this is the equivalent of searching for a path in the linguisticnetwork that best matches a path in the presegmented network [18,19].This interpretation of the knowledge-directed recognition process isin line with an explanation given by Simon[43] :
When it is solvingproblems in semantically rich domains, a large part of theproblem-solving search takes place in long-term memory and is guidedby information discovered in that memory.
Inour case, the long-term memory refers to the linguistic knowledge,and the short-term memory refers to the presegmented network.
Wehave developed several versions of such algorithms, one of which(Fig. 12) was presented by Liu et al. [19] . The recognition rate ofthe lexicon-driven handwritten address recognition algorithm was83.7% with 1.1% error in an experiment, which was done using 3589actual mail pieces and a lexicon containing 111349 address phrases.The linguistic model was represented in the TRIE structure, and thesearch was controlled by the Beam Search method. Recognition time wasabout 100ms using a Pentium III/600MHz machine.
3.4.Alternative solutions principle
Thereare many image level problems including touching characters, touchingunderlines, window shadow noise, cancellation stampscovering/touching address characters, and so on. The alternativesolutions approach is to provide more than one solution to a problem.It effectively provides solutions that are complementary to eachother. For example, the problem of touching characters may be solvedusing a holistic approach or a forced separation (dichotomizing)approach. Especially when dealing with numerals, a pair of touchingnumerals can be treated as one character out of 100 classes. Trainingsuch holistic classifiers enables the results of the holistic anddichotomizing classifiers to be merged producing more reliablerecognition results. Another example of the alternative solutionsapproach is used to solve the window noise problem. When existence ofwindow noise is suspected, two problem-solvers are needed. Oneattempts to eliminate such noise by erosion (thinning) operation,assuming the shadow is thin or faint. The other attempts to extractline segments that form a frame, assuming the shadow is rather solid.These two problem-solvers are used hoping one will succeed.
3.5.Perturbation principle
Theprinciple of perturbation is to modify the problem slightly when itis difficult to solve and to try again to solve it. If patternrecognition were such a continuous process, the perturbationprinciple would not work. In reality, however, it is often adiscontinuous process. Very small modifications may change the finalrecognition results. It is hoped that the change is from rejection tocorrect recognition or from error to correct recognition. Thisapproach was used in the 1980s to recognize handwritten numeralsusing a structural approach. Because slight topological variationscaused rejection, perturbation of parameters or of input imagesimproved the recognition rate. In recent years more systematicstudies have again shown the effectiveness of the approach. Inputimages are perturbed by various transformations such as morphological(dilation/erosion) and geo-metrical transformations (rotation,slanting, perspective, shrinking, and expanding). In Ha and Bunke'swork [44] , handwritten numerals were transformed in twelve ways andrecognized using the frame-work of classifier combination. Theirapproach recognized difficult, eccentric handwriting better thanclassical classifiers such as k –NN and neural network. By the way,blurring is one of image transformations but has not been applied inthe context of perturbation. Blurring used in character featureextraction is not the kind of 'slight transformation'.
Theperturbation approach has also been successfully applied to Japanesepostal address recognition. Our test of the approach achieved about10--15 percentage point improvements in recognition rates on theaverage. When we did not set limits on recognition time and repeatedmore perturbation operations including rotational transformation,rebinarization, and some other parametric modifications in sequence,we found that 53% of rejected images were correctly recognized with a12% error rate. Although the result was attractive, reduction ofadditional errors is a necessary step to using this approach. Onepossible way to pursue this is to apply the combination scheme as Haand Bunke did [44] . Instead of taking the first recognition resultafter a series of rejections, multiple perturbations may besimultaneously applied yielding one result by voting, for example. Inthe light of ever increasing computing power, this approach seems tobe very promising. It should be noted here that perturbation is notonly effective to character classification but also effective tolayout analysis, line extraction, character segmentation, and otherintermediate decisions.
3.6.Robustness implementation
Thedesign principles described in the previous subsections con-cern thestructure and algorithms of a recognition system, but classifiers andvarious parameters have to be carefully and simultaneously trainedand adjusted [40] . The same is true even for specificproblem-solving modules. Though minor, many problems emerge duringthe development phases. Robustness implementation, therefore, is adifficult task for researchers and engineers. The following areimportant keys to an efficient and effective development process.
• Livesamples at users' sites
• Robustnessmeasurement using many 'bags' of test samples
• Accelerationdata sets
• Sample-by-samplecause analysis
Ifpossible, it is highly desirable to gather samples from the users'sites. We call these real samples live samples. However, live samplesshould not be mixed into a single sample set while samples areusually collected in multiple sessions. It is important to choose theright occasions to capture samples because sample characteristicsvary depending on the operational modes and seasonal tendencies.With-out mixing the collections, we have kept samples in manydifferent 'bags'. Recognition rates (or recognition accuracy) may bemeasured for each of the bags (or data sets), as shown in Fig. 13.Here, a trick in the graph is that the data set numbers arerearranged so that the recognition rates are in decreasing order.Arranging the graph this way enables observation of the profiles ofrecognition rates, where a steeper slope means that the recognitionsystem is less robust. In addition, if recognition performance for adata set is very low, then we can re-examine that data set in detail,which is small in size, to identify the cause of the problem (i.e.,low recognition rate).
Acceleration data sets are collections of samples that have been rejected or erroneously recognized by a given version of the recognizer. Every sample in these data sets may be given a unique identifier, by which the samples can be subjected to sample-by-sample cause analysis and, more importantly, by which improvements can be traced throughout the development process. If names and problem codes can be assigned to problematic situations, the non-straightforward progress resulting from the remedying process can be managed more appropriately.
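As one way to picture this bookkeeping, the record below sketches how an acceleration-data-set entry might be organized; the field names and problem codes are assumptions for illustration, not the format used in the work described here.

```python
# Sketch of an acceleration-data-set record: each rejected or misrecognized
# sample carries a stable identifier and an optional problem code so that
# fixes can be traced across recognizer versions. Field names are assumptions.
from dataclasses import dataclass, field

@dataclass
class AccelerationSample:
    sample_id: str                  # unique, stable across versions
    image_path: str
    true_label: str
    outcome: str                    # "rejected" or "error"
    problem_code: str = ""          # e.g. "touching-chars", "faint-stroke"
    history: dict = field(default_factory=dict)   # version -> outcome

def update_history(samples, version, outcome_of):
    """Record each sample's outcome under a new recognizer version."""
    for s in samples:
        s.history[version] = outcome_of(s)         # hypothetical callback
```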
4. Future prospects
A 40--50 year overview of OCR history and of the current market may give rise to the view that the technology is almost mature. However, it is clear that the technology is still in the midst of development and is far inferior to human cognition. From the viewpoint that the technology is mature, the current state corresponds to the long-tail part of the market (or of the applications). According to this view, the 'head' part of the market consists of a small number of applications, each with a huge volume of documents to read: business form reading, bank check reading, and postal address reading. These have been investment-effective because demand is sufficiently heavy, so the return on investment has almost always been assured. Of course, technological advances have elongated the head towards the tail, but the remaining tail is very long. The three application areas considered parts of the head also have tail parts of their own: there are many business forms, checks, and mailpieces that are very difficult to read, and more advanced recognition techniques are undoubtedly needed. For example, small and medium-sized enterprises (SMEs) in Japan still use paper forms for bank transactions and paper income forms for reporting to local government. The number of transactions carried out by each such company is not very large, and there is little incentive for them to innovate. Banks that receive different forms from such companies therefore want more intelligent, versatile OCRs. The long-tail phenomenon applies to postal address recognition as well. The questions are whether the demand side can foresee the return on investment in proposed new products and systems, and whether scientists and engineers can convince them of that return, while the technical problems remain piecemeal and diverse. These are typical long-tail questions.
Looking at the future from a different angle, there is the chicken-and-egg question of need versus seed, which is difficult to answer in general. From industry's viewpoint, it seems more important to think of needs, or at least latent needs, and future needs seem, at least for now, to be a matter of judgment. Well-recognized unfilled needs today include: (1) office document archives for e-Government, (2) handwriting as a human interface for mobile devices, (3) text in videos for video search, and (4) books and historical documents for global search. There are also two other applications: (5) text in the scene for information capture, and (6) handwritten document management for knowledge workers.
Unknown scripts and unknown languages are a big handicap for travelers in foreign countries who must make quick decisions on the road, in shops, at the airport, and so on. A mobile device with a digital camera, i.e., an Information Capturing Camera [45], may be an aid in such situations (Fig. 14). With a higher-performance microprocessor, text in the scene can be recognized. The technical challenges include color image processing, geometric perspective normalization, text segmentation, adaptive thresholding, unknown script recognition, language translation, and so on. Every mobile phone in Japan is equipped with a digital camera, and their microprocessors are becoming more powerful. Some digital cameras now have intelligent functionality that locates faces in images about to be taken; the question, then, is why text recognition remains so difficult. Some mobile phones in Japan can now recognize over 4000 Kanji characters [36]. An interesting challenge is a dynamic recognition capability, which ensures high recognition performance by repeatedly recognizing multiple shots of camera images without the user's conscious operation. Users may try various angles and positions while aiming at a recognition target; this can be considered interactive perturbation.
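One way such dynamic recognition could be organized is sketched below; it is only an illustration of the multi-shot idea, with a hypothetical per-frame recognize_frame() function and illustrative thresholds, not a description of any deployed phone software.

```python
# Sketch of multi-shot 'dynamic recognition' (interactive perturbation).
# `frames` is any iterator of camera images; `recognize_frame(img)` is a
# hypothetical function returning (text, confidence), with text == "" on failure.
from collections import Counter

def dynamic_recognize(frames, recognize_frame, min_agreement=3, max_frames=30):
    """Accumulate per-frame results until several shots agree on the same text."""
    votes = Counter()
    for i, frame in enumerate(frames):
        if i >= max_frames:
            break                              # give up after too many shots
        text, conf = recognize_frame(frame)
        if text:
            votes[text] += 1
            best, count = votes.most_common(1)[0]
            if count >= min_agreement:         # stable across several shots
                return best
    return None                                # no stable result; keep aiming
```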
Another attractive area is the digital pen and handwritten document management. The act of handwriting is being reconsidered on the basis of its importance in education and knowledge work. The act of writing helps people read, write, and memorize, and we may integrate these acts into information systems by using today's digital pens, which can capture handwritten annotations and memos in a very natural way. The Anoto functionality is one such advanced technique; it digitally captures handwriting stroke data and other related data (Fig. 15). Several research groups are using such digital pens to create more intelligent information management systems [46--49]; their goal is to seamlessly manage documents together with digital ink. A group advocating 'Information Just-in-Time' (iJIT) is developing a pilot system for researchers that supports their note-taking and hybrid document management [49]. Their handwritten research notebooks can always be kept consistent with their digital counterparts in computers, so members can easily share information within the group even when they are located remotely. Another feature of the system is that users can print any digital document in such a way that the printout is sensitive to a digital pen (Fig. 16): the content of the digital document is printed overlaid with Anoto dots. Users can therefore mark and write annotations on those printouts, and the handwriting strokes are captured and synchronized with the corresponding document already existing in the computer. The value of this kind of system is that a digital document in the computer comes to carry the same annotations as its physical counterpart, meaning that users can throw away the paper documents at any time without any loss of information. This concept enables users to work equally well in the digital world and in the real world, and it is an attempt to go beyond the myth of the paperless office [50]. When such a use of digital pens becomes common practice, demand will naturally arise for handwritten character recognition, handwritten query processing, and more intelligent knowledge management. Creating information systems that require recognition technology is one path we may pursue; we hope that more advanced information systems will demand more advanced recognition technology.
5. Conclusion
Vision and fundamental technologies are both key to the future of our technical community. Vision takes the form of forecast applications with new value propositions. For investment to be made in new technology, such propositions need to be attractive to many people, or at least to some innovative people; this is a top-down approach to innovation. Fundamental technologies may also start innovation from the bottom. The technologies discussed here have two parts: one is the technology that supports our community from below; the other is our own technology, i.e., character and document recognition. For the first part, we have seen the impact of advanced semiconductor devices, high-performance computers, and more advanced software development tools, which have supported the advances in recognition technology. They not only enabled more advanced OCR systems, but also drew more of academia into this community, which has likewise contributed to the advances in recognition technology. We would like to see this kind of virtuous cycle continue forever.
Acknowledgments
The author is grateful to the members of his research team at Hitachi who worked on the development of the postal address recognition system: H. Sako, K. Marukawa, M. Koga, H. Ogata, H. Shinjo, K. Nakashima, H. Ikeda, T. Kagehiro, R. Mine, N. Furukawa, and T. Takahashi. He is also grateful to Dr. C.-L. Liu of the Institute of Automation, Chinese Academy of Sciences, Beijing, and Prof. Y. Shima of Meisei University, Tokyo, for the work they did at our laboratory. The author also thanks Prof. G. Nagy of Rensselaer Polytechnic Institute for his valuable discussions and comments on this manuscript. Thanks also go to Dr. U. Miletzki of Siemens ElectroCom for providing information regarding their historical work.