Recognition and Postprocessing of Chinese Business Cards (中文商業名片辨識及後處理)

Please use this permanent URL to cite or link to this document: http://ir.lib.ncu.edu.tw/handle/987654321/9017

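Title:	Recognition and Postprocessing of Chinese Business Cards (中文商業名片辨識及後處理)
Author:	Tai-Hung Chen (陳泰宏)
Contributor:	Graduate Institute of Electrical Engineering
Keywords:	hidden Markov model; semantics; postprocessing; Chinese; recognition; business card; Viterbi algorithm; language model; language; linguistic; Viterbi; OCR; HMM; card
Date:	2000-07-10
Uploaded:	2009-09-22 11:39:31 (UTC+8)
Publisher:	National Central University Library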
Abstract:	Business cards convey important personal information. To use this information efficiently, it must be extracted automatically and stored in an electronic database; a system that does this is called a business card recognition system. In general, such a system has three stages. First, a preprocessing stage processes the card image and extracts the character images. Second, a layout analysis stage analyzes the structure of the card. Third, a postprocessing stage applies linguistic knowledge to raise the recognition rate. This thesis studies the recognition of Chinese business cards. We assume that the characters have already been extracted and that the card layout has already been analyzed. Our aim is to improve the low recognition rate of OCR on business cards, which arises because the characters are small and vary widely in font. In our approach, hidden Markov models (HMMs) recognize the characters on Chinese business cards: a left-to-right HMM recognizes each character and outputs its top-10 candidates. In the postprocessing stage, a language model then refines the recognition result. The Viterbi algorithm, using bigrams as its linguistic information, searches the top-10 candidates for the correct characters, and the resulting best character sequence is the postprocessed output.
Our experiments cover the address and company fields of Chinese business cards. The bigram table and the HMMs were trained on telephone directory data, and 100 address fields and 30 company fields were used for testing. The experimental results confirm that the proposed method is effective.
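The postprocessing step described in the abstract (a Viterbi search over per-character candidate lists, scored with bigrams) can be illustrated with a minimal sketch. This is not the thesis implementation: the function name `viterbi_bigram`, the way recognizer scores and bigram scores are summed as log-probabilities, the `floor` value for unseen bigrams, and the toy data (two candidates per position instead of ten) are all assumptions made for illustration.

```python
import math

def viterbi_bigram(candidates, bigram_logprob, floor=-20.0):
    """Pick one character per position, maximizing recognizer + bigram log scores.

    candidates: list of positions; each position is a list of (char, log_score)
                pairs, e.g. the top-N outputs of a character recognizer.
    bigram_logprob: dict mapping (prev_char, char) -> log P(char | prev_char).
    floor: log-probability assumed for unseen bigrams (a crude smoothing choice).
    """
    if not candidates:
        return []
    # best[i][k]: best total log score of any path ending at candidate k of position i
    best = [[score for _, score in candidates[0]]]
    back = [[None] * len(candidates[0])]
    for i in range(1, len(candidates)):
        row, ptr = [], []
        for ch, score in candidates[i]:
            # Score of reaching `ch` from each candidate at the previous position.
            options = [
                best[i - 1][j] + bigram_logprob.get((prev, ch), floor)
                for j, (prev, _) in enumerate(candidates[i - 1])
            ]
            j_best = max(range(len(options)), key=options.__getitem__)
            row.append(options[j_best] + score)
            ptr.append(j_best)
        best.append(row)
        back.append(ptr)
    # Trace back from the best-scoring final candidate.
    k = max(range(len(best[-1])), key=best[-1].__getitem__)
    path = []
    for i in range(len(candidates) - 1, -1, -1):
        path.append(candidates[i][k][0])
        if i > 0:
            k = back[i][k]
    return path[::-1]

# Toy data (hypothetical): the recognizer's first choice at position 1 is wrong.
cands = [
    [("台", math.log(0.6)), ("合", math.log(0.4))],
    [("比", math.log(0.55)), ("北", math.log(0.45))],
    [("市", math.log(0.7)), ("巾", math.log(0.3))],
]
bigrams = {("台", "北"): math.log(0.5), ("北", "市"): math.log(0.6)}

result = viterbi_bigram(cands, bigrams)
print("".join(result))  # 台北市
```

Taking the first candidate at every position would give 台比市; in this toy case the bigram scores pull the search toward 台北市, which is the kind of correction the abstract attributes to the postprocessing stage.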
Appears in Collections:	[Graduate Institute of Electrical Engineering] Theses and Dissertations

All items in NCUIR are protected by their original copyright.
