Textual document image classification extracts the information contained in textual document images, analyzes the abstract meaning embedded in the documents and the message the authors intend to convey, and then classifies and manages the documents as required. Document analysis provides several techniques for this information extraction: binarization separates the foreground from the background; segmentation extracts the block objects; reading order analysis transforms the geometrical structure obtained from layout analysis into a logical structure that reflects reading order; font style detection and font type classification recognize the typographic information in text images; and latent semantic analysis estimates the similarity between the contents of different documents. During the document analysis process, a global binarization threshold is first selected and used to perform block segmentation.
Then, a local binarization threshold is determined independently for each paragraph block so that individual character blocks can be segmented more precisely. The logical relation between each pair of paragraph blocks is then defined by reading order analysis, which is designed around human reading habits. For each segmented text block, the proposed virtual stroke extraction technique extracts the contour of each character image. Italic characters are detected, without optical character recognition, by a structural rule derived from the effect of the shear transformation on stroke structure. Font type is classified into three categories using the width of the character image and the presence or absence of serifs at the ends of strokes. Boldface is detected by checking the average stroke width of each word. The feature words that represent the content of a document are selected according to the changes in font style and font type across the text, together with a semantic tree built from the relative positions of the words. The similarity between two textual documents is calculated as the included angle between their feature vectors, which are constructed from the relative positions of the feature words in each document. Finally, documents are classified based on the extracted content.
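
The included-angle similarity measure mentioned above can be illustrated with a minimal sketch. It assumes each document has already been reduced to a numeric feature vector of equal length; the abstract does not specify how the relative positions of feature words are encoded, so the function name `included_angle` and the example vectors are hypothetical and serve only to show the angle computation.

```python
import math


def included_angle(u, v):
    """Return the angle in radians between two equal-length feature vectors.

    A smaller angle means the two documents are more similar under this
    measure. The vector encoding itself is an assumption of this sketch.
    """
    if len(u) != len(v):
        raise ValueError("feature vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("the angle is undefined for a zero vector")
    # Clamp to [-1, 1] so floating-point rounding cannot break acos.
    cosine = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cosine)


if __name__ == "__main__":
    # Hypothetical feature vectors for two documents.
    doc_a = [1.0, 2.0, 0.0]
    doc_b = [2.0, 4.0, 0.0]
    print(included_angle(doc_a, doc_b))
```

Using the angle rather than the Euclidean distance makes the comparison insensitive to vector magnitude, so a long document and a short one with the same proportions of feature words are treated as similar.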