中文郵政地址與鄰近相關資訊擷取之研究; Extraction of Chinese postal addresses and associated information from general Web pages

Please use this permanent URL to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/48534

Title:	中文郵政地址與鄰近相關資訊擷取之研究; Extraction of Chinese postal addresses and associated information from general Web pages
Author:	黃嘉毅; Chia-Yi Huang
Contributor:	Institute of Computer Science and Information Engineering (資訊工程研究所)
Keywords:	associated information extraction; conditional random fields; address extraction
Date:	2011-08-27
Upload time:	2012-01-05 14:57:22 (UTC+8)
Abstract:	Addresses are among the most frequently used pieces of information in daily life. People often look up the addresses of physical stores, schools, or organizations on the Web and then use a map service to confirm the actual location. However, not every website provides both an address and a map-marking function. This study therefore aims to design a service that automatically extracts Chinese addresses from web pages and, combined with map marking, places the extracted addresses together with their associated information on a map, giving users a simple and convenient map-annotation service. The system has two parts. In the first part, web pages are tokenized in two ways, by splitting the text into individual Chinese characters and by Yahoo Chinese word segmentation, and conditional random fields (CRFs) are trained with two labeling schemes, BIEO and IO, to produce an address extraction model; input pages are then run through this model to extract addresses. In the second part, the extracted addresses serve as landmarks for extracting associated information from the page and for finding the boundary of each address block, that is, the region containing an address and its associated information. The experiments show that address extraction reaches an F-measure of about 0.99 (reported as up to 0.9914) when evaluated over the total set of addresses from all pages, and an average F-measure of about 0.97 when evaluated per page. Associated information is extracted correctly for about 92% of the data (an address block boundary accuracy of 0.9212).
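The two labeling schemes named in the abstract can be illustrated with a minimal sketch. It assumes single-character tokenization and that the address span is already known (as it would be in annotated training data). The function names, the sample string, and the choice to label a one-token address as "B" are illustrative assumptions, not details taken from the thesis. The CRF training step and the Yahoo word segmentation are not reproduced here.

```python
from typing import List, Tuple


def char_tokens(text: str) -> List[str]:
    """Split text into single-character tokens, dropping whitespace."""
    return [ch for ch in text if not ch.isspace()]


def label_tokens(
    tokens: List[str],
    spans: List[Tuple[int, int]],
    scheme: str = "BIEO",
) -> List[str]:
    """Assign one label per token.

    spans are half-open [start, end) token index ranges that cover addresses.
    BIEO: B = first token of an address, I = inside, E = last token, O = outside.
    IO:   I = any token of an address, O = outside.
    Assumption: under BIEO a one-token address is labeled "B".
    """
    if scheme not in ("BIEO", "IO"):
        raise ValueError("scheme must be 'BIEO' or 'IO'")
    labels = ["O"] * len(tokens)
    for start, end in spans:
        for i in range(start, end):
            if scheme == "IO":
                labels[i] = "I"
            elif i == start:
                labels[i] = "B"
            elif i == end - 1:
                labels[i] = "E"
            else:
                labels[i] = "I"
    return labels


# Hypothetical page fragment: a field label, an address, then a following field.
tokens = char_tokens("地址：中壢市中大路300號 電話")
address_span = [(3, 13)]  # tokens 3..12 spell 中壢市中大路300號

print("".join(label_tokens(tokens, address_span, "BIEO")))  # OOOBIIIIIIIIEOO
print("".join(label_tokens(tokens, address_span, "IO")))    # OOOIIIIIIIIIIOO
```

In this sketch BIEO marks where an address starts and ends, while IO only marks membership, so two addresses that touch each other would merge into a single run of "I" labels under IO.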
Appears in collections:	[Institute of Computer Science and Information Engineering (資訊工程研究所)] Theses and Dissertations

Files in this item:

File	Description	Size	Format	Views
index.html		0Kb	HTML	665

All items in NCUIR are protected by the original copyright.
