PTT網站餐廳美食類別擷取之研究 (A Study on Extracting Restaurant Food Types from the PTT Website)


Please use this identifier to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/74641

Title:	PTT網站餐廳美食類別擷取之研究 (A Study on Extracting Restaurant Food Types from the PTT Website)
Authors:	鍾智宇;Chung, Chih-Yu
Contributors:	Executive Master Program, Department of Computer Science and Information Engineering
Keywords:	Machine Learning; Named Entity Recognition; Tri-Training
Date:	2017-07-24
Issue Date:	2017-10-27 14:34:37 (UTC+8)
Publisher:	National Central University
Abstract:	With the rapid development of information technology and the Internet, and with mobile devices becoming increasingly widespread, obtaining everyday information online has become the norm. Effectively extracting useful information from large volumes of rich and diverse data, however, remains a major challenge, and information extraction has therefore become a popular research topic. Its main task is to turn unstructured data into structured data through steps such as sorting and filtering, so that useful information can be extracted effectively.
This study applies machine learning, one of the approaches used in information extraction, to the "Food" board of PTT, the largest bulletin board system (BBS) in Taiwan, to develop a method that automatically extracts restaurant-related information from articles and determines the restaurant type, making restaurant information faster and more convenient to obtain. The work is divided into three parts. The first part is restaurant information extraction: a PTT Crawler retrieves articles from the PTT Food board and stores them in a database for formatting; the data are analyzed manually to get an overview; and the articles are then scanned by keyword search to extract the title, restaurant name, telephone number, address, and URL. The second part is restaurant type extraction: the preprocessing analysis shows that 72.5% of restaurant types are implied in the article title, so titles are used as the extraction source. The titles are segmented with the CKIP system, and 10,000 randomly selected titles are manually labeled with the restaurant types they contain. The labeled data are then used to train an extraction model with WIDM_NER_TOOL, a tool developed by the WIDM lab that integrates conditional random fields (CRF) through the bundled CRF++ package, using the BIESO tagging scheme. The third part is the experiments: title data are fed to the trained model, with supervised learning on the labeled data and semi-supervised learning that also uses unlabeled data, to evaluate system performance. The results indicate that the method performs well for restaurant type extraction.
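As an illustration of the BIESO tagging scheme mentioned in the abstract, the following minimal Python sketch assigns B (begin), I (inside), E (end), S (single), and O (outside) tags to the tokens of a segmented title. It is not taken from the thesis: the function name bieso_tags, the span representation, and the example title (shown as English glosses rather than CKIP-segmented Chinese tokens) are assumptions made for illustration, and the actual input format expected by WIDM_NER_TOOL is not described in this record.

```python
from typing import List, Tuple


def bieso_tags(tokens: List[str], spans: List[Tuple[int, int]]) -> List[str]:
    """Return one BIESO tag per token.

    `spans` lists labeled entities as (start, end) token indices,
    with `end` exclusive. This representation is hypothetical.
    """
    tags = ["O"] * len(tokens)          # everything is "outside" by default
    for start, end in spans:
        if end - start == 1:
            tags[start] = "S"           # single-token entity
        else:
            tags[start] = "B"           # first token of the entity
            for i in range(start + 1, end - 1):
                tags[i] = "I"           # tokens strictly inside the entity
            tags[end - 1] = "E"         # last token of the entity
    return tags


# Hypothetical segmented title; real input would be CKIP-segmented Chinese.
tokens = ["Taipei", "East-District", "Japanese", "ramen", "recommended"]
spans = [(2, 4)]                        # "Japanese ramen" marked as the restaurant type
tags = bieso_tags(tokens, spans)

# One token and its tag per line, a layout commonly used for CRF training data.
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

In this sketch the two-token restaurant type receives the tags B and E, and the surrounding tokens receive O; a one-token type would receive S.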
Appears in Collections:	[Executive Master of Computer Science and Information Engineering] Electronic Thesis & Dissertation

Files in This Item:

File	Description	Size	Format
index.html		0Kb	HTML
