樣板網頁結構自動分群;Clustering of Template Page for Data Extraction

Please use this identifier to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/77510

Title:	樣板網頁結構自動分群;Clustering of Template Page for Data Extraction
Authors:	吳佳儒;Wu, Jia-Ru
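Contributors:	Executive Master of Computer Science and Information Engineering
Keywords:	feature selection; template web page extraction; hierarchical clustering; unsupervised clustering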
Date:	2018-07-23
Issue Date:	2018-08-31 14:46:33 (UTC+8)
Publisher:	National Central University
Abstract:	In web data extraction, the diversity of web content and the complexity of page structure make it difficult to extract data automatically from pages built on different templates. Web data extraction systems fall into two main categories, record level and page level. Both take pages that share a single template as input and perform data extraction or schema induction; research on clustering pages that come from different templates is comparatively rare. This thesis proposes a method that automatically clusters web pages by the similarity of their structure, which simplifies the problem of extracting data across different templates. Using a set of purpose-designed page features, we implement both unsupervised and supervised clustering and compare their performance. Although overall clustering performance falls short of expectations, results for the target cluster are strong: unsupervised clustering reaches a precision of 99% and a recall of approximately 78%, and supervised clustering reaches a precision of 97% and a recall above 80%. Finally, the clustering result can be combined with the Page-Level Information Extraction System (UWIDE) to generate a complete page schema and extract the required POI-related information. That information can then be accumulated in a database to improve the efficiency and quality of related value-added services.
Appears in Collections:	[Executive Master of Computer Science and Information Engineering] Electronic Thesis & Dissertation
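
The abstract reports precision and recall for the target cluster only, not for the clustering as a whole. The following is a minimal sketch of how such per-cluster scores are conventionally computed. It is not the thesis's evaluation code: the function name, the label names, and the toy data are all hypothetical, and it assumes each page already carries both a ground-truth label and an assigned cluster label that share the same naming.

```python
def target_cluster_scores(predicted, actual, target):
    """Return (precision, recall) for a single target cluster.

    predicted, actual: equal-length sequences of labels, one per page.
    target: the label of the cluster of interest.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")

    # Pages placed in the target cluster that really belong there.
    true_pos = sum(1 for p, a in zip(predicted, actual)
                   if p == target and a == target)
    # All pages placed in the target cluster.
    predicted_pos = sum(1 for p in predicted if p == target)
    # All pages that really belong to the target cluster.
    actual_pos = sum(1 for a in actual if a == target)

    precision = true_pos / predicted_pos if predicted_pos else 0.0
    recall = true_pos / actual_pos if actual_pos else 0.0
    return precision, recall


# Toy data invented for illustration; these are not results from the thesis.
actual = ["poi"] * 5 + ["other"] * 5
predicted = ["poi"] * 4 + ["other"] * 6

precision, recall = target_cluster_scores(predicted, actual, "poi")
print(precision, recall)
```

In this toy case every page placed in the target cluster belongs there, but one true target page is missed. High precision with lower recall is the same general pattern the abstract describes for the unsupervised setting.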

Files in This Item:

File	Description	Size	Format
index.html		0Kb	HTML	244	View/Open
