NCU Institutional Repository (theses and dissertations, past exam papers, journal articles, and research projects available for download): Item 987654321/77510


    Please use this identifier to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/77510


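    Title: Clustering of Template Page for Data Extraction (樣板網頁結構自動分群)
    Authors: Wu, Jia-Ru (吳佳儒)
    Contributors: Executive Master of Computer Science and Information Engineering
    Keywords: feature selection; template page extraction; hierarchical clustering; unsupervised clustering
    Date: 2018-07-23
    Issue Date: 2018-08-31 14:46:33 (UTC+8)
    Publisher: National Central University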
    Abstract: In the field of Web Data Extraction, the diversity of web content and the complexity of page structure make it very challenging to extract data automatically from web pages built on many different templates.
    Web data extraction systems fall into two main categories: Record Level and Page Level. Both take web pages that share the same template as input and perform data extraction or schema induction; research on clustering pages that use different templates is comparatively rare.
    This thesis proposes a method that automatically clusters web pages by the similarity of their structure, which simplifies the problem of extracting data across different page templates. We implement both unsupervised and supervised clustering on a set of designed page features and compare their performance. Although the overall clustering performance falls short of expectations, the results for the target cluster are strong: unsupervised clustering reaches a precision of 99% and a recall of about 78%, and supervised clustering reaches a precision of 97% and a recall of more than 80%.
    Finally, the clustering result can be combined with the Page-Level Information Extraction System (UWIDE) to generate a complete page schema and extract the required POI-related information. This information can then be accumulated in databases to improve the efficiency and quality of related value-added services.
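    The abstract does not specify which page features or clustering algorithm the thesis uses, so the following is only a minimal sketch of the general idea, under stated assumptions: each page is fingerprinted by its set of root-to-node tag paths, pages are compared with Jaccard similarity, and clusters are merged by average-linkage agglomerative clustering until no pair is similar enough. The names `tag_paths`, `cluster_pages`, `precision_recall`, and the `threshold` value are hypothetical and not taken from the thesis.

```python
from html.parser import HTMLParser


class TagPathCollector(HTMLParser):
    """Collect root-to-node tag paths as a crude structural fingerprint (assumed feature)."""

    VOID = {"br", "img", "hr", "input", "meta", "link"}  # tags that never get a closing tag

    def __init__(self):
        super().__init__()
        self.stack = []
        self.paths = set()

    def handle_starttag(self, tag, attrs):
        self.paths.add("/".join(self.stack + [tag]))
        if tag not in self.VOID:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.stack:
            # Pop up to and including the matching open tag.
            while self.stack and self.stack.pop() != tag:
                pass


def tag_paths(html):
    collector = TagPathCollector()
    collector.feed(html)
    return frozenset(collector.paths)


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_pages(pages, threshold=0.5):
    """Average-linkage agglomerative clustering; stop when the best pair falls below threshold."""
    feats = [tag_paths(p) for p in pages]
    clusters = [[i] for i in range(len(pages))]

    def sim(c1, c2):
        total = sum(jaccard(feats[i], feats[j]) for i in c1 for j in c2)
        return total / (len(c1) * len(c2))

    while len(clusters) > 1:
        best, bi, bj = max(
            (sim(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        )
        if best < threshold:
            break
        clusters[bi] = clusters[bi] + clusters[bj]
        del clusters[bj]
    return clusters


def precision_recall(predicted, actual):
    """Precision and recall of one predicted cluster against the true target cluster."""
    predicted, actual = set(predicted), set(actual)
    tp = len(predicted & actual)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(actual) if actual else 0.0
    return precision, recall
```

    In this sketch, `cluster_pages` returns lists of page indices, and `precision_recall` scores one such list against the indices of the pages that truly belong to the target template. This mirrors how a per-cluster precision and recall, like the figures quoted in the abstract, could be computed, but it is not the thesis's evaluation code.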
    Appears in Collections:[Executive Master of Computer Science and Information Engineering] Electronic Thesis & Dissertation


    All items in NCUIR are protected by copyright, with all rights reserved.
