NCU Institutional Repository: Item 987654321/93132
    Please use this identifier to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/93132


    Title: Retrieval-based Question-Answering System based on Generated Dataset and Further Pretraining (基於生成資料集和進一步預訓練之百科問答系統)
    Authors: 馮智詮;Feng, Zhi-Quan
    Contributors: Department of Computer Science and Information Engineering (資訊工程學系)
    Keywords: Deep Learning;Natural Language Processing;Document Retrieval;Machine Reading Comprehension;Question Answering System
    Date: 2023-07-19
    Issue Date: 2024-09-19 16:43:58 (UTC+8)
    Publisher: National Central University (國立中央大學)
    Abstract: In recent years, with the rapid development of natural language processing, a variety of pretraining algorithms for Transformer-based[1] neural language models have been developed, along with accompanying datasets and strong training results, from early models such as BERT[2] and RoBERTa[3] to later ones such as DPR[4]. These include dual-encoder (DSSM) document retrievers for retrieval-based open-domain question answering and neural language models for the downstream task of machine reading comprehension. Such systems have seen numerous implementations and improvements over the past few years, but the Chinese question answering domain often lacks large-scale open datasets that closely match the retrieval task, comparable to the English PAQ[5] dataset, for training both the dual-encoder retriever and the reading comprehension model. This study therefore uses a generative model to build a large-scale passage-question dataset from an open-source Chinese news corpus, and uses that dataset to strengthen the system's text retrieval capability and the model's reading comprehension ability. Specifically, the system consists of three main parts.
    The first part is data collection. The pretrained MT5[6] model is used to generate the required dataset, QNews, and the generated data are then cleaned to retain reasonably well-formed questions and passages of appropriate length.
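    As a rough illustration of this step, the sketch below generates a candidate question for a news passage with an mT5 checkpoint from Hugging Face transformers and applies a toy length filter. The checkpoint name, decoding settings, and filtering thresholds are assumptions for illustration only; the abstract does not specify them, and in practice the mT5 model would first be fine-tuned for question generation.

```python
# Illustrative sketch only: generating a question for a passage with mT5 and
# applying a simple cleaning rule. Checkpoint, decoding settings, and length
# thresholds are assumptions, not details taken from the thesis.
from transformers import AutoTokenizer, MT5ForConditionalGeneration

MODEL_NAME = "google/mt5-base"  # hypothetical choice; the thesis names no checkpoint
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = MT5ForConditionalGeneration.from_pretrained(MODEL_NAME)

def generate_question(passage: str, max_question_len: int = 64) -> str:
    """Generate one candidate question for a news passage."""
    inputs = tokenizer(passage, return_tensors="pt", truncation=True, max_length=512)
    output_ids = model.generate(
        **inputs,
        max_length=max_question_len,
        num_beams=4,
        early_stopping=True,
    )
    return tokenizer.decode(output_ids[0], skip_special_tokens=True)

def keep_pair(passage: str, question: str) -> bool:
    """Toy cleaning rule: keep pairs whose passage and question lengths fall in
    a plausible range. The real QNews filtering criteria are not given."""
    return 50 <= len(passage) <= 1000 and 5 <= len(question) <= 60
```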
    The second part uses the passage-question pairs in QNews for domain-matched retrieval pretraining of the dual-encoder (DSSM) retriever, enhancing its retrieval performance.
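    The standard objective for this kind of dual-encoder retrieval training, as popularized by DPR, is a contrastive loss with in-batch negatives; a minimal sketch follows. The encoder architecture and any hard-negative mining are omitted, since the abstract does not describe them.

```python
# Sketch of dual-encoder retrieval pretraining with in-batch negatives, the
# standard DPR-style objective. The towers (typically BERT-style encoders)
# and hyperparameters are assumptions; the abstract only states that QNews
# passage-question pairs are used for domain-matched retrieval pretraining.
import torch
import torch.nn.functional as F

def in_batch_contrastive_loss(q_emb: torch.Tensor, p_emb: torch.Tensor) -> torch.Tensor:
    """q_emb: (B, d) question embeddings from the question tower.
    p_emb: (B, d) passage embeddings from the passage tower.
    Row i of the similarity matrix treats passage i as the positive and
    every other passage in the batch as a negative."""
    sim = q_emb @ p_emb.T                                   # (B, B) dot-product scores
    labels = torch.arange(q_emb.size(0), device=q_emb.device)
    return F.cross_entropy(sim, labels)

# Usage inside a training step (towers omitted for brevity):
# q_emb = question_encoder(question_batch)
# p_emb = passage_encoder(passage_batch)
# loss = in_batch_contrastive_loss(q_emb, p_emb)
# loss.backward()
```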
    The third part further pretrains the reading comprehension model on a length-sampled subset of QNews, with constraints that keep the model parameters within a limited range of their pretrained values.
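    The abstract does not name the constraint used. One common choice for keeping further-pretrained weights close to their starting point is an L2 penalty toward a snapshot of the pretrained parameters (the L2-SP regularizer); the sketch below shows that variant purely as an assumption.

```python
# One common way to bound parameter drift during further pretraining is an
# L2 penalty toward the initial weights (L2-SP). The thesis abstract does not
# specify its constraint, so this is an illustrative assumption only.
import torch

def anchored_penalty(model: torch.nn.Module,
                     init_params: dict[str, torch.Tensor],
                     strength: float = 0.01) -> torch.Tensor:
    """Sum of squared deviations from the snapshot taken before further pretraining."""
    penalty = torch.zeros((), device=next(model.parameters()).device)
    for name, param in model.named_parameters():
        penalty = penalty + (param - init_params[name]).pow(2).sum()
    return strength * penalty

# Before training: snapshot the pretrained weights.
# init_params = {n: p.detach().clone() for n, p in model.named_parameters()}
# Per training step: total_loss = pretraining_loss + anchored_penalty(model, init_params)
```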
    Through these three main steps, this study aims to reduce, to a certain extent, the mismatch between the data format of the dual-encoder's pretraining task and that of its downstream task in traditional retrieval-based open-domain encyclopedic question answering systems, and to improve the performance of the neural language model on the downstream reading comprehension task.
    Appears in Collections: [Graduate Institute of Computer Science and Information Engineering] Electronic Thesis & Dissertation
