Improving BERT Model with Synthesizers based Mixed-Attentions for Scientific Language Editing

Please use this identifier to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/86864

Title:	Improving BERT Model with Synthesizers based Mixed-Attentions for Scientific Language Editing
Authors:	王昱翔;Wang, Yuh-Shyang
Contributors:	Department of Electrical Engineering
Keywords:	Scientific English;writing evaluation;pre-trained language models;mixed-attentions;synthesizers
Date:	2021-10-06
Issue Date:	2021-12-07 13:21:39 (UTC+8)
Publisher:	National Central University
Abstract:	Automated writing assessment can help writers reduce errors in expression and improve writing quality. This is especially valuable for scientific papers, many of whose authors are not native English speakers: an automated evaluation tool can save the time and labor cost of proofreading. We propose the SynBERT model, which extracts sentence information to classify whether sentences in scientific English papers require language editing. We use ELECTRA, a BERT-derived model, as the base architecture and improve it by using scientific papers as training data and by integrating three types of attention (self-attention, span-based dynamic convolution, and random synthesizer attention) into the proposed synthesizers based mixed-attentions. We use token replacement detection as the pre-training task of the language model and then fine-tune the pre-trained model on the grammatical error detection task. We use the AESW 2016 dataset as the experimental data for model evaluation. The goal of this task is to determine whether a sentence needs language editing to meet the writing style of scientific papers. The dataset provides a training set, a development set, and a test set containing 1,196,940, 148,478, and 143,804 instances, respectively, and about 40% of the sentences need language editing. Experimental results and error analysis show that the proposed SynBERT model achieves the best F1-score of 65.26%, outperforming both the methods used in the competition (i.e., MaxEnt, SVM, LSTM, and CNN) and more recent models (i.e., BERT, RoBERTa, XLNet, and ELECTRA).
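The abstract names the attention types that are mixed but not how they are combined. The sketch below is only an illustration of the general idea, not the thesis implementation: it contrasts ordinary self-attention, where the weights are computed from the tokens, with a random synthesizer, where the weights come from a matrix that is independent of the input. Concatenating the two outputs per token is an assumption made here, the function names are hypothetical, query/key/value projections are skipped, and span-based dynamic convolution is left out for brevity.

```python
import math
import random


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def self_attention(q, k, v):
    """Scaled dot-product attention: weights depend on the tokens themselves."""
    scale = math.sqrt(len(q[0]))
    scores = [[sum(a * b for a, b in zip(qi, kj)) / scale for kj in k] for qi in q]
    weights = [softmax(row) for row in scores]
    return matmul(weights, v)


def random_synthesizer_attention(r, v):
    """Random synthesizer: weights come from a seq_len x seq_len matrix r
    that does not depend on the input tokens at all."""
    weights = [softmax(row) for row in r]
    return matmul(weights, v)


def mixed_attention(x, r):
    """Assumed mixing step: concatenate both outputs for each token.
    x is used directly as queries, keys, and values (no projections)."""
    sa = self_attention(x, x, x)
    rs = random_synthesizer_attention(r, x)
    return [a + b for a, b in zip(sa, rs)]  # list concatenation per token


random.seed(0)
seq_len, dim = 4, 3
x = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(seq_len)]
# In a real model r would be a trained parameter; here it is only randomly initialized.
r = [[random.gauss(0, 1) for _ in range(seq_len)] for _ in range(seq_len)]

mixed = mixed_attention(x, r)
print(len(mixed), len(mixed[0]))  # 4 tokens, each with 2 * dim = 6 features
```

Because the synthesizer weights ignore token content, they can only encode position-to-position patterns, which is one plausible reason to pair them with content-based self-attention rather than use them alone.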
Appears in Collections:	[Graduate Institute of Electrical Engineering] Electronic Thesis & Dissertation

Files in This Item:

File	Description	Size	Format
index.html		0Kb	HTML	130
