Data Exploration on Climate Text Records through Natural Language Processing and Statistical Analysis–An attempt to experiment on temperature and locusts relative events during Ming and Qing Dynasty

Please use this identifier to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/86422

Title:	Data Exploration on Climate Text Records through Natural Language Processing and Statistical Analysis–An attempt to experiment on temperature and locusts relative events during Ming and Qing Dynasty
Authors:	黃詩芸;Huang, Shi-Yun
Contributors:	Department of Computer Science and Information Engineering
Keywords:	Historical Climate Research;Data Exploration;Text Mining;Multi-label Classification;Keyword Extraction;BERT
Date:	2021-04-06
Issue Date:	2021-12-07 12:49:05 (UTC+8)
Publisher:	National Central University
Abstract:	Climate change has long been an issue of international concern, and historical climate research is an important part of studying it. Because the definition of historical climate research varies with the data sources and research methods involved, this study focuses on methods for analyzing historical documents. The core of this report is a data exploration process; data exploration techniques help analysts grasp the overall shape of a dataset efficiently through visual exploration. The material analyzed is the set of text records of climate-related events from the Ming and Qing dynasties in the Compendium of Meteorological Records of China in the Last 3000 Years (《中國三千年氣象記錄總集》), and the analysis also attempts to bring in text mining techniques and data simulated by climate models. The research consists of two main parts. The first part is text mining, which aims to extract information useful for later analysis from unstructured text. The classification scheme for climate event types and the training data both follow the REACHES research. The classification model follows the deep learning architecture proposed by BERT: after adjusting the fine-tuning procedure for the downstream classification task, we use the architecture's self-attention mechanism to design a multi-label classification method that can also extract, from the classification results, the keywords corresponding to each class. The automatically collected keyword list can then be checked and adjusted manually, and attributes are attached to the keywords according to their characteristics to build a relational table (database). Class labels, keywords, and keyword attributes can then be combined flexibly to extract the data relevant to different research goals. In the second part, we apply descriptive statistics and visualization to the extracted data to present its spatial and temporal distribution and overall trends. As a first step, we compile observations on records related to temperature anomalies and locusts (locust plagues), and we discuss the characteristics and limitations of the records as well as the possibility of further research that combines them with climate model simulation data.
Appears in Collections:	[Graduate Institute of Computer Science and Information Engineering] Electronic Thesis & Dissertation
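
The text mining step summarized in the abstract (multi-label classification followed by label-specific keyword extraction) can be illustrated with a minimal sketch. This is not the thesis's implementation: the label names, English tokens, logits, and attention weights below are hypothetical toy values, and the sketch assumes that a fine-tuned model has already produced one logit and one attention vector per label for each record.

```python
import math
from collections import defaultdict


def sigmoid(x):
    """Map a raw score to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def predict_labels(logits, threshold=0.5):
    """Multi-label decision: each label is accepted independently of the others."""
    return [label for label, z in logits.items() if sigmoid(z) >= threshold]


def top_keywords(tokens, attention, k=2):
    """Rank tokens by attention weight and keep the k highest-weighted ones."""
    ranked = sorted(zip(tokens, attention), key=lambda pair: pair[1], reverse=True)
    return [token for token, _ in ranked[:k]]


def collect_label_keywords(records, threshold=0.5, k=2):
    """Build a label -> keyword list table from the predicted labels of each record."""
    table = defaultdict(set)
    for rec in records:
        for label in predict_labels(rec["logits"], threshold):
            table[label].update(top_keywords(rec["tokens"], rec["attention"][label], k))
    return {label: sorted(words) for label, words in table.items()}


# Hypothetical toy records; the real input would be classical Chinese text
# with scores produced by the fine-tuned classifier.
RECORDS = [
    {
        "tokens": ["summer", "locusts", "ate", "crops"],
        "logits": {"locust": 2.1, "temperature": -1.7},
        "attention": {
            "locust": [0.05, 0.70, 0.10, 0.15],
            "temperature": [0.40, 0.20, 0.20, 0.20],
        },
    },
    {
        "tokens": ["winter", "severe", "cold", "river", "froze"],
        "logits": {"locust": -3.0, "temperature": 1.4},
        "attention": {
            "locust": [0.20, 0.20, 0.20, 0.20, 0.20],
            "temperature": [0.10, 0.15, 0.40, 0.05, 0.30],
        },
    },
]

LABEL_KEYWORDS = collect_label_keywords(RECORDS)
```

In this sketch, `LABEL_KEYWORDS` plays the role of the automatically collected keyword list that the abstract says is then reviewed by hand and extended with keyword attributes. Thresholding each label separately, rather than picking the single best label, is what lets one record carry several class labels.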

