Combining Natural Language Processing and Machine Learning Techniques to Explore Applications of Document Classification

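Name

Mi-Ping Tseng (曾米嬪)

Department

Graduate Institute of Industrial Management (in-service master's program)

Thesis title

Combining Natural Language Processing and Machine Learning Techniques to Explore Applications of Document Classification
(Registered English title: Combine Natural Language Processing and Machine Learning Technology Explore the Application of Automatic Text Classification)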

Files

Full text via the thesis system: never open to the public

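Abstract (Chinese)

With the rapid development of information technology and the spread of the Internet, artificial intelligence (AI) techniques have steadily improved and have given rise to many intelligent technologies based on machine learning and deep learning. Examples include applications that draw on large amounts of collected information, such as customer-service chatbots and the automatic product recommendations common on e-commerce platforms. To help users quickly find the matching classification topic among historical cases that contain large amounts of text, this study combines natural language processing (NLP) with machine learning algorithms to explore applications of automatic document classification and to identify a technical approach suited to the task.
For the dataset, this study uses unstructured text that already carries manually assigned classification labels (target labels). The research steps are text preprocessing, text feature extraction, classification modeling, and model evaluation. The raw text is first segmented into its smallest word units with NLP methods. Features are then extracted with two techniques, TF-IDF term weighting and Word2vec word vectors, and scikit-learn classification models are trained on them: Bayesian classification, support vector machine (SVM), the KNN algorithm, and extreme gradient boosting (XGBoost).
The experiments use 10-fold cross-validation. The model trained with Word2Vec features and the XGBoost classifier outperforms every other combination, with an average prediction score (F1) of 88.78%; the 10 runs range from 85.15% to 90.78%.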
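The TF-IDF feature-extraction step named in the abstract can be illustrated with a small sketch. It is written in plain Python so the arithmetic is visible, and it is not the thesis code. The function name `tf_idf`, the toy documents, and the smoothed IDF variant are assumptions, since this record does not say which TF-IDF formula or library settings the author used.

```python
import math
from collections import Counter


def tf_idf(tokenized_docs):
    """Return one {term: weight} dict per document.

    Assumed variant: tf = count / document length, and a smoothed
    idf = ln((1 + N) / (1 + df)) + 1. Other variants exist.
    """
    n_docs = len(tokenized_docs)
    # Document frequency: the number of documents each term appears in.
    doc_freq = Counter(term for doc in tokenized_docs for term in set(doc))
    weights = []
    for doc in tokenized_docs:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return weights


# Hypothetical, already-segmented toy documents (not the thesis data).
docs = [
    ["printer", "paper", "jam"],
    ["printer", "driver", "install"],
    ["network", "cable", "fault"],
]
vectors = tf_idf(docs)
```

In this toy corpus, "paper" occurs in one document and "printer" in two, so "paper" receives the larger weight in the first document. That is the intended effect of the IDF factor: terms that are rare across the corpus count for more.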

Abstract (English)

With the development of information technology and the spread of the Internet, artificial intelligence technology has steadily improved and has led to many intelligent technologies based on machine learning and deep learning. Examples include applications that collect large amounts of information, such as chatbots that help customer service answer inquiries and the automatic product recommendations that often appear on e-commerce platforms. To give users a quick reference to the relevant classification topics in text-heavy historical cases, this research combines natural language processing and machine learning techniques to study automatic text classification and to identify a suitable, feasible solution.
For the dataset, this study uses unstructured, human-labelled text data. The research steps include text preprocessing, text feature extraction, classification modeling, and model evaluation. NLP methods are used to segment the original text into the smallest word units. Text features are then extracted with TF-IDF and Word2vec and passed to scikit-learn classification models (Bayesian classification, support vector machine, KNN algorithm, and XGBoost).
Finally, the experimental results after 10-fold cross-validation show that the combination of Word2Vec features and the XGBoost classifier outperforms the other combinations, reaching an average F1 value of 88.78% (the 10 results range between 85.15% and 90.78%).
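The evaluation protocol in the abstract (10-fold cross-validation, reported as a mean F1 with a min-to-max range) can be sketched as follows. This is a minimal standard-library illustration, not the thesis code. The helper names, the contiguous unshuffled folds, the macro averaging of F1, and the toy lookup classifier are all assumptions, because this record does not state how the folds were built or which F1 average was reported.

```python
from collections import Counter, defaultdict
from statistics import mean


def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) for k contiguous, unshuffled folds."""
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        test_idx = list(range(start, start + size))
        train_idx = list(range(start)) + list(range(start + size, n_samples))
        yield train_idx, test_idx
        start += size


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return mean(scores)


def cross_validate(samples, labels, fit, predict, k=10):
    """Train on k-1 folds, score the held-out fold, return the k F1 values."""
    fold_scores = []
    for train_idx, test_idx in k_fold_indices(len(samples), k):
        model = fit([samples[i] for i in train_idx],
                    [labels[i] for i in train_idx])
        predictions = [predict(model, samples[i]) for i in test_idx]
        fold_scores.append(macro_f1([labels[i] for i in test_idx], predictions))
    return fold_scores


# Hypothetical stand-in classifier: remembers the most common label per token.
def fit(train_samples, train_labels):
    votes = defaultdict(Counter)
    for sample, label in zip(train_samples, train_labels):
        votes[sample][label] += 1
    default = Counter(train_labels).most_common(1)[0][0]
    table = {tok: c.most_common(1)[0][0] for tok, c in votes.items()}
    return table, default


def predict(model, sample):
    table, default = model
    return table.get(sample, default)


# Toy data (not the thesis dataset): 20 one-token documents, two classes.
samples = ["printer", "network"] * 10
labels = ["hardware", "it"] * 10
scores = cross_validate(samples, labels, fit, predict, k=10)
summary = (mean(scores), min(scores), max(scores))
```

The `summary` tuple has the same shape as the figures in the abstract: an average score plus the lowest and highest fold results. With a real classifier, `fit` and `predict` would wrap the trained model instead of the lookup table used here.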

Keywords

★ Natural Language Processing
★ Supervised Machine Learning
★ TF-IDF
★ Word2vec
★ Text Classification
★ XGBoost

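Table of contents

Abstract (Chinese)
Abstract (English)
Acknowledgments
Table of Contents
List of Tables
List of Figures
1. Introduction
1-1 Research Motivation
1-2 Research Background
1-3 Research Objectives and Scope
2. Background Knowledge and Related Literature
2-1 Text Classification
2-2 Literature on Text Classification
2-3 Text Mining
3. Research Methods
3-1 Research Framework
3-2 Data Preprocessing
3-3 Text Feature Representation
3-4 Classification Models and Evaluation
4. Experimental Design and Results
4-1 Experiment Preparation
4-2 Dataset Preprocessing
4-3 Experiment 1: Comparison of Text Feature Extraction Methods
4-4 Experiment 2: Effect of Data Quality on Classification Performance
4-5 Experiment 3: Effect of Different Classifiers
5. Conclusion
References
Appendices
Appendix 1: Sample of the Custom Vocabulary Used in This Study (Partial)
Appendix 2: Python Packages Used in This Study
Appendix 3: Descriptive Statistics of the Dataset Categories (4-1, 4-2)
Appendix 4: Experiment 1, Results for the Text Feature Methods
Appendix 5: Experiment 2, 10-Fold Cross-Validation Results for Different Data Field Combinations
Appendix 6: Experiment 3, 10-Fold Cross-Validation Results for Different Classification Methods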

Advisor

Fu-Shiang Tseng (曾富祥)

Approval date

2022-1-17
