結合自然語言處理及機器學習技術，探討文件分類之應用;Combine Natural Language Processing and Machine Learning Technology Explore the Application of Automatic Text Classification

Please use this permanent URL to cite or link to this document: http://ir.lib.ncu.edu.tw/handle/987654321/88091

Title:	結合自然語言處理及機器學習技術，探討文件分類之應用;Combine Natural Language Processing and Machine Learning Technology Explore the Application of Automatic Text Classification
Author:	曾米嬪;Tseng, Mi-Ping
Contributor:	In-service Master's Program, Graduate Institute of Industrial Management (工業管理研究所在職專班)
Keywords:	Natural Language Processing;Supervised Machine Learning;TF-IDF;Word2vec;Text Classification;XGBoost
Date:	2022-01-17
Upload time:	2022-07-13 17:57:51 (UTC+8)
Publisher:	National Central University (國立中央大學)
Abstract:	With the rapid development of information technology and the spread of the Internet, artificial intelligence (AI) has advanced steadily and has given rise to many intelligent technologies based on machine learning and deep learning. Applications that draw on large volumes of collected information include chatbots that assist customer service and the automatic product recommendations common on e-commerce platforms. To let users quickly obtain a relevant classification topic for reference from historical cases that contain large amounts of text, this study combines natural language processing (NLP) with machine learning algorithms, explores applications of automatic text classification, and seeks a suitable technical solution. The dataset consists of unstructured text that already carries manually assigned category labels (target labels). The research steps comprise text preprocessing, text feature extraction, classification modelling, and model evaluation. NLP methods first segment the raw text into its smallest word units. Features are then extracted with two techniques (TF-IDF term-frequency weighting and Word2vec word vectors) and classified with scikit-learn classification models (Bayesian classification, support vector machine, the KNN algorithm, and extreme gradient boosting, XGBoost). Under 10-fold cross-validation, the model trained with Word2vec features and the XGBoost classifier outperformed every other combination, reaching an average F1 score of 88.78% (the 10 runs ranged from 85.15% to 90.78%).
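The evaluation loop described in the abstract (extract features, train a classifier, score it with F1 under 10-fold cross-validation) can be sketched as follows. This is only an illustration, not the thesis implementation: it swaps in a pure-Python TF-IDF and a cosine-similarity KNN so that it runs with the standard library alone, rather than the scikit-learn, Word2vec, and XGBoost components the study reports. The toy corpus, all function names, `k=3`, whitespace tokenization, the interleaved fold assignment, and macro averaging of F1 are assumptions made for the example.

```python
import math
from collections import Counter


def tfidf_fit(docs):
    """Learn smoothed IDF weights from tokenized training documents."""
    n = len(docs)
    df = Counter(term for doc in docs for term in set(doc))
    return {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}


def tfidf_transform(doc, idf):
    """Map one tokenized document to an L2-normalized sparse TF-IDF vector."""
    tf = Counter(term for term in doc if term in idf)  # unseen terms are dropped
    vec = {term: count * idf[term] for term, count in tf.items()}
    norm = math.sqrt(sum(w * w for w in vec.values()))
    return {t: w / norm for t, w in vec.items()} if norm else {}


def cosine(a, b):
    """Cosine similarity of two L2-normalized sparse vectors (a dot product)."""
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def knn_predict(train_vecs, train_labels, vec, k=3):
    """Majority vote among the k most similar training vectors."""
    ranked = sorted(range(len(train_vecs)),
                    key=lambda i: cosine(train_vecs[i], vec), reverse=True)
    votes = Counter(train_labels[i] for i in ranked[:k])
    return votes.most_common(1)[0][0]


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 (the averaging mode is an assumption)."""
    scores = []
    for label in sorted(set(y_true) | set(y_pred)):
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        scores.append(2 * tp / (2 * tp + fp + fn))
    return sum(scores) / len(scores)


def cross_validate(texts, labels, n_folds=10, k=3):
    """Return one macro-F1 score per fold; document i goes to fold i % n_folds."""
    docs = [t.lower().split() for t in texts]  # stand-in for real word segmentation
    fold_scores = []
    for fold in range(n_folds):
        test_idx = [i for i in range(len(docs)) if i % n_folds == fold]
        train_idx = [i for i in range(len(docs)) if i % n_folds != fold]
        idf = tfidf_fit([docs[i] for i in train_idx])  # fit on training folds only
        train_vecs = [tfidf_transform(docs[i], idf) for i in train_idx]
        train_labels = [labels[i] for i in train_idx]
        preds = [knn_predict(train_vecs, train_labels,
                             tfidf_transform(docs[i], idf), k) for i in test_idx]
        fold_scores.append(macro_f1([labels[i] for i in test_idx], preds))
    return fold_scores


# Hypothetical toy corpus: 10 short documents per class, built from word pairs.
TOPICS = {
    "billing": ["invoice", "refund", "payment", "charge", "card"],
    "shipping": ["parcel", "delivery", "courier", "tracking", "warehouse"],
}
texts, labels = [], []
for topic, w in TOPICS.items():
    for i in range(10):
        texts.append(f"{w[i % 5]} {w[(i + 1 + i // 5) % 5]} issue")
        labels.append(topic)

scores = cross_validate(texts, labels, n_folds=10, k=3)
print(f"mean macro-F1 over {len(scores)} folds: {sum(scores) / len(scores):.4f}")
print(f"fold range: {min(scores):.4f} to {max(scores):.4f}")
```

Fitting the IDF weights inside each fold keeps test documents from leaking into the features. Reporting the per-fold minimum and maximum alongside the mean mirrors how the abstract presents its result as an average plus a range.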
Appears in collections:	[In-service Master's Program, Graduate Institute of Industrial Management] Theses and Dissertations

All items in NCUIR are protected by the original copyright.
