The Effect of Feature Selection on Different Data Types


Please use this permanent URL to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/74799

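Title:	The Effect of Feature Selection on Different Data Types (特徵屬性篩選對於不同資料類型之影響)
Author:	歐先弘; Leo, Hsien-Hung
Contributor:	In-service Master Program, Department of Information Management
Keywords:	Data Mining; Feature Selection; Classification Algorithm
Date:	2017-08-21
Upload time:	2017-10-27 14:39:41 (UTC+8)
Publisher:	National Central University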
Abstract:	Feature selection is an important data preprocessing step in data mining. Given a dataset, its main purpose is to remove irrelevant or redundant features. No existing study has tested each category of feature selection method against three data types (numerical, categorical, and mixed). This study therefore selects three feature selection techniques, Information Gain (IG), Genetic Algorithm (GA), and Decision Tree (DT), and compares classification performance with and without feature selection across datasets of different types. Forty real-world datasets from different domains were obtained from UCI, and the results were validated on six classifiers: Support Vector Machines (SVM), K-Nearest Neighbor (KNN), Decision Tree (DT), Artificial Neural Network (ANN), AdaBoost, and Bagging. Using accuracy as the measure, the study aims to identify which feature selection method improves which classifier on datasets with which characteristics, as a reference for analysts running experiments. According to the results: for categorical data, baseline accuracy (no feature selection) is best with any single classifier or with AdaBoost, so a feature selection step is not recommended; for categorical data under Bagging with KNN as the base classifier, DT feature selection gives higher accuracy than the other methods; for mixed-type data, GA or DT feature selection (but not IG) gives higher accuracy than the baseline; for numerical data, GA or DT feature selection likewise gives higher accuracy than the baseline; and for numerical data with MLP, baseline accuracy is best, so feature selection is not recommended. For a given data type and chosen classifier, analysts can consult this study and try the feature selection method with the best accuracy first.
Feature selection is an important process for pattern recognition applications. Its purpose is to avoid degradation of classifier performance. The removed features must be redundant, irrelevant, or of the least possible use. No related study compares different feature selection methods across different data types, such as categorical, numerical, and mixed-type datasets, in terms of classification performance. Therefore, in this thesis, three major feature selection methods were chosen, namely Information Gain (IG), Genetic Algorithm (GA), and Decision Tree (DT), and the research aim is to compare the classification accuracy obtained with these feature selection methods over different types of datasets. We support the results with extensive experiments on 40 real-world datasets from UCI. In addition, six different classification techniques are compared: Support Vector Machines (SVM), K-Nearest Neighbor (KNN), Decision Tree (DT), Artificial Neural Network (ANN), AdaBoost, and Bagging.
The experimental results show that the need for feature selection on categorical datasets is not strong. However, bagging-based KNN combined with DT feature selection can improve performance. For the mixed-type and numerical datasets, GA and DT perform better. In particular, if MLP is used, there is no need for feature selection on numerical datasets. We demonstrate that different feature selection methods can increase the accuracy of some classification models.
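As an illustration of one of the three feature selection methods named in the abstract, the following is a minimal sketch of Information Gain (IG) ranking for categorical features. It is not the thesis's implementation; the record does not state which tools were used. The function names (`entropy`, `information_gain`, `select_top_k`) and the toy data are hypothetical and exist only for this example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(feature_values, labels):
    """Reduction in label entropy after splitting on a categorical feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted average entropy of the labels within each feature value.
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def select_top_k(columns, labels, k):
    """Return the names of the k features with the highest information gain."""
    ranked = sorted(
        columns,
        key=lambda name: information_gain(columns[name], labels),
        reverse=True,
    )
    return ranked[:k]


# Made-up toy data (an assumption for illustration, not from the thesis).
columns = {
    "outlook": ["sunny", "sunny", "rain", "rain"],
    "windy": ["yes", "no", "yes", "no"],
}
labels = ["no", "no", "yes", "yes"]

selected = select_top_k(columns, labels, k=1)
```

In this toy data, `outlook` separates the two classes perfectly while `windy` carries no information about the label, so ranking by information gain keeps `outlook`. Numerical features would need discretization before this kind of ranking applies, which is one reason the data type can matter for the choice of method.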
Appears in Collections:	[In-service Master Program, Department of Information Management] Theses and Dissertations

Files in This Item:

File	Description	Size	Format	Views
index.html		0Kb	HTML	350

All items in NCUIR are protected by original copyright.
