Machine learning development of imbalanced data and application of visual recognition model


Please use this permanent URL to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/81636

Title:	Machine learning development of imbalanced data and application of visual recognition model
Author:	Hsu, Che-Chang (許哲彰)
Contributor:	Department of Mechanical Engineering
Keywords:	SVM-rebalancing; visual recognition model; multidimensional scaling
Date:	2019-07-24
Uploaded:	2019-09-03 16:34:34 (UTC+8)
Publisher:	National Central University
Abstract:	Imbalanced data sets are a common problem in many machine learning applications. When some classes in the training set have many samples and others have relatively few, traditional classifiers tend to misclassify the minority classes, and addressing this has become a current challenge in machine learning. Working at the algorithm level, this study proposes a new model that combines a Bayesian classifier with a support vector machine (SVM), namely SVM-rebalancing. During learning, the rebalance parameter (a classification weight parameter) coordinates the classification weights of the classes so that they tend toward balance, and solving the rebalancing programming problem makes the minority-class samples effectively identifiable. The secondary aim of this study is to understand whether imbalance is the only source of misclassification or whether other factors also contribute. Purely predictive pattern recognition models lack visually interpretable information: black-box methods such as neural networks and support vector machines do not provide an interpretable model, so the root causes of misclassification cannot be traced. This study therefore proposes applying multidimensional scaling to the kernel function as a pre-processing step to construct a low-dimensional visual representation space for the data. In practice, the visual recognition model shows that overlapping, multimodal, and skewed distributions in the data are further causes of poor classifier performance. Finally, this study suggests that such a visual recognition model strategy can reveal problems in the data structure, so that further efforts to improve classifier performance can be directed at those problems.
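The idea of applying multidimensional scaling to a kernel function can be illustrated with a minimal sketch. This is not the thesis's implementation: the abstract does not specify the kernel, the MDS variant, or the data, so the RBF kernel, the classical (Torgerson) MDS, the `gamma` value, the synthetic 90/10 imbalanced sample, and all function names below are assumptions made for illustration only.

```python
import numpy as np

def rbf_kernel(X, gamma=1.0):
    """RBF kernel matrix (assumed kernel; the abstract does not name one)."""
    sq_norms = np.sum(X ** 2, axis=1)
    sq_dists = sq_norms[:, None] + sq_norms[None, :] - 2.0 * X @ X.T
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def kernel_to_sq_distances(K):
    """Squared distances in the kernel feature space:
    d(x, y)^2 = k(x, x) + k(y, y) - 2 k(x, y)."""
    diag = np.diag(K)
    return np.maximum(diag[:, None] + diag[None, :] - 2.0 * K, 0.0)

def classical_mds(D2, n_components=2):
    """Classical MDS: double-center the squared distances, then keep the
    leading eigenvectors scaled by the square roots of their eigenvalues."""
    n = D2.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centering matrix
    B = -0.5 * J @ D2 @ J                        # double-centered Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)         # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1][:n_components]
    top_vals = np.maximum(eigvals[order], 0.0)   # guard against tiny negatives
    return eigvecs[:, order] * np.sqrt(top_vals)

# Synthetic imbalanced sample (hypothetical): 90 majority, 10 minority points.
rng = np.random.default_rng(0)
X_major = rng.normal(0.0, 1.0, size=(90, 5))
X_minor = rng.normal(1.5, 1.0, size=(10, 5))
X = np.vstack([X_major, X_minor])
y = np.array([0] * 90 + [1] * 10)

K = rbf_kernel(X, gamma=0.1)
embedding = classical_mds(kernel_to_sq_distances(K), n_components=2)
```

Plotting the two columns of `embedding` as a scatter plot colored by `y` gives a low-dimensional view in which overlap between classes, multiple clusters within a class, or skewed spread can be inspected by eye.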
Appears in Collections:	[Graduate Institute of Mechanical Engineering] Theses and Dissertations

Files in This Item:

File	Description	Size	Format	Views
index.html		0Kb	HTML	124

All items in NCUIR are protected by original copyright.
