Abstract: | This work aims to advance several problems in artificial intelligence: speech emotion recognition (SER), acoustic scene classification (ASC), and content-based image retrieval (CBIR). These problems come from different domains and have many practical applications. For example, SER can be used in human-machine interaction and mental healthcare, while ASC helps a system understand its surrounding environment, which is useful for robot navigation, context awareness, and surveillance. CBIR identifies the images in a database that are relevant to a given query image and can be used in many kinds of image search. In this thesis, we propose approaches based on deep neural networks (DNNs) to address these problems. Specifically, we develop a simple yet effective data augmentation (DA) method for SER. SER is difficult because data are scarce and labels are ambiguous, so DNN models are prone to overfitting, which leads to poor generalization on test data. Our DA method creates new data samples that may be noisier or less ambiguous than the original ones, and in experiments on two public datasets it outperforms other DA methods.
In ASC, we focus on the performance degradation that occurs when DNN models are used in a cross-device setting, where the training and test data are recorded with different devices. We propose an ASC system with two DA methods: MixStyleFreq, which reduces domain gaps, and spectrum correction, which mitigates the bias of DNNs toward dominant devices. Compared with other DA methods, these methods significantly improve generalization performance and achieve competitive results. Finally, we develop a fully end-to-end DNN model for the beauty product image retrieval problem in CBIR. This model requires no manual feature aggregation or post-processing, and experimental results on the Perfect-500K dataset demonstrate its effectiveness and high retrieval accuracy. |