擷取有效畫面域與時間域資訊進行深度學習手語辨識;Enhancing Deep-Learning Sign Language Recognition through Effective Spatial and Temporal Information Extraction


Please use this identifier to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/93375

Title:	擷取有效畫面域與時間域資訊進行深度學習手語辨識;Enhancing Deep-Learning Sign Language Recognition through Effective Spatial and Temporal Information Extraction
Authors:	蔡允齊;Tsai, Yun-Chi
Contributors:	Department of Computer Science and Information Engineering
Keywords:	Sign language recognition; key frames; deep learning
Date:	2023-07-28
Issue Date:	2024-09-19 16:56:37 (UTC+8)
Publisher:	National Central University
Abstract:	Automatic sign language recognition based on deep learning requires a large amount of video data for model training. However, producing and collecting sign language videos is time-consuming and tedious, and small or insufficiently diverse datasets limit the accuracy of recognition models. In this study, we propose effective spatial and temporal data extraction methods for sign language recognition. The goal is to augment limited sign language video data in a reasonable way so as to generate a larger and more diverse training set. The resulting data can be fed to relatively simple architectures such as 3D-ResNet, achieving considerable recognition performance without complex or resource-intensive network structures. Our spatial data extraction uses three types of data: skeletons obtained with Mediapipe, hand region patterns or masks, and optical flow. These three data types serve as the three-channel input commonly used by earlier 3D-ResNet models, but unlike conventional RGB input, each channel carries distinct information, which makes feature extraction more effective. For temporal data extraction, we compute and determine key frames to pick out more meaningful frames, which enables different frame selection strategies. The proposed spatial and temporal data can be further augmented to simulate various hand sizes, gesture speeds, shooting angles, etc., which helps considerably in expanding the dataset and increasing its diversity. Experimental results show that our approach significantly improves recognition accuracy on commonly used American Sign Language datasets.
Appears in Collections:	[Graduate Institute of Computer Science and Information Engineering] Electronic Thesis & Dissertation
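
The abstract describes two ideas that a short sketch can make concrete: stacking three single-channel maps (skeleton, hand mask, optical flow) into one three-channel frame, and picking a subset of key frames from a video. The following Python is only an illustrative sketch, not the thesis implementation. The function names `stack_channels` and `select_key_frames`, the use of a per-frame score, and the "keep the top-k scores" rule are all assumptions made here for illustration; the record does not say how key frames are actually computed.

```python
from typing import List, Sequence, Tuple

# One H x W single-channel map, stored as nested lists for simplicity.
# (A real pipeline would more likely use array types; this is an assumption.)
Channel = List[List[float]]
Pixel = Tuple[float, float, float]


def stack_channels(skeleton: Channel, hand_mask: Channel,
                   flow_mag: Channel) -> List[List[Pixel]]:
    """Combine three H x W maps into one H x W frame of 3-tuples.

    The three inputs take the place of the R, G, B planes of an
    ordinary colour frame (hypothetical channel order).
    """
    if not (len(skeleton) == len(hand_mask) == len(flow_mag)):
        raise ValueError("all three maps must have the same height")
    frame = []
    for s_row, m_row, f_row in zip(skeleton, hand_mask, flow_mag):
        if not (len(s_row) == len(m_row) == len(f_row)):
            raise ValueError("all three maps must have the same width")
        frame.append([(s, m, f) for s, m, f in zip(s_row, m_row, f_row)])
    return frame


def select_key_frames(scores: Sequence[float], k: int) -> List[int]:
    """Return the indices of the k highest-scoring frames, in temporal order.

    `scores` is a hypothetical per-frame importance value, for example a
    mean optical-flow magnitude. Ties are broken in favour of earlier frames.
    """
    if k <= 0:
        return []
    k = min(k, len(scores))
    ranked = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return sorted(ranked[:k])
```

For example, `select_key_frames([0.1, 0.9, 0.2, 0.8, 0.05], 2)` keeps frames 1 and 3. Varying `k` or the scoring rule would give different frame selection strategies in the sense the abstract mentions.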
