NCU Institutional Repository (中大機構典藏): Item 987654321/93375


    Please use this permanent URL to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/93375


    Title: 擷取有效畫面域與時間域資訊進行深度學習手語辨識; Enhancing Deep-Learning Sign Language Recognition through Effective Spatial and Temporal Information Extraction
    Author: 蔡允齊; Tsai, Yun-Chi
    Contributors: 資訊工程學系 (Department of Computer Science and Information Engineering)
    Keywords: sign language recognition; keyframes; deep learning
    Date: 2023-07-28
    Upload time: 2024-09-19 16:56:37 (UTC+8)
    Publisher: National Central University (國立中央大學)
    Abstract: Automatic sign language recognition based on deep learning requires a large amount of video data for model training. However, creating and collecting sign language videos is time-consuming and tedious, and limited or insufficiently diverse datasets restrict the accuracy of sign language recognition models. In this study, we propose effective spatial and temporal data extraction methods for sign language recognition. The goal is to augment the limited sign language video data into a larger and more diverse training dataset. The augmented data, used as input to deep-learning networks, can be paired with simpler architectures such as 3D-ResNet, achieving considerable sign language recognition performance without complex or resource-intensive network structures.
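    As a rough illustration of the kind of lightweight backbone this paragraph refers to, the sketch below builds a small 3D-ResNet classifier from torchvision's r3d_18. This is a minimal sketch, not the thesis code; the class count and input dimensions are placeholder assumptions.

```python
import torch
import torch.nn as nn
from torchvision.models.video import r3d_18

NUM_CLASSES = 100  # hypothetical sign vocabulary size, not from the thesis

model = r3d_18(weights=None)  # a comparatively simple 3D-ResNet backbone
model.fc = nn.Linear(model.fc.in_features, NUM_CLASSES)

# Input layout is (batch, channels, frames, height, width); here the three
# channels would carry skeleton, hand mask, and optical flow instead of RGB.
clip = torch.randn(2, 3, 16, 112, 112)
logits = model(clip)  # -> shape (2, NUM_CLASSES)
```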
    Our spatial data extraction employs three types of data: skeletons obtained with MediaPipe, hand-region shapes or masks, and optical flow. These can serve as a three-channel input, akin to the RGB input commonly used in earlier 3D-ResNet models; unlike plain RGB, however, each of our channels carries distinct cues that make feature extraction more effective. For temporal data extraction, we compute and select keyframes to capture the more meaningful frames, enabling different frame selection strategies.
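    A minimal sketch of how such a three-channel frame and a motion-based keyframe score might be assembled with MediaPipe and OpenCV. The bounding-box hand mask, the Farneback optical flow, and the mean-motion keyframe score are simplifying assumptions for illustration, not the thesis's exact procedure.

```python
import cv2
import numpy as np
import mediapipe as mp

def three_channel_frame(prev_gray, frame_bgr, hands):
    """Stack skeleton, hand mask, and optical-flow magnitude into one frame."""
    h, w = frame_bgr.shape[:2]
    result = hands.process(cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB))

    skeleton = np.zeros((h, w), np.uint8)
    mask = np.zeros((h, w), np.uint8)
    if result.multi_hand_landmarks:
        for hand in result.multi_hand_landmarks:
            pts = np.array([(int(p.x * w), int(p.y * h)) for p in hand.landmark])
            for x, y in pts:  # draw landmark points as a crude skeleton channel
                cv2.circle(skeleton, (int(x), int(y)), 2, 255, -1)
            # crude hand-region mask from the landmarks' bounding box
            x0, y0 = pts.min(axis=0)
            x1, y1 = pts.max(axis=0)
            cv2.rectangle(mask, (int(x0), int(y0)), (int(x1), int(y1)), 255, -1)

    gray = cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    mag = np.linalg.norm(flow, axis=2)
    mag = cv2.normalize(mag, None, 0, 255, cv2.NORM_MINMAX).astype(np.uint8)
    # also return the mean motion as a per-frame score for keyframe selection
    return np.stack([skeleton, mask, mag], axis=-1), gray, float(mag.mean())

def select_keyframes(motion_scores, k):
    # keep the k frames with the largest average motion, in temporal order
    idx = np.argsort(motion_scores)[-k:]
    return np.sort(idx)

hands = mp.solutions.hands.Hands(static_image_mode=False, max_num_hands=2)
```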
    The proposed spatial and temporal data extraction methods facilitate data augmentation that simulates various hand sizes, gesture speeds, shooting angles, and so on, which greatly helps expand the dataset and increase its diversity. Experimental results demonstrate that our approach significantly improves recognition accuracy on commonly used American Sign Language datasets.
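    The augmentations named above could be approximated as in the sketch below; the specific transforms and parameter values are illustrative assumptions only.

```python
import cv2
import numpy as np

def augment_clip(frames, scale=1.1, speed=1.25, angle=8.0):
    """Illustrative augmentations (parameter values are arbitrary):
    scale ~ a larger/smaller hand, speed ~ a faster/slower gesture,
    angle ~ a slightly different shooting angle (in-plane rotation)."""
    h, w = frames[0].shape[:2]
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale)
    spatial = [cv2.warpAffine(f, M, (w, h)) for f in frames]
    # temporal resampling: sample frame indices at the new playback rate
    idx = np.arange(0, len(spatial), speed).astype(int)
    return [spatial[min(i, len(spatial) - 1)] for i in idx]
```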
    Appears in Collections: [Graduate Institute of Computer Science and Information Engineering] Master's and Doctoral Theses

    Files in This Item:

    File         Description    Size    Format    Views
    index.html                  0Kb     HTML      21        View/Open


    All items in NCUIR are protected by copyright, with all rights reserved.

