NCU Institutional Repository (中大機構典藏) - theses and dissertations, past exam questions, journal papers, and research projects: Item 987654321/95829


    Please use this permanent URL to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/95829


    Title: 基於窗注意力和信心融合的聽視覺語音辨識;Audio-Visual Speech Recognition using Window Attention and Confidence Mechanism
    Authors: 鍾程洋;Chung, Cheng-Yang
    Contributors: Department of Computer Science and Information Engineering
    Keywords: Audio-Visual Speech Recognition;Speech Processing;Multimodal Model
    Date: 2024-08-19
    Upload time: 2024-10-09 17:19:02 (UTC+8)
    Publisher: National Central University
    Abstract: The Cocktail Party Effect is a phenomenon in biopsychology whereby, in a noisy environment, the brain can selectively focus on sounds of interest while ignoring other background noise such as competing voices, air-conditioning hum, and car horns. This natural multimodal perceptual ability allows humans to recognize and understand specific speech in complex auditory environments. In today's era of rapid technological progress, multimodal speech recognition has become an indispensable part of human-computer interaction. Single-modal speech recognition systems face a series of challenges under certain conditions, such as noisy environments, varying speech rates, and the inability to recognize lip movements, and much recent research therefore focuses on multimodal audio-visual speech recognition. To overcome these challenges, this thesis, "Enhancing Noise Robustness in Audio-Visual Speech Recognition with Window Attention and Confidence Mechanisms," modifies an existing multimodal speech recognition architecture to improve how audio and visual information are fused and, through deep learning techniques, to increase the robustness of audio-visual speech recognition in high-noise environments, bringing a new perspective to lip-reading and speech recognition technology. We modify the attention mechanism so that the model also considers the noise level of the input modality features when computing attention scores, thereby producing more robust modality-specific feature representations.
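
    The abstract gives no implementation details, but as a rough illustration of the two ideas it names, attention scores that also account for an estimated noise level within a local window, and confidence-weighted fusion of the audio and visual streams, a minimal PyTorch sketch might look as follows. All module names, shapes, and the specific way the noise estimate enters the attention logits are assumptions for illustration, not the thesis implementation.

# A minimal sketch (assumptions, not the thesis implementation) of the ideas
# described in the abstract: (1) self-attention restricted to a local window
# whose scores are biased by a learned per-frame noise estimate, and
# (2) confidence-weighted fusion of audio and visual features.
import torch
import torch.nn as nn
import torch.nn.functional as F


class NoiseAwareWindowAttention(nn.Module):
    """Windowed scaled dot-product attention biased by a per-frame noise estimate."""

    def __init__(self, dim: int, window: int = 8):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.noise_proj = nn.Linear(dim, 1)  # crude learned per-frame noise score
        self.window = window

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim)
        q, k, v = self.q(x), self.k(x), self.v(x)
        scores = q @ k.transpose(-2, -1) / (x.size(-1) ** 0.5)        # (B, T, T)

        # Restrict attention to a local window of +/- self.window frames.
        t = x.size(1)
        idx = torch.arange(t, device=x.device)
        outside = (idx[None, :] - idx[:, None]).abs() > self.window   # (T, T)
        scores = scores.masked_fill(outside, float("-inf"))

        # Penalize attending to frames the model judges to be noisy.
        noise = torch.sigmoid(self.noise_proj(x)).transpose(-2, -1)   # (B, 1, T)
        scores = scores - noise
        return F.softmax(scores, dim=-1) @ v


class ConfidenceFusion(nn.Module):
    """Fuse audio and visual features with learned per-frame confidence weights."""

    def __init__(self, dim: int):
        super().__init__()
        self.conf = nn.Linear(2 * dim, 2)

    def forward(self, audio: torch.Tensor, visual: torch.Tensor) -> torch.Tensor:
        w = F.softmax(self.conf(torch.cat([audio, visual], dim=-1)), dim=-1)
        return w[..., :1] * audio + w[..., 1:] * visual


# Toy usage: two clips, 100 frames per stream, 256-dimensional features per frame.
audio, visual = torch.randn(2, 100, 256), torch.randn(2, 100, 256)
attn = NoiseAwareWindowAttention(256)
fused = ConfidenceFusion(256)(attn(audio), attn(visual))
print(fused.shape)  # torch.Size([2, 100, 256])

    Subtracting the noise score from the attention logits is only one plausible way to let a noise estimate down-weight unreliable frames; the thesis may realize the noise conditioning and the confidence fusion differently.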
    Appears in Collections: [Graduate Institute of Computer Science and Information Engineering] Theses & Dissertations

    Files in this item:

    File         Description    Size    Format    Views
    index.html   -              0Kb     HTML      41      View/Open


    All items in NCUIR are protected by the original copyright.

