NCU Institutional Repository (theses and dissertations, past exam papers, journal articles, and research projects available for download): Item 987654321/95763


    Please use this permanent URL to cite or link to this document: http://ir.lib.ncu.edu.tw/handle/987654321/95763


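    Title: OT-CFM Based Text to Speech Systems (基於最佳傳輸條件流匹配之語音合成系統)
    Author: Jin, Min-Xyu (金珉旭)
    Contributor: Department of Computer Science and Information Engineering
    Keywords: deep learning; speech synthesis; flow matching
    Date: 2024-08-08
    Upload time: 2024-10-09 17:15:26 (UTC+8)
    Publisher: National Central University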
    Abstract: Traditional speech synthesis relies mainly on statistical parametric or concatenative techniques. These methods depend on hand-crafted speech features and complex algorithms, and the resulting speech lacks naturalness and emotion, so synthesis quality is poor. Since deep learning took off in the 2010s, researchers have used deep neural networks (DNNs) to improve the quality of synthesized speech. Today, deep learning models and algorithms have completely replaced traditional synthesis methods and generate speech comparable to a real human voice. Current speech synthesis models still have two drawbacks: training and inference remain somewhat slow and time-consuming, and although natural, fluent speech is no longer difficult to generate, it often lacks emotional variation and sounds monotonous.

    This thesis builds a speech synthesis system on an optimal transport conditional flow matching (OT-CFM) generative model, which generates speech with high naturalness and high similarity while training and running inference efficiently. The system covers two tasks: multilingual speech synthesis and Chinese emotional speech synthesis. The multilingual system uses three datasets (Carolyn, JSUT, and Vietnamese Voice Dataset) to support Chinese, Japanese, and Vietnamese. The Chinese emotional system uses ESD-0001, a Chinese dataset with emotional styles, together with a pre-trained wav2vec emotional style extractor that extracts emotional features from the training speech, so that the model learns to transfer the emotional styles in the dataset to the generated speech.
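    The abstract names OT-CFM but does not spell out its training objective. As a rough illustration only, the sketch below follows the standard OT-CFM formulation from the flow matching literature: a straight-line path from a noise sample x0 to a data sample x1, with a constant target velocity. It is not the thesis's implementation. The function names, the value of `SIGMA_MIN`, and the three-element "frame" are hypothetical stand-ins, and a real system would use a neural network to predict the velocity from text and style conditioning.

```python
import random

SIGMA_MIN = 1e-4  # assumed small terminal noise scale (hypothetical value)


def ot_cfm_sample(x0, x1, t, sigma_min=SIGMA_MIN):
    """Return (x_t, u_t) on the OT conditional path from noise x0 to data x1."""
    scale = 1.0 - (1.0 - sigma_min) * t
    # Point on the straight-line path at time t in [0, 1].
    x_t = [scale * a + t * b for a, b in zip(x0, x1)]
    # Target velocity; it does not depend on t for this path.
    u_t = [b - (1.0 - sigma_min) * a for a, b in zip(x0, x1)]
    return x_t, u_t


def cfm_loss(v_pred, u_t):
    """Mean squared error between a predicted velocity and the target velocity."""
    return sum((p - u) ** 2 for p, u in zip(v_pred, u_t)) / len(u_t)


def euler_integrate(x0, velocity_fn, steps=10):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [a + dt * b for a, b in zip(x, v)]
    return x


rng = random.Random(0)
x1 = [0.5, -1.0, 2.0]                    # stand-in for one acoustic feature frame
x0 = [rng.gauss(0.0, 1.0) for _ in x1]   # Gaussian noise sample
x_t, u_t = ot_cfm_sample(x0, x1, t=0.3)

# A model that predicted the target exactly would have zero loss.
perfect_loss = cfm_loss(u_t, u_t)

# Following the true conditional velocity from x0 ends at x1 + sigma_min * x0.
x_end = euler_integrate(x0, lambda x, t: u_t, steps=10)
```

    Because the target velocity is constant along each path, a few Euler steps are enough to follow it, which is one common explanation for the efficient inference that flow matching models are reported to have.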
    Appears in collection: [Graduate Institute of Computer Science and Information Engineering] Theses and Dissertations

    Files in this item:

    File        Description  Size  Format  Views
    index.html               0Kb   HTML    60


    All items in NCUIR are protected by the original copyright.
