Spectrum and Prosody Transformation for Non-parallel Voice Conversion with Generative Attentional Networks (非平行語料庫基於生成注意力網路之語音轉換技術)


Please use this permanent URL to cite or link to this document: http://ir.lib.ncu.edu.tw/handle/987654321/86332

Title:	Spectrum and Prosody Transformation for Non-parallel Voice Conversion with Generative Attentional Networks (非平行語料庫基於生成注意力網路之語音轉換技術)
Author:	Chiu, Tse-Wei (邱則維)
Contributor:	Department of Communication Engineering
Keywords:	Voice conversion; Generative Adversarial Networks; Attention; Non-parallel data
Date:	2021-07-19
Upload time:	2021-12-07 12:34:03 (UTC+8)
Publisher:	National Central University
Abstract:	Voice conversion (VC) is a relatively complex technique whose goal is to convert the timbre and pitch of a source speaker while preserving the linguistic content, so that the output sounds as if it were spoken by the target speaker. This thesis uses a non-parallel corpus as training data and proposes a Cycle Generative Adversarial Network (Cycle-GAN) with an attention mechanism for voice conversion. During conversion, the model gives more weight to the features that differ between speakers, so that the conversion focuses on the regions that differ and preserves the segments that are similar. We add attention modules to the architecture and introduce a new loss function for updating the network. Because GAN training is often unstable, we modify the discriminator loss so that real samples and generated samples receive different weights. These methods are applied to the spectral envelope (timbre); we also try converting the fundamental frequency (pitch) with a GAN and compare the results with the original conversion method. Experimental results show that the proposed voice conversion architecture outperforms the baseline system in both Mel-cepstral distortion (MCD) and mean opinion score (MOS).
Appears in collections:	[Graduate Institute of Communication Engineering] Theses and Dissertations
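
The abstract names two quantities that are easy to make concrete: the Mel-cepstral distortion used for objective evaluation, and a discriminator loss in which real and generated samples are weighted differently. The sketch below is illustrative only and is not taken from the thesis. The MCD function follows the standard definition and assumes the frames are already time-aligned and that coefficient 0 (energy) is excluded. The discriminator loss assumes a least-squares GAN objective, and the names `w_real` and `w_fake` and their default values are hypothetical; the record does not state the loss form or the weights actually used.

```python
import math


def mel_cepstral_distortion(ref_frames, conv_frames):
    """Mean Mel-cepstral distortion in dB over time-aligned frames.

    Each frame is a sequence of mel-cepstral coefficients. Index 0 (energy)
    is skipped, which is a common convention and an assumption here.
    """
    if len(ref_frames) != len(conv_frames) or not ref_frames:
        raise ValueError("need the same non-zero number of aligned frames")
    scale = 10.0 / math.log(10.0)
    total = 0.0
    for ref, conv in zip(ref_frames, conv_frames):
        squared_diff = sum((r - c) ** 2 for r, c in zip(ref[1:], conv[1:]))
        total += scale * math.sqrt(2.0 * squared_diff)
    return total / len(ref_frames)


def weighted_discriminator_loss(real_scores, fake_scores, w_real=1.0, w_fake=0.5):
    """Least-squares discriminator loss with separate weights per term.

    Assumption: real samples are pushed toward a score of 1 and generated
    samples toward 0. w_real and w_fake are hypothetical placeholders.
    """
    real_term = sum((s - 1.0) ** 2 for s in real_scores) / len(real_scores)
    fake_term = sum(s ** 2 for s in fake_scores) / len(fake_scores)
    return w_real * real_term + w_fake * fake_term


if __name__ == "__main__":
    target = [[0.0, 1.0, 0.5], [0.0, 0.8, 0.4]]
    converted = [[0.0, 0.9, 0.5], [0.0, 0.8, 0.3]]
    print("MCD (dB):", mel_cepstral_distortion(target, converted))
    print("D loss:", weighted_discriminator_loss([0.9, 0.8], [0.2, 0.1]))
```

A lower MCD means the converted cepstra are closer to the target speaker's. In the loss sketch, choosing `w_fake` smaller than `w_real` reduces how strongly the discriminator is penalised on generated samples, which is one plausible way to damp unstable training.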

Files in this item:

File	Description	Size	Format	Views
index.html		0Kb	HTML	105

All items in NCUIR are protected by the original copyright.
