Master's/Doctoral Thesis 108523080 — Detailed Record




Name: Yen-Ting Lai (賴彥廷)    Department: Department of Communication Engineering
Thesis Title: Target Speaker Extraction System Based on Temporal Convolutional Network with Attention Enhancement
(基於時序卷積網路與注意力增強之目標語者語音萃取系統)
Related Theses
★ Satellite Image Super-Resolution Based on Regional Weighting
★ Adaptive High Dynamic Range Image Fusion Extending the Linear Characteristics of Exposure Curves
★ Complexity Control of H.264 Video Coding on RISC Architectures
★ Articulation Disorder Assessment Based on Convolutional Recurrent Neural Networks
★ Few-Shot Image Segmentation with Mask Generation via a Meta-Learning Classification-Weight Transfer Network
★ Implicit Representation with Attention for Image-Based Reconstruction of 3D Human Models
★ Object Detection Using Adversarial Graph Neural Networks
★ 3D Face Reconstruction Based on Weakly Supervised Learning of Deformable Models
★ Low-Latency In-Song Voice Conversion on Edge Devices via Unsupervised Representation Disentanglement Learning
★ Human Pose Estimation with FMCW Radar Based on Sequence-to-Sequence Models
★ Monocular Semantic Scene Completion Based on Multi-Level Attention
★ Contactless Real-Time Vital Sign Monitoring with a Single FMCW Radar Based on Temporal Convolutional Networks
★ Video Traffic Description and Management over Video-on-Demand Networks
★ High-Quality Voice Conversion Based on Linear Predictive Coding and Pitch-Synchronous Frame Processing
★ Tone Adjustment by Extracting Formant Variation through Speech Resampling
★ Optimizing Transmission Efficiency of Real-Time Fine-Granularity Scalable Video over Wireless LANs
Files: Full text available for browsing in the system after 2027-01-01 (embargoed)
Abstract (Chinese) Conversation is an important way for humans to communicate and socialize, so when we record such sounds, we frequently encounter situations with multiple people talking at once or with other voices in the background. The human auditory system can still pick out the voice it wants to hear without being hindered by the interference, but for an automatic speech recognition system, multi-talker and interference conditions sharply reduce recognition accuracy. Preprocessing such as speech separation or target speech extraction is therefore needed to isolate a single person's voice for subsequent recognition, and thanks to the development of deep learning, the quality of the separated speech has improved significantly.
The goal of this thesis is target speaker extraction. Achieving it requires a main system that extracts the speech and an auxiliary subsystem that supplies the main system with target information. Our proposed method uses a model built around a Temporal Convolutional Network (TCN) to extract the speech. The TCN gives a Convolutional Neural Network (CNN) the ability to model temporal sequences, and its architecture can be adjusted to any length, offering a high degree of freedom. Paired with an attention-enhanced subsystem that supplies sufficient and effective target information, the model can estimate the mask more accurately, thereby improving speech quality and achieving effective target speaker extraction.
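To make the TCN structure concrete, the following is a minimal PyTorch sketch of one dilated temporal convolution block in the Conv-TasNet style that TCN-based extractors commonly build on; the channel sizes, normalization choices, and names here are illustrative assumptions rather than the thesis's exact configuration.

```python
# Minimal sketch of a dilated TCN block (Conv-TasNet style); all sizes
# and names are illustrative assumptions, not the thesis's exact setup.
import torch
import torch.nn as nn

class TCNBlock(nn.Module):
    def __init__(self, channels=256, hidden=512, kernel_size=3, dilation=1):
        super().__init__()
        pad = (kernel_size - 1) * dilation // 2          # preserve sequence length
        self.net = nn.Sequential(
            nn.Conv1d(channels, hidden, 1),              # 1x1 bottleneck expansion
            nn.PReLU(),
            nn.GroupNorm(1, hidden),                     # layer norm over channels
            nn.Conv1d(hidden, hidden, kernel_size,       # dilated depthwise conv:
                      dilation=dilation, padding=pad,    # grows the receptive field
                      groups=hidden),
            nn.PReLU(),
            nn.GroupNorm(1, hidden),
            nn.Conv1d(hidden, channels, 1),              # 1x1 projection back
        )

    def forward(self, x):                                # x: (batch, channels, time)
        return x + self.net(x)                           # residual connection

# Stacking blocks with exponentially growing dilation is what lets a purely
# convolutional model cover long temporal context, at any input length.
tcn = nn.Sequential(*[TCNBlock(dilation=2 ** d) for d in range(8)])
mixture_features = torch.randn(1, 256, 1000)             # encoded mixture
print(tcn(mixture_features).shape)                       # torch.Size([1, 256, 1000])
```

Because the dilation doubles at every block, eight blocks already cover several hundred frames of context, which is how the CNN acquires the sequence-modeling ability the abstract refers to.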
Abstract (English) Talking with other people is an important form of human communication and social interaction. When we record the sounds of a conversation, we may face a multi-talker situation. The human hearing system has the ability to pick out the sound we want to focus on, but for an Automatic Speech Recognition (ASR) system, the interference may seriously reduce accuracy. It is necessary to apply a preprocessing mechanism (e.g., speech separation or target speaker extraction) before the ASR system in order to separate each talker's speech. In recent years, the quality of separated speech has benefited from the development of deep learning.
In this thesis, our goal is target speaker extraction. We need a main system for speech extraction and an auxiliary subsystem that provides target information. We adopt a Temporal Convolutional Network (TCN) architecture as the speech extraction model. The TCN enables a Convolutional Neural Network (CNN) to handle time-series modeling and can be constructed with different model lengths. We introduce attention enhancement into the auxiliary subsystem, which allows it to provide rich and effective target information to the speech extraction model and helps the model estimate the mask better. With a better mask, the quality of target speaker extraction can be improved.
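As one way to picture the attention enhancement described above, the hypothetical module below weights the encoded mixture features by their similarity to the target speaker embedding before estimating the mask; the actual fusion in the thesis may differ, and every name here is an assumption for illustration.

```python
# Hypothetical attention-based fusion of a target speaker embedding with
# mixture features for mask estimation; illustrative sketch only.
import torch
import torch.nn as nn

class AttentiveFusion(nn.Module):
    def __init__(self, feat_dim=256, spk_dim=256):
        super().__init__()
        self.query = nn.Linear(spk_dim, feat_dim)        # project speaker embedding
        self.mask_net = nn.Conv1d(feat_dim, feat_dim, 1) # 1x1 mask estimator

    def forward(self, feats, spk_emb):
        # feats: (batch, feat_dim, time); spk_emb: (batch, spk_dim)
        q = self.query(spk_emb).unsqueeze(2)             # (batch, feat_dim, 1)
        scores = (feats * q).sum(1, keepdim=True)        # dot-product attention
        attn = torch.sigmoid(scores / feats.size(1) ** 0.5)
        mask = torch.relu(self.mask_net(feats * attn))   # target-speaker mask
        return feats * mask                              # masked mixture features

fusion = AttentiveFusion()
out = fusion(torch.randn(2, 256, 500), torch.randn(2, 256))
print(out.shape)                                         # torch.Size([2, 256, 500])
```

The point of the attention step is that frames resembling the target speaker are amplified before masking, so the extractor receives target information already aligned with the mixture, which is what allows a better mask estimate.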
Keywords (Chinese) ★ deep learning
★ target speaker extraction
★ temporal convolutional network
Keywords (English) ★ deep learning
★ target speaker extraction
★ Temporal Convolutional Network
Thesis Contents Table of Contents
Abstract (Chinese) i
Abstract (English) ii
Acknowledgments iii
Table of Contents v
List of Figures vii
List of Tables ix
Chapter 1 Introduction 1
1-1 Research Background 1
1-2 Research Motivation and Objectives 3
1-3 Thesis Organization 5
Chapter 2 Introduction to Deep Learning 6
2-1 Convolutional Neural Networks 7
2-1-1 Convolutional Layers 8
2-1-2 Batch Normalization and Activation Functions 9
2-1-3 Pooling Layers 15
2-1-4 Fully Connected Layers 16
2-2 Recurrent Neural Networks 17
2-2-1 Introduction to Recurrent Neural Networks 18
2-2-2 Long Short-Term Memory Networks 20
Chapter 3 Introduction to Target Speech Extraction 23
3-1 Time-Frequency Masking Methods for Separation 23
3-2 Deep-Learning-Based Speech Separation 25
3-2-1 Time-Frequency Domain Models 25
3-2-2 Time Domain Models 27
3-2-3 Discussion of Technical Challenges 29
Chapter 4 Proposed Architecture 32
4-1 System Architecture 32
4-2 Model Training Stage 33
4-2-1 Speech Dataset Preprocessing 34
4-2-2 Speech Encoder 35
4-2-3 Speaker Encoder 37
4-2-4 Speaker Extractor 42
4-2-5 Speech Decoder 46
4-2-6 Loss Function 47
4-3 Testing Stage 50
Chapter 5 Experimental Results and Discussion 51
5-1 Experimental Environment 51
5-2 Comparison and Discussion of Experimental Results 52
5-2-1 Comparison of Attention Enhancement Computed from Different Sources 52
5-2-2 Comparison of the Proposed System with Other Methods 53
5-2-3 Comparison of Latency Reduction Methods 55
5-2-4 Comparison under Different Numbers of Speakers 56
Chapter 6 Conclusion and Future Work 58
References 59
Advisor: Pao-Chi Chang (張寶基)    Date of Approval: 2022-01-19