基於聲音驅動的End to end即時面部模型合成系統;Audio-driven End to End real-time facial model synthesis system

Please use this identifier to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/90054

Title:	基於聲音驅動的End to end即時面部模型合成系統;Audio-driven End to End real-time facial model synthesis system
Authors:	胡峻愷;Hu, Jyun-Kai
Contributors:	Department of Computer Science and Information Engineering
Keywords:	Sequence to Sequence (Seq2Seq) model;Lightweight Transformer;face synthesis
Date:	2022-09-26
Issue Date:	2022-10-04 12:09:30 (UTC+8)
Publisher:	National Central University
Abstract:	As emerging technologies, VR and AR have important applications in education, entertainment, and scenario simulation. VR can provide an experience comparable to a real physical environment, which makes it a useful tool for simulating medical surgery and military training, and even for guided visualization in psychological counseling. Manufacturing, construction, and tourism can also be transformed by VR and AR. For example, VR makes remote factory monitoring and guided tours of tourist attractions straightforward, and it can be applied to building information modeling for design simulation, collaborative editing, and cost estimation in construction projects. AR overlays virtual objects on real scenes and can display equipment operation and maintenance SOPs, in-space piping diagrams, directional guidance, and item history, which greatly benefits production operations, equipment maintenance, firefighting and rescue, and tourist guidance.
In the virtual world, facial expressions are an extremely important element. Of all the external information people process, faces take up a considerable share of the brain's capacity, and the human brain even has a dedicated region for processing facial expressions in visual signals. If a virtual character's face is not rendered realistically enough, the user's sense of immersion drops easily and VR/AR fails to deliver its intended effect. Investing resources in realistic facial models for virtual characters is therefore well justified. Existing facial capture technology can use image data together with various sensors to reconstruct a person's face in the virtual world. This technology is mature, produces models lifelike enough to pass for the real thing, and is widely used in animation, games, and film. However, the capture equipment is expensive. In many scenarios such resources are not available, and the data that can be transmitted is even more limited. In those scenarios, a technique that uses deep learning to analyze the spoken content and the corresponding emotion in audio, and then reconstructs and synthesizes the mesh of facial movements the virtual character should have, becomes useful. Building on a previously proposed real-time facial model synthesis system, this thesis uses a lightweight Transformer model to infer the speaker's mouth shape from the speech signal in real time while consuming fewer resources. It also infers the emotion implied by the tone of voice and adjusts the shape of other parts of the facial model, such as the eyebrows, eyes, and cheeks.
Appears in Collections:	[Graduate Institute of Computer Science and Information Engineering] Electronic Thesis & Dissertation

