基於多尺度特徵與控制網路的潛在擴散模型達到姿態轉換任務;Pose Transfer with Multi-Scale Features Combined with Latent Diffusion Model and ControlNet


Please use this identifier to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/95508

Title:	基於多尺度特徵與控制網路的潛在擴散模型達到姿態轉換任務;Pose Transfer with Multi-Scale Features Combined with Latent Diffusion Model and ControlNet
Authors:	蘇嘉成;Cheng, Su Chia
Contributors:	Department of Computer Science and Information Engineering
Keywords:	Diffusion Models;Pose Transfer;OpenPose;Image Generation
Date:	2024-07-18
Issue Date:	2024-10-09 16:54:44 (UTC+8)
Publisher:	National Central University
Abstract:	In recent years, the strong performance of generative AI has attracted wide research interest across natural language processing, image, and audio. In image generation in particular, Diffusion Models have achieved remarkable results in applications such as text-to-image and image-to-image generation. Building on this, the present study proposes a new architecture that enables Diffusion Models to perform well on the pose transfer task, requiring only a reference image and a human skeleton map to produce precise pose transfer results. However, conventional Diffusion Models operate at the pixel level when learning image features, which typically demands substantial computational resources. Merely validating a model's feasibility and testing its performance can take several days, which is a major obstacle for research groups with limited resources. To address this bottleneck, this thesis combines the Latent Diffusion Model, ControlNet, and a multi-scale feature extraction module, and adds a semantic extraction filter to the attention layers. This allows the model to focus on learning the relationship between the most important image features and the pose while also reducing computational cost, so that the model can be trained effectively on an RTX 4090. Experimental results show that, under constrained hardware cost, the proposed model is competitive with other Diffusion Model-based models, significantly improving pose transfer accuracy and effectively reducing the time required for training and image generation.
Appears in Collections:	[Graduate Institute of Computer Science and Information Engineering] Electronic Thesis & Dissertation

Files in This Item:

File	Description	Size	Format
index.html		0Kb	HTML	24	View/Open