Subproject 5: Integrated Video and Audio Event Detection and Description Based on Deep Learning Techniques

Please use this permanent URL to cite or link to this document: http://ir.lib.ncu.edu.tw/handle/987654321/78623

Title:	Subproject 5: Integrated Video and Audio Event Detection and Description Based on Deep Learning Techniques
Author:	張寶基
Contributor:	Department of Communication Engineering, National Central University
Keywords:	multimedia description; action recognition; acoustic scene classification; acoustic event detection; deep learning
Date:	2018-12-19
Upload time:	2018-12-20 12:08:05 (UTC+8)
Publisher:	Ministry of Science and Technology (MOST)
Abstract:	With the development of multimedia surveillance technology, making it faster and more convenient for people to retrieve audio and video information, such as the events and actions in a video, the shooting environment, and nearby objects, has become an important research and application topic. Content-based audio/video retrieval saves the cost of manual labeling and automatically extracts the main features of the audio and video. These features are then compared by parameter similarity, and a description of the video's content is returned from the audio/video database. Deep learning is a powerful machine learning technology: given large amounts of data and computation, it can learn in a way that is closer to the complex neural processing of the human brain. This benefits the development of environmental awareness and action recognition for audio and video, and can be applied to surveillance systems in the future. This project will therefore use deep learning to learn image and audio features separately, compare the audio and video content, and return a description of what the video depicts. Over the past three years, our team has carried out the MOST integrated project "Intelligent Audio-visual Content Analysis, Authoring, and Recommendation" and has gained an in-depth understanding of deep learning methods for audio and image retrieval. This project is also planned to run for three years.
In the first year, we will apply deep learning to raw audio and video data and analyze the feature bases learned at each layer in order to understand the elements from which audio and video are formed. We will then classify acoustic scenes in the audio signal and recognize actions in the video signal. In the second year, building on the scenes recognized earlier, we will detect acoustic events at different points in time in the audio. For video, we will improve the action recognition architecture and use the video classification model to begin work on video description (captioning). In the final year, our goal is to integrate and compare the results of acoustic event detection and video description so that the machine can describe the content of the video.
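Relation:	Science and Technology Policy Research and Information Center, National Applied Research Laboratories
Appears in Collections:	[Department of Communication Engineering] Research Projects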

Files in This Item:

File	Description	Size	Format	Views
index.html		0Kb	HTML	311

All items in NCUIR are protected by original copyright.
