Graduate Thesis 110523001: Detailed Record




Name: Cheng-Yu Chang (張鎮宇)    Department: Department of Communication Engineering
Thesis Title: Monocular-camera semantic scene completion based on multi-level attention mechanisms
(Official English title: Monoscene camera semantic scene completion technique based on multi-level attention mechanisms)
Related Theses
★ A low-latency on-device vocal conversion architecture for music using unsupervised disentangled representation learning
★ FMCW-radar human pose estimation based on a sequence-to-sequence model
Full text: not open to the public (access permanently restricted)
Abstract (Chinese) In today's era of rapid technological development, continual breakthroughs in hardware have driven steady progress in artificial-intelligence research. Many studies are gradually extending from the two-dimensional plane into three-dimensional space, in fields such as the self-driving-car industry, the entertainment and film industry, 3D human-body modeling, and medical aesthetics.

People can readily judge the three-dimensional distances of a scene shown in a two-dimensional image, but in computer vision, inferring a 3D scene from a single 2D image has long been a problem of great interest: humans quickly recognize objects in an image and estimate their positions well, whereas machines today acquire 3D spatial information almost exclusively with LiDAR or depth cameras. Although such sensors temporarily compensate for the lack of 3D spatial information, they are usually more expensive and require additional inputs. Estimating the 3D scene with a purely visual method, together with semantic segmentation and completion, therefore promises a better and faster solution to scene understanding.

At the same time, voxel-based 3D scenes consume a large amount of memory during both training and deployment, so improving performance and reconstructing the 3D scene under limited resources is also an important part of semantic scene completion.

This study reconstructs a 3D scene from a single RGB image and performs semantic scene completion. An attention mechanism is added to the model to study how features of different scales at different levels affect semantic scene completion, improving the quality of the semantic-scene-completion model, reducing training time, and analyzing the trade-off between memory usage and model performance. The study achieves outstanding results on objective metrics (IoU, mIoU).
Abstract (English) In today's era of rapid technological development, continual hardware breakthroughs keep advancing research on artificial intelligence. More and more studies are gradually expanding from the two-dimensional plane into three-dimensional space, encompassing fields such as the self-driving-car industry, the entertainment and film industry, three-dimensional human modeling, and medical aesthetics.

While people can readily judge the three-dimensional distances of a scene from a two-dimensional image, estimating a three-dimensional scene from a single two-dimensional image has long been a problem of great interest in computer vision: people quickly identify objects in images and accurately estimate their locations, but machines cannot. As a result, the acquisition of three-dimensional spatial information nowadays relies heavily on LiDAR or depth cameras. Although these devices temporarily compensate for the lack of 3D spatial information, they are typically expensive and require additional inputs. Therefore, a purely visual approach that estimates the 3D scene and performs semantic segmentation and completion can address the challenges of scene understanding better and faster.

At the same time, training and deploying models on three-dimensional voxel scenes requires significant memory resources. Consequently, improving performance and reconstructing three-dimensional scenes under limited resources are crucial aspects of semantic scene completion.

This study reconstructs 3D scenes from a single RGB image and performs semantic scene completion. An attention mechanism is added to the model to exploit the importance of features at different levels during training, improving the quality of the semantic-scene-completion model, reducing training time, and examining the trade-off between memory usage and model performance. The study shows outstanding performance on objective metrics (IoU, mIoU).
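The abstract reports results on IoU and mIoU. For reference, here is a minimal NumPy sketch of how these metrics are commonly computed over voxel label grids. Details such as whether the empty class counts toward the mean vary by benchmark (e.g. SemanticKITTI excludes it); the toy grids and the averaging convention below are illustrative, not the thesis's exact protocol.

```python
import numpy as np

def iou_per_class(pred, gt, num_classes):
    """Per-class intersection-over-union for integer voxel label grids.

    pred, gt: integer label arrays of the same shape.
    Classes absent from both pred and gt yield NaN so they can be
    excluded from the mean.
    """
    ious = np.full(num_classes, np.nan)
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c).sum()
        union = np.logical_or(pred == c, gt == c).sum()
        if union > 0:
            ious[c] = inter / union
    return ious

def mean_iou(pred, gt, num_classes):
    """mIoU: mean of per-class IoU over classes present in pred or gt."""
    return np.nanmean(iou_per_class(pred, gt, num_classes))

# Toy 2x2x2 voxel grids with 3 semantic classes (hypothetical labels).
gt   = np.array([[[0, 1], [1, 2]], [[0, 1], [2, 2]]])
pred = np.array([[[0, 1], [1, 1]], [[0, 1], [2, 2]]])
```

On these toy grids, class 0 is predicted perfectly (IoU 1.0), class 1 has 3 correct voxels out of a union of 4 (IoU 0.75), and class 2 has 2 out of 3 (IoU 2/3).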
Keywords (Chinese) ★ semantic scene completion (語意場景補全)
★ attention mechanism (注意力機制)
★ deep learning (深度學習)
★ semantic segmentation (語意分割)
Keywords (English) ★ semantic scene completion
★ attention mechanism
★ deep learning
★ semantic segmentation
Table of Contents Abstract (Chinese) VI
Abstract (English) VII
Acknowledgements IX
Table of Contents X
List of Figures XI
List of Tables XIII
Chapter 1 Introduction 1
1-1 Research Background 1
1-2 Motivation and Objectives 2
1-3 Thesis Organization 2
Chapter 2 Overview of Semantic Scene Completion 3
2-1 Semantic Scene Completion Overview 4
2-1-1 Voxels 4
2-1-2 Point Clouds 5
2-1-3 Implicit Surfaces 6
2-1-4 Polygon Meshes 6
2-2 Semantic Scene Completion Techniques 7
2-2-1 Input Encoding 7
Chapter 3 Background on Image Segmentation and Enhancement 13
3-1 Introduction to U-Net 13
3-2 Introduction to Attention Mechanisms 16
Chapter 4 Experimental Architecture and Design 19
4-1 System Architecture 19
4-2 Model Training Stage 21
4-2-1 Attention in U-Net 22
4-2-2 Neural Network Architecture 23
4-2-3 Loss Functions 25
Chapter 5 Experimental Results and Analysis 27
5-1 Experimental Environment and Settings 27
5-2 Datasets 28
5-3 Evaluation Methods 29
5-4 Results and Analysis 31
Chapter 6 Conclusions and Future Work 40
References 42
References [1] S. Song, F. Yu, A. Zeng, A. X. Chang, M. Savva, and T. Funkhouser, “Semantic scene completion from a single depth image,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1746–1754, 2017.
[2] B. Graham, M. Engelcke, and L. Van Der Maaten, “3d semantic segmentation with submanifold sparse convolutional networks,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 9224–9232, 2018.
[3] C. Choy, J. Gwak, and S. Savarese, “4d spatio-temporal convnets: Minkowski convolutional neural networks,” in Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 3075–3084, 2019.
[4] L.-C. Chen, G. Papandreou, I. Kokkinos, K. Murphy, and A. L. Yuille, “Deeplab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected crfs,” IEEE transactions on pattern analysis and machine intelligence, vol. 40, no. 4, pp. 834–848, 2017.
[5] O. Ronneberger, P. Fischer, and T. Brox, “U-net: Convolutional networks for biomedical image segmentation,” in Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III 18, pp. 234–241, Springer, 2015.
[6] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 3431–3440, 2015.
[7] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
[8] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in neural information processing systems, vol. 30, 2017.
[9] O. Oktay, J. Schlemper, L. L. Folgoc, M. Lee, M. Heinrich, K. Misawa, K. Mori, S. McDonagh, N. Y. Hammerla, B. Kainz, et al., “Attention u-net: Learning where to look for the pancreas,” arXiv preprint arXiv:1804.03999, 2018.
[10] M. Tan and Q. Le, “Efficientnet: Rethinking model scaling for convolutional neural networks,” in International conference on machine learning, pp. 6105–6114, PMLR, 2019.
[11] X. Chen, K.-Y. Lin, C. Qian, G. Zeng, and H. Li, “3d sketch-aware semantic scene completion via semi-supervised structure prior,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4193–4202, 2020.
[12] J. Li, Y. Liu, D. Gong, Q. Shi, X. Yuan, C. Zhao, and I. Reid, “Rgbd based dimensional decomposition residual network for 3d semantic scene completion,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7693–7702, 2019.
[13] A.-Q. Cao and R. de Charette, “Monoscene: Monocular 3d semantic scene completion,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 3991–4001, 2022.
[14] L. Roldao, R. De Charette, and A. Verroust-Blondet, “3d semantic scene completion: A survey,” International Journal of Computer Vision, vol. 130, no. 8, pp. 1978–2005, 2022.
[15] J. Behley, M. Garbade, A. Milioto, J. Quenzel, S. Behnke, C. Stachniss, and J. Gall, “Semantickitti: A dataset for semantic scene understanding of lidar sequences,” in Proceedings of the IEEE/CVF international conference on computer vision, pp. 9297–9307, 2019.
Advisors: Pao-Chi Chang (張寶基), Yung-Fang Chen (陳永芳)    Date of Approval: 2023-08-15

For questions about this thesis, please contact the Promotion Services Division, National Central University Library, TEL: (03)422-7151 ext. 57407, or by e-mail.