Thesis 108523028 — Record Details




Name: Chih-En Yeh (葉志恩)   Department: Department of Communication Engineering
Thesis Title: Attention Mechanism Based Siamese Networks for Visual Tracking
(基於注意力機制的孿生網路之視覺追蹤)
Related Theses
★ Design of an Illumination-Adaptive Video Encoder for In-Vehicle Video
★ An Improved Particle-Filter-Based Head Tracking System
★ Fast Mode Decision Algorithms for Spatial and CGS Scalable Video Encoders
★ A Robust Active Appearance Model Search Algorithm for Facial Expression Recognition
★ Multi-View Video Coding Combining Epipolar-Geometry-Based Inter-View Prediction with Fast Inter-Frame Prediction Direction Decision
★ A Stereo Matching Algorithm Based on Improved Belief Propagation over Homogeneous Regions
★ Baseball Trajectory Recognition Based on Hierarchical Boosting
★ Fast Reference Frame Direction Decision for Multi-View Video Coding
★ Fast Mode Decision for CGS Scalable Encoders Based on Online Statistics
★ An Improved Active Shape Model Matching Algorithm for Lip-Shape Recognition
★ Object Tracking on Mobile Platforms Based on Motion Compensation Models
★ Occlusion Detection for Asymmetric Stereo Matching Based on Matching Cost
★ Momentum-Based Fast Mode Decision for Multi-View Video Coding
★ A Fast Local L-SVMs Ensemble Classifier for Place Image Recognition
★ Fast Depth Video Coding Mode Decision Oriented Toward High-Quality Synthesized Views
★ Multi-Object Tracking with a Moving Camera Based on Motion Compensation Models
Files: full text viewable in the system after 2024-8-1
Abstract (Chinese) In recent years, most Siamese-network-based tracking schemes have used cross-correlation to compute the similarity between the target template and each region of the search frame, with a classification network and a regression network predicting the target's location and the coordinates of its bounding box, respectively. However, the score map produced by cross-correlation can only roughly indicate where the target is and cannot precisely reflect its main semantic features, and the lack of a communication mechanism between the classification and regression networks means the classification results cannot correctly reflect the accuracy of the predicted bounding boxes. This thesis therefore proposes a feature-enhancement module based on an attention mechanism and residual connections, applied to Siamese-network-based object trackers in both unidirectional and bi-directional forms. The unidirectional module replaces the cross-correlation operation, allowing the tracker to predict bounding boxes more accurately from semantically rich features; the bi-directional module mutually aggregates and enhances the feature embeddings produced by the classification and regression networks, so that the two can exchange information and, during training, indirectly assist each other's learning through their respective loss functions. Evaluated on the large-scale tracking benchmarks GOT-10k and LaSOT, the proposed tracker balances accuracy and tracking speed (67 FPS) on both short-term and long-term tracking relative to state-of-the-art schemes.
Abstract (English) In recent years, cross-correlation has been used in most Siamese-based trackers to measure the similarity between a target template and a search region, where a classification network and a regression network are adopted for target localization and bounding box prediction, respectively. However, the score map generated by cross-correlation can only approximate the target location, failing to represent the semantic information of the target, and the lack of a communication mechanism between the classification and regression networks results in misalignment between the classification results and the precision of the predicted bounding boxes. Thus, this thesis proposes an attention-mechanism-based module with residual connections for unidirectional and bi-directional feature enhancement in Siamese-based trackers. The unidirectional module replaces cross-correlation, enabling the tracker to predict more precise bounding boxes from semantically informative features. The bi-directional module reciprocally aggregates and enhances the feature embeddings generated by the classification and regression networks, so that the two networks can exchange information and be optimized indirectly by each other's loss functions during the training phase. Experimental results on benchmarks including GOT-10k and LaSOT show that the proposed scheme strikes a balance between tracking accuracy and speed (67 FPS) compared to state-of-the-art trackers on both short-term and long-term tracking.
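The cross-correlation step described in the abstract can be illustrated with a minimal NumPy sketch (an illustration only, not the thesis's implementation): a template feature map slides over a search-region feature map, and the inner product at each offset yields a score map whose peak approximates the target location.

```python
import numpy as np

def cross_correlation_score_map(search, template):
    """search: (C, Hs, Ws), template: (C, Ht, Wt) -> (Hs-Ht+1, Ws-Wt+1)."""
    _, hs, ws = search.shape
    _, ht, wt = template.shape
    scores = np.empty((hs - ht + 1, ws - wt + 1))
    for y in range(scores.shape[0]):
        for x in range(scores.shape[1]):
            window = search[:, y:y + ht, x:x + wt]
            scores[y, x] = np.sum(window * template)  # similarity at this offset
    return scores

# Toy check: plant the template inside an otherwise-empty search region;
# the score map peaks at the planted offset.
rng = np.random.default_rng(0)
template = rng.standard_normal((4, 3, 3))
search = np.zeros((4, 8, 8))
search[:, 2:5, 4:7] = template
scores = cross_correlation_score_map(search, template)
peak = np.unravel_index(np.argmax(scores), scores.shape)
print(peak)  # (2, 4): the offset where the template was planted
```

As the abstract notes, such a score map localizes the target but carries no per-channel semantic detail, which is the limitation the proposed attention module addresses.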
Keywords (Chinese) ★ Visual tracking
★ Siamese networks
★ Attention mechanism
★ Feature aggregation
Keywords (English) ★ Visual tracking
★ Siamese networks
★ attention mechanism
★ feature aggregation
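The attention-based feature enhancement and aggregation named in the keywords can be sketched as scaled dot-product attention with a residual connection. This is a hypothetical NumPy illustration of the general mechanism, not the thesis's actual module: tokens from one branch's feature map attend into the other branch's feature map, and the result is added back to the input.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - np.max(x, axis=axis, keepdims=True))
    return e / np.sum(e, axis=axis, keepdims=True)

def attention(q, k, v):
    """Scaled dot-product attention. q: (Nq, d), k, v: (Nk, d) -> (Nq, d)."""
    weights = softmax(q @ k.T / np.sqrt(q.shape[-1]))  # rows sum to 1
    return weights @ v

def enhance(features, context):
    """Residual attention: flatten (C, H, W) maps into (H*W, C) tokens,
    attend from `features` into `context`, and add the result back."""
    c, h, w = features.shape
    q = features.reshape(c, h * w).T
    kv = context.reshape(c, -1).T
    out = q + attention(q, kv, kv)  # residual connection
    return out.T.reshape(c, h, w)

rng = np.random.default_rng(1)
cls_feat = rng.standard_normal((4, 5, 5))  # classification-branch features
reg_feat = rng.standard_normal((4, 5, 5))  # regression-branch features
enhanced = enhance(cls_feat, reg_feat)     # bi-directional use would also apply enhance(reg_feat, cls_feat)
print(enhanced.shape)  # (4, 5, 5)
```

Applying the module in both directions is what lets the two branches exchange information, as the abstracts describe.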
Table of Contents
Abstract (Chinese)
Abstract (English)
Table of Contents
List of Figures
List of Tables
Chapter 1  Introduction
1.1 Overview
1.2 Motivation
1.3 Methodology
1.4 Thesis Organization
Chapter 2  Siamese-Network-Based Visual Tracking
2.1 Introduction to Siamese Networks
2.2 State of the Art in Siamese-Network-Based Visual Tracking
2.3 Summary
Chapter 3  Attention-Based Visual Tracking
3.1 Introduction to Attention Mechanisms
3.2 State of the Art in Attention-Based Visual Tracking
3.3 Summary
Chapter 4  The Proposed Attention-Based Object Tracker
4.1 System Architecture
4.2 The Proposed Object Tracking Scheme
4.2.1 The Proposed Feature Enhancement Method Based on a Fully Convolutional Residual Attention Module
4.2.2 The Proposed Dual-Task Fusion Network Based on a Bi-Directional Attention Mechanism
4.2.3 Overview of the Adopted Backbone Network
4.3 Training Phase
4.3.1 Data Preprocessing
4.3.2 Loss Functions
4.4 Summary
Chapter 5  Experimental Results and Discussion
5.1 Experimental Settings
5.1.1 Training Settings
5.1.2 Testing Settings
5.2 Tracking Results
5.2.1 Accuracy Comparison on GOT-10k
5.2.2 Accuracy Comparison on LaSOT
5.2.3 Time Complexity Analysis
5.3 Ablation Study
5.4 Summary
Chapter 6  Conclusion and Future Work
References
List of Symbols
References
[1] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. S. Torr, “Fully-convolutional Siamese networks for object tracking,” in Proc. European Conference on Computer Vision, pp. 850-865, Sept. 2016.
[2] B. Li, J. Yan, W. Wu, Z. Zhu, and X. Hu, “High Performance Visual Tracking with Siamese Region Proposal Network,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 8971-8980, June 2018.
[3] B. Li, W. Wu, Q. Wang, F. Zhang, J. Xing, and J. Yan, “SiamRPN++: Evolution of Siamese visual tracking with very deep networks,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 4282-4291, June 2019.
[4] D. Guo, J. Wang, Y. Cui, Z. Wang, and S. Chen, “SiamCAR: Siamese fully convolutional classification and regression for visual tracking,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 6268-6276, June 2020.
[5] Z. Chen, B. Zhong, G. Li, S. Zhang, and R. Ji, “Siamese box adaptive network for visual tracking,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 6667-6676, June 2020.
[6] Z. Zhang, H. Peng, J. Fu, B. Li, and W. Hu, “Ocean: object-aware anchor-free tracking,” in Proc. European Conference on Computer Vision, pp. 771-787, Aug. 2020.
[7] M. Danelljan, G. Bhat, F. S. Khan, and M. Felsberg, “ATOM: Accurate tracking by overlap maximization,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 4655-4664, June 2019.
[8] G. Bhat, M. Danelljan, L. Van Gool, and R. Timofte, “Learning discriminative model prediction for tracking,” in Proc. IEEE International Conference on Computer Vision, pp. 6181-6190, Oct. 2019.
[9] M. Danelljan, L. Van Gool, and R. Timofte, “Probabilistic regression for visual tracking,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 7181-7190, June 2020.
[10] X. Chen, B. Yan, J. Zhu, D. Wang, X. Yang, and H. Lu, “Transformer tracking,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, June 2021.
[11] N. Wang, W. Zhou, J. Wang, and H. Li, “Transformer meets tracker: exploiting temporal context for robust visual tracking,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, June 2021.
[12] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, L. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proc. International Conference on Neural Information Processing Systems, pp. 6000-6010, Dec. 2017.
[13] T. Yang and A. B. Chan, “Learning dynamic memory networks for object tracking,” in Proc. European Conference on Computer Vision, pp. 152-167, Oct. 2018.
[14] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah, “Signature verification using a ‘Siamese’ time delay neural network,” in Proc. Conference on Neural Information Processing Systems, Vol. 6, pp. 737-744, Nov. 1993.
[15] S. Chopra, R. Hadsell, and Y. LeCun, “Learning a similarity metric discriminatively, with application to face verification,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 539–546, June 2005.
[16] G. Koch, R. Zemel, and R. Salakhutdinov, “Siamese neural networks for one-shot image recognition,” in Proc. International Conference on Machine Learning Deep Learning Workshop, Vol. 2, July 2015.
[17] D. Held, T. Sebastian, and S. Silvio, “Learning to track at 100 fps with deep regression networks.” in Proc. European Conference on Computer Vision, pp. 749-765, Oct. 2016.
[18] S. Ren, K. He, R. Girshick, and J. Sun, “Faster R-CNN: Towards real-time object detection with region proposal networks,” IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 39, No. 6, pp. 1137-1149, June 2017.
[19] Y. Wu, J. Lim, and M. Yang, “Object tracking benchmark,” IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 37, No. 9, pp.1834-1848, Sept. 2015.
[20] M. Kristan, A. Leonardis, J. Matas, M. Felsberg, R. Pflugfelder, L. C. Zajc, T. Vojir, G. Bhat, A. Lukezic, A. Eldesokey, G. Fernandez, et al., “The sixth visual object tracking VOT2018 challenge results,” in Proc. European Conference on Computer Vision Workshops, pp. 3-53, Jan. 2018.
[21] H. Fan, L. Lin, F. Yang, P. Chu, G. Deng, S. Yu, H. Bai, Y. Xu, C. Liao, and H. Ling, “LaSOT: A high-quality benchmark for large-scale single object tracking,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 5369-5378, June 2019.
[22] M. Muller, A. Bibi, S. Giancola, S. Al-Subaihi, and B. Ghanem, “Trackingnet: A large-scale dataset and benchmark for object tracking in the wild,” in Proc. European Conference on Computer Vision, pp. 300-317, Oct. 2018.
[23] Z. Tian, C. Shen, H. Chen, and T. He, “FCOS: Fully convolutional one-stage object detection,” in Proc. IEEE International Conference on Computer Vision, pp. 9626-9635, Oct. 2019.
[24] L. Huang, X. Zhao, and K. Huang, “GOT-10k: A large high-diversity benchmark for generic object tracking in the wild,” IEEE Trans. Pattern Analysis and Machine Intelligence, Vol. 43, No. 5, pp. 1562-1577, Dec. 2019.
[25] F. Du, P. Liu, W. Zhao, and X. Tang, “Correlation-guided attention for corner detection based visual tracking,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 6835-6844, June 2020.
[26] Y. Yu, Y. Xiong, W. Huang, and M. R. Scott, “Deformable Siamese attention networks for visual object tracking,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 6727-6736, June 2020.
[27] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” in Proc. International Conference on Learning Representations, May 2015.
[28] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using RNN encoder–decoder for statistical machine translation,” in Proc. Conference on Empirical Methods in Natural Language Processing, pp. 1724–1734, Oct. 2014.
[29] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 770-778, June 2016.
[30] A. Graves, “Generating sequences with recurrent neural networks,” arXiv preprint arXiv:1308.0850, June 2014.
[31] X. Wang, R. Girshick, A. Gupta, and K. He, “Non-local neural networks,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 7794-7803, June 2018.
[32] J. Hu, L. Shen, and G. Sun, “Squeeze-and-excitation networks,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 7132-7141, June 2018.
[33] A. He, C. Luo, X. Tian, and W. Zeng, “A twofold Siamese network for real-time object tracking,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 4834–4843, June 2018.
[34] D. C. Luvizon, H. Tabia, and D. Picard, “Human pose regression by combining indirect part detection and contextual information,” Computers & Graphics, Vol. 85, pp. 15-22, Dec. 2019.
[35] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, pp. 2818-2826, June 2016.
[36] D. Ulyanov, A. Vedaldi, and V. Lempitsky, “Instance normalization: the missing ingredient for fast stylization,” arXiv preprint arXiv:1607.08022, July 2016.
[37] Y. Xu, Z. Wang, Z. Li, Y. Yuan, and G. Yu, “SiamFC++: Towards robust and accurate visual tracking with target estimation guidelines,” in Proc. AAAI Conference on Artificial Intelligence, Vol. 34, No. 7, pp. 12549-12556, April 2020.
[38] S. Liu and W. Deng, “Very deep convolutional neural network based image classification using small training sample size,” in Proc. IAPR Asian Conference on Pattern Recognition, pp. 730-734, Nov. 2015.
[39] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, and M. Bernstein, “ImageNet large scale visual recognition challenge,” International Journal of Computer Vision, Vol. 115, No. 3, pp. 211-252, Apr. 2015.
[40] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proc. the 32nd International Conference on Machine Learning, Vol. 37, pp. 448-456, July 2015.
[41] T. Lin, P. Goyal, R. Girshick, K. He, and P. Dollár, “Focal loss for dense object detection,” in Proc. IEEE International Conference on Computer Vision, pp. 2999-3007, Oct. 2017.
[42] J. Yu, Y. Jiang, Z. Wang, Z. Cao, and T. Huang, “Unitbox: An advanced object detection network,” in Proc. the 24th ACM International Conference on Multimedia, pp. 516-520, Oct. 2016.
[43] M. Mueller, N. Smith, and B. Ghanem, “A benchmark and simulator for UAV tracking,” in Proc. European Conference on Computer Vision, pp. 445-461, Oct. 2016.
[44] H. K. Galoogahi, A. Fagg, C. Huang, D. Ramanan, and S. Lucey, “Need for speed: A benchmark for higher frame rate object tracking,” in Proc. IEEE International Conference on Computer Vision, pp. 1134-1143, Oct. 2017.
Advisor: Chih-Wei Tang (唐之瑋)   Date of Approval: 2021-7-19
