SiamCATR: An Efficient and Accurate Visual Tracking via Cross-Attention Transformer and Channel-Attention Feature Fusion Network Based on Siamese Network


Please use this identifier to cite or link to this item: http://ir.lib.ncu.edu.tw/handle/987654321/95823

Title:	SiamCATR: An Efficient and Accurate Visual Tracking via Cross-Attention Transformer and Channel-Attention Feature Fusion Network Based on Siamese Network
Authors:	李俊霖;Lee, Chun-Lin
Contributors:	Department of Electrical Engineering
Keywords:	Single Visual Object Tracking;Neural Network Model;CNN-Transformer Architecture
Date:	2024-08-13
Issue Date:	2024-10-09 17:18:43 (UTC+8)
Publisher:	National Central University
Abstract:	Visual object tracking is a long-standing problem in computer vision, with applications in autonomous driving, surveillance systems, and drones. Its goal is to accurately track a specified target through a continuous image sequence and to remain stable even under partial occlusion, illumination changes, fast motion, or cluttered backgrounds. With the rapid development of deep learning, visual object trackers have evolved from traditional feature-matching methods to deep neural networks that extract rich features for tracking. More recently, following the success of Vision Transformer models across many tasks, tracking performance has improved significantly. However, these gains in accuracy have come with a substantial increase in parameter count and computational cost. Because visual object tracking is often deployed on edge devices with limited hardware resources, tracking in real time becomes a major challenge. Designing a tracker that is efficient and lightweight while preserving accuracy is therefore a challenging research direction.
In this thesis, we propose SiamCATR, a hybrid model that combines a Convolutional Neural Network (CNN) with a Transformer architecture. We introduce a Transformer-based cross-attention mechanism to strengthen the model's representation of similar features across feature maps. To fuse features effectively, we also incorporate a channel-attention depthwise cross-correlation, so that target information is fully combined in every feature channel. Together, these modules form an efficient feature fusion network. We evaluate the model on multiple visual object tracking datasets. The results show that, compared with current efficient, lightweight tracking architectures, the proposed architecture achieves the best accuracy while meeting real-time tracking requirements, demonstrating that the model is competitive on visual object tracking tasks.
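The abstract does not give the layer-level design, so the following is only a minimal sketch of what a channel-gated depthwise cross-correlation could look like, not the thesis's implementation. It assumes feature maps stored as nested lists shaped [C][H][W], no padding, stride 1, and a simplified squeeze-and-excitation-style gate with one hypothetical weight and bias per channel. The function names `depthwise_xcorr` and `channel_attention` are illustrative, and the cross-attention stage is omitted.

```python
import math


def depthwise_xcorr(search, template):
    """Correlate each search channel with the matching template channel.

    Inputs are nested lists shaped [C][H][W]. The output is [C][Ho][Wo]
    with Ho = Hs - Ht + 1 and Wo = Ws - Wt + 1 (no padding, stride 1).
    """
    out = []
    for s_ch, t_ch in zip(search, template):
        hs, ws = len(s_ch), len(s_ch[0])
        ht, wt = len(t_ch), len(t_ch[0])
        resp = [
            [
                sum(s_ch[i + u][j + v] * t_ch[u][v]
                    for u in range(ht) for v in range(wt))
                for j in range(ws - wt + 1)
            ]
            for i in range(hs - ht + 1)
        ]
        out.append(resp)
    return out


def channel_attention(resp, weights, bias):
    """Rescale each response channel by a sigmoid gate (assumed form).

    The gate is sigmoid(w * mean + b) with one hypothetical (w, b) pair per
    channel. A full squeeze-and-excitation block would instead mix channels
    through fully connected layers.
    """
    gated = []
    for ch, w, b in zip(resp, weights, bias):
        mean = sum(sum(row) for row in ch) / (len(ch) * len(ch[0]))
        gate = 1.0 / (1.0 + math.exp(-(w * mean + b)))
        gated.append([[gate * x for x in row] for row in ch])
    return gated


# Toy input: 2 channels, 3x3 search features, 2x2 template features.
search = [
    [[1, 2, 3], [4, 5, 6], [7, 8, 9]],
    [[0, 0, 0], [0, 1, 0], [0, 0, 0]],
]
template = [
    [[1, 0], [0, 1]],
    [[1, 1], [1, 1]],
]
response = depthwise_xcorr(search, template)
gated = channel_attention(response, weights=[0.0, 0.0], bias=[0.0, 0.0])
```

With zero weights and biases every gate equals 0.5, so the toy call simply halves each response map. In a trained model the gates would differ per channel and emphasize the channels most useful for locating the target.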
Appears in Collections:	[Graduate Institute of Electrical Engineering] Electronic Thesis & Dissertation