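In single object tracking, trackers built on hierarchical Vision Transformer (ViT) architectures often perform worse than those built on a plain ViT, and the architectures in the literature differ from one another, so there is no general-purpose network architecture. This thesis proposes a general hierarchical network architecture, HyperXTrack, the first to bring a backbone network architecture into the tracking task as the interaction network. It also incorporates spatio-temporal context: the spatial context is multi-scale information, and the temporal context provides historical information. HyperXTrack performs both global and local spatial interaction, and the computational complexity of the interaction is linear in the image resolution. Each HyperXTrack block first matches fine-grained texture features and then cross-matches the appearance contour of the whole object. The interaction backbone uses the attention mechanism proposed in this thesis and follows the classic stacking rule of applying convolution before attention. Finally, this thesis proposes a lightweight re-pretraining strategy: starting from pre-trained MaxViT parameters, the network with the modified interaction operations is retrained for one epoch, after which its parameters can be transferred to downstream tasks. Experimental results show that HyperXTrack reaches an AO of 75% on the GOT-10k dataset, surpassing OSTrack's 71%, and that a hierarchical architecture with only 30M parameters can surpass OSTrack's ViT architecture with 93M parameters.;In single object tracking, trackers based on hierarchical Vision Transformer (ViT) architectures usually perform worse than those based on a plain ViT. At the same time, the network architectures of state-of-the-art trackers differ from one another, so there is no general-purpose network architecture. This thesis presents HyperXTrack, the first backbone network architecture applied to interaction in visual tracking. In addition, the proposed backbone interacts over spatio-temporal context, where the spatial context is multi-scale information and the temporal context provides historical information. HyperXTrack performs global and local spatial interaction, and its computational complexity is linear in the image resolution. Local texture features are correlated first, and then the contour of the entire object is compared. The interaction backbone adopts the proposed attention mechanism and the classic stacking rule in which convolutions are applied before the attention mechanism. Finally, this thesis proposes a lightweight re-pretraining strategy. After modifying the existing MaxViT network, this thesis loads the pre-trained MaxViT weights and re-pretrains for only one epoch, after which the network can be transferred to downstream tasks. The experimental results show that HyperXTrack achieves an AO of 71.8% on the GOT-10k dataset, surpassing OSTrack's 71%. Using a hierarchical architecture, HyperXTrack needs only 30M parameters to surpass the OSTrack architecture with 93M parameters.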
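The claim that local and global interaction can both stay linear in resolution can be illustrated with a small sketch. The abstract does not spell out how HyperXTrack groups tokens, so the code below assumes a MaxViT-style scheme, in which local attention runs inside contiguous windows and global attention runs over a sparse grid of tokens strided across the whole feature map. The function names `window_groups`, `grid_groups`, and `pairwise_cost` are hypothetical and exist only for this illustration; the sketch counts token pairs and is not the thesis's attention mechanism.

```python
def window_groups(height, width, p):
    """Local partition (assumed MaxViT-style): contiguous p x p windows.

    height and width are assumed to be divisible by p.
    """
    groups = {}
    for y in range(height):
        for x in range(width):
            groups.setdefault((y // p, x // p), []).append((y, x))
    return list(groups.values())


def grid_groups(height, width, g):
    """Global partition (assumed MaxViT-style): g x g tokens per group,
    sampled with a fixed stride so each group spans the whole map.

    height and width are assumed to be divisible by g.
    """
    stride_y, stride_x = height // g, width // g
    groups = {}
    for y in range(height):
        for x in range(width):
            groups.setdefault((y % stride_y, x % stride_x), []).append((y, x))
    return list(groups.values())


def pairwise_cost(groups):
    """Number of query-key pairs when attention is restricted to each group."""
    return sum(len(group) ** 2 for group in groups)


if __name__ == "__main__":
    for side in (8, 16, 32):
        tokens = side * side
        local = pairwise_cost(window_groups(side, side, 4))
        sparse_global = pairwise_cost(grid_groups(side, side, 4))
        full = tokens ** 2
        print(side, tokens, local, sparse_global, full)
```

With a fixed group size, doubling the side length quadruples the token count and quadruples the windowed and grid costs, while the cost of full attention grows sixteenfold. That is the sense in which such an interaction is linear in the number of tokens, under the assumptions stated above.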