Research on scene text detection has made breakthroughs in recent years and supports many applications, such as document text detection and license plate recognition in parking lots. However, detecting scene text of arbitrary shapes, such as that on signboards and billboards, remains challenging: many methods cannot fully delineate curved text, nor can they effectively separate adjacent text instances. We therefore propose a more effective model that better fuses and exploits features to detect scene text of arbitrary shapes. Our method predicts the central region of each text instance and expands the predicted probability map in a post-processing step to recover the full text region. We propose a Multi-Scale Feature Fusion Network to extract and fuse features more effectively; it includes Multi-Scale Attention Modules (MSAMs), which incorporate Self-Attention Modules (SAMs) to refine features, and a Self-Attention Head (SAH) that predicts the final text probability map. Experiments confirm the effectiveness of the method, which achieves an F-score of 87.4 on the Total-Text dataset.
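To make the center-region expansion step concrete, below is a minimal sketch of one plausible post-processing routine, assuming a DB-style "unclip" (Vatti polygon offsetting via pyclipper). The function name expand_center_regions, the binarization threshold, and the area/perimeter offset formula are illustrative assumptions, not the exact procedure used in this work.

```python
# Minimal sketch: expand predicted text-center regions into full text
# polygons via polygon offsetting. Assumes a DB-style unclip rule; the
# paper's actual expansion may differ. Requires numpy, opencv-python,
# and pyclipper.
import cv2
import numpy as np
import pyclipper


def expand_center_regions(prob_map, bin_thresh=0.3, unclip_ratio=1.5):
    """Binarize a text-center probability map and dilate each region's
    contour outward to approximate the full text polygon."""
    binary = (prob_map > bin_thresh).astype(np.uint8)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL,
                                   cv2.CHAIN_APPROX_SIMPLE)
    polygons = []
    for contour in contours:
        pts = contour.reshape(-1, 2)
        if len(pts) < 4:
            continue  # too few points to form a text polygon
        area = cv2.contourArea(contour)
        perimeter = cv2.arcLength(contour, True)
        if perimeter == 0:
            continue
        # Offset distance proportional to area / perimeter
        # (assumed here, following DB's unclip heuristic).
        distance = area * unclip_ratio / perimeter
        offsetter = pyclipper.PyclipperOffset()
        offsetter.AddPath(pts.tolist(), pyclipper.JT_ROUND,
                          pyclipper.ET_CLOSEDPOLYGON)
        expanded = offsetter.Execute(distance)
        polygons.extend(np.array(p) for p in expanded)
    return polygons


if __name__ == "__main__":
    # Toy probability map with one high-confidence center stripe.
    prob = np.zeros((64, 64), dtype=np.float32)
    prob[28:36, 10:54] = 0.9
    for poly in expand_center_regions(prob):
        print(poly.shape)  # each polygon: (num_points, 2)
```

The offset distance scales with each region's area-to-perimeter ratio, so thin center stripes grow proportionally to their thickness, which keeps adjacent text instances separated after expansion.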