Abstract: | Online news services now commonly offer Really Simple Syndication (RSS) channels for users to subscribe to. Faced with so many channels, however, users struggle to find and follow the news items that interest them when those items are scattered across separate channels. Selecting and acquiring the desired information efficiently is therefore a major design challenge for intelligent news information retrieval systems. This thesis uses RSS news streams as its news source and proposes a mechanism that synchronizes news data between remote RSS documents and a local news database; the system then automatically monitors news related to keywords the user has specified in advance. Two complementary monitoring schemes are proposed: Clustering Based on only Temporal Information (CBTI) and Time-Constrained TF-IDF Schemes (TCTIS). CBTI first applies the K-Means algorithm to the publication times of the RSS news items in each individual channel.
CBTI then uses the centroid time of each cluster in each channel to establish temporal relationships between clusters in different channels, and finally uses these relationships to merge the clusters from multiple channels into a single result for the user to read. TCTIS, by contrast, uses the incremental TF-IDF/IWF model for topic detection and tracking. When a news item reporting a new topic is detected, the system notifies the user, and it continues to track later reports on existing topics so that users can conveniently review all related items afterwards. Two characteristics of RSS news complicate this design. First, news texts are frequently revised and adjusted, which makes synchronization between the local database and the remote RSS documents difficult. The synchronization mechanism therefore cross-checks the content strings of four sub-elements of each RSS item (title, description, link, and pubDate) to determine, for any pair of items, which is newer or whether the two are the same, which improves the reliability of the collected data. Second, because an RSS news item is itself a short text, the traditional incremental TF-IDF/IWF model does not cluster well when monitoring topical events in an RSS news stream. TCTIS addresses this by additionally taking the temporal factor into account. Finally, this study reports practical problems encountered when gathering RSS documents, which may serve as a reference for researchers who collect RSS data or develop RSS reader software. |
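
To make the CBTI idea concrete, the following is a minimal sketch and not the thesis's implementation. It clusters each channel's publication times with one-dimensional K-Means and then links clusters across two channels by centroid time. The function names `kmeans_1d` and `link_clusters`, the deterministic centroid initialisation, and the `window` threshold used to relate centroids are all assumptions made for illustration; the abstract does not specify how centroid times are compared.

```python
from statistics import mean


def kmeans_1d(times, k, max_iter=50):
    """Cluster publication times (in seconds) into k groups with 1-D K-Means."""
    times = sorted(times)
    # Assumption: deterministic start, initial centroids spread over the sorted data.
    centroids = [times[(2 * i + 1) * len(times) // (2 * k)] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(max_iter):
        clusters = [[] for _ in range(k)]
        for t in times:
            # Assign each time to the nearest centroid.
            nearest = min(range(k), key=lambda i: abs(t - centroids[i]))
            clusters[nearest].append(t)
        # Recompute centroids; an empty cluster keeps its previous centroid.
        updated = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
        if updated == centroids:
            break
        centroids = updated
    return centroids, clusters


def link_clusters(centroids_a, centroids_b, window):
    """Pair clusters of two channels whose centroid times differ by at most `window`.

    Hypothetical linking rule: the source only says centroid times are used
    to relate clusters across channels.
    """
    return [
        (i, j)
        for i, ca in enumerate(centroids_a)
        for j, cb in enumerate(centroids_b)
        if abs(ca - cb) <= window
    ]


# Publication times as seconds from an arbitrary origin (made-up sample data).
channel_a = [0, 60, 120, 7200, 7260, 7320]
channel_b = [100, 160, 7400, 7460]

centroids_a, _ = kmeans_1d(channel_a, k=2)
centroids_b, _ = kmeans_1d(channel_b, k=2)
links = link_clusters(centroids_a, centroids_b, window=600)
print(links)  # [(0, 0), (1, 1)]
```

Each linked pair identifies clusters whose items could be merged into one group in the combined channel. The choice of `k` and `window` would need tuning against real feeds.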