Automatic Crawling System for Collecting Live Streaming Information

Name

Wei-Xun Kuo (郭維勳)

Department

Department of Communication Engineering

Thesis Title

Automatic Crawling System for Collecting Live Streaming Information
(蒐集直播串流資訊之自動化爬蟲系統)

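Related Theses

★ Precise Indoor Positioning Using a Smart Antenna System
★ Design and Implementation of Contention Access and Routing Methods for Power Line Communication
★ Design and Implementation of a P2P Live Streaming System Based on the GRAPES Library
★ Indoor Positioning Using Discrete-Cosine-Based Audio Watermarking
★ Precise Indoor Positioning Using a Smart Antenna System with Virtual Fingerprint Construction
★ A Study of an Adaptive Playback System for Real-Time Video Streaming
★ Reducing Location Management Cost in Cellular Networks Using a Fuzzy Logic Controller
★ Development of an Earthquake Early Warning System Based on Support Vector Machines and Fuzzy Inference
★ A Distributed Multi-Party Conferencing System for Mobile Devices
★ A Clustering-Based Channel Access Method for 3D Wireless and Optical Networks-on-Chip
★ A Study of Exposed Nodes in Heterogeneous Networks under the Licensed-Assisted Access (LAA) Architecture Based on the Listen-Before-Receive (LBR) Mechanism
★ Design and Implementation of a Frequency-Hopping Mechanism for IEEE 802.15.4 ZigBee Wireless Personal Area Networks
★ An Intelligent Strategy for Power-Saving Mode Parameter Configuration in IEEE 802.16 Mobile Wireless Metropolitan Area Networks
★ Analysis and Design of High-Performance Routing Algorithms for IEEE 802.15.4 ZigBee Wireless Personal Area Networks
★ An Adaptive ARQ Feedback Mechanism for IEEE 802.16 Wireless Broadband Metropolitan Area Networks
★ Automatic Floor Plan Construction Based on Wireless Sensor Networks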

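The author has granted immediate open access to this electronic thesis.
The openly accessible full text is licensed to users only for searching, reading, and printing for personal, non-profit academic research.
Please comply with the Copyright Act of the Republic of China. Do not reproduce, distribute, adapt, repost, or broadcast this work without authorization, to avoid breaking the law.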

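Abstract (translated from Chinese)

With advances in computer networks and mobile communication technology, bandwidth is now sufficient to support multimedia applications. People have grown used to watching video on consumer electronics (3C products), and the audience for cable and traditional television has steadily shrunk. Live broadcasts once came only from radio or television stations, but today live streaming is a way of sharing information that is within anyone's reach.
Since 2016 the live streaming industry has grown steadily. Viewers can interact with streamers in real time from anywhere, and many merchants sell goods through live streams, giving rise to the emerging "live-stream e-commerce" industry. Live streaming is growing explosively.
The Wayback Machine preserves hundreds of millions of historical records of web pages worldwide, and many pages from sites that shut down because of poor management or other reasons can still be found there. However, newer websites are built with dynamic content, so the Wayback Machine can capture only a small part of their content.
Despite the arrival of the live streaming era, no historical database properly collects information from live streaming platforms, so this study proposes an automated content crawler system for them. Collecting complete channel information currently requires a crawler engineer to write a dedicated crawler for each platform. The larger the live streaming market becomes, the more new platforms want a share of it: new platforms keep appearing, and existing ones keep redesigning to improve the user experience. This study therefore aims to design an automated crawler system for live streaming platform information whose crawlers keep operating automatically when new platforms appear and existing platforms are revised.
The proposed system has three crawler types: an API crawler, an AJAX crawler, and a DOM crawler. The system chooses the most suitable type for each platform based on its web page structure. The API crawler applies when a platform offers an API service; the crawler is then written by hand from the API documentation. The AJAX crawler captures the HTTP requests a platform issues when loading data, then filters them and analyzes their parameters to obtain the request URL for the dynamic content. The DOM crawler fetches the platform's web page, converts it into a DOM tree, identifies the repeated live stream blocks, and extracts channel information from those blocks.
Of the three, the API and AJAX crawlers perform best, because each retrieval needs only a lightweight HTTP request. The DOM crawler is the most general, but it must launch a browser and drive it to obtain the live stream information, so it performs worst. Even so, it successfully crawls most live streaming platforms.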

Abstract (English)

With the development of computer networks and radio access technologies, bandwidth is now sufficient to support multimedia applications. People are accustomed to watching video and other media on consumer electronics (3C products), and the market for cable and traditional television has gradually declined. Live broadcasts could once be received only from radio or television stations, but with advances in technology, live streaming has become a way for anyone to share information.
Since 2016, the live streaming industry has gradually flourished. Wherever they are, people can interact with a streamer in real time through a live streaming platform. Many merchants sell products through live streams, which has created the emerging industry of "e-commerce over live streaming". Live streaming is growing explosively.
The Wayback Machine keeps hundreds of millions of historical records of webpages worldwide. Many websites close because of poor management or other reasons, and most of them can still be found in the Wayback Machine. However, most newer websites are built with dynamic content, so the Wayback Machine can capture only a small amount of their content.
Despite the popularity of live streaming, no historical database collects information from live streaming platforms comprehensively, so this study proposes an automated content crawler system for live streaming platforms. To collect a platform's channel information completely, a crawler engineer must currently design a dedicated crawler program for each platform. As the live streaming market grows, more new platforms want a share of it. New platforms keep appearing, and existing platforms are updated constantly to improve the user experience. To address these problems, this study designs an automated information crawler system for live streaming platforms whose crawlers continue to operate automatically when new platforms appear and existing platforms are revised.
The proposed system comprises three types of crawler: an API crawler, an AJAX crawler, and a DOM crawler. The system selects the most suitable type according to the webpage structure of the platform. The API crawler applies when the live streaming platform provides an API service; the crawler program is then written according to the API documentation, and this part is done manually. The AJAX crawler captures the HTTP requests that the platform issues when loading data, then filters them and examines their parameters to obtain the request URL for the dynamic content. The DOM crawler fetches the platform's webpage, converts it into a DOM tree, identifies the repeated live streaming blocks, and extracts channel information from those blocks.
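The abstract does not give the AJAX crawler's filtering rules, so the following is only a minimal sketch of one plausible reading of "filtering and parameter judgment". It assumes captured traffic is available as dictionaries with `url`, `content_type`, and `body` keys (a hypothetical shape, not the thesis's format), keeps responses that are JSON and contain a list of several objects, and reports integer query parameters as paging candidates. The `live.example` URLs and every function name are illustrative.

```python
import json
from urllib.parse import urlsplit, parse_qsl


def _object_lists(node):
    """Yield every non-empty list of dicts found anywhere in a decoded JSON value."""
    if isinstance(node, list):
        if node and all(isinstance(item, dict) for item in node):
            yield node
        for item in node:
            yield from _object_lists(item)
    elif isinstance(node, dict):
        for value in node.values():
            yield from _object_lists(value)


def looks_like_data_request(record, min_items=3):
    """Heuristic (an assumption): a JSON response holding a list of >= min_items objects."""
    media_type = record["content_type"].split(";")[0].strip().lower()
    if media_type != "application/json":
        return False
    try:
        payload = json.loads(record["body"])
    except ValueError:  # json.JSONDecodeError is a subclass of ValueError
        return False
    return any(len(items) >= min_items for items in _object_lists(payload))


def filter_requests(records):
    """Return the URLs of captured requests that look like channel-list loads."""
    return [r["url"] for r in records if looks_like_data_request(r)]


def numeric_params(url):
    """Return query parameters with integer values, as candidates for paging."""
    pairs = parse_qsl(urlsplit(url).query)
    return {k: int(v) for k, v in pairs if v.isascii() and v.isdigit()}


# Made-up capture log for illustration only.
CAPTURED = [
    {"url": "https://live.example/static/app.js",
     "content_type": "application/javascript",
     "body": "console.log(1)"},
    {"url": "https://live.example/api/channels?offset=0&limit=3&lang=zh",
     "content_type": "application/json; charset=utf-8",
     "body": json.dumps({"data": [{"id": 1}, {"id": 2}, {"id": 3}]})},
    {"url": "https://live.example/api/me",
     "content_type": "application/json",
     "body": '{"user": null}'},
]

DATA_URLS = filter_requests(CAPTURED)
PAGING = numeric_params(DATA_URLS[0])
```

Once such a URL is identified, later crawls can request it directly with adjusted paging values, which is consistent with the abstract's point that the AJAX crawler needs only lightweight HTTP requests.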
The API and AJAX crawlers perform best, because each retrieval sends only a lightweight HTTP request. The DOM crawler is the most versatile, but it must launch a browser and obtain the live streaming information by operating it, so its performance is the worst. Even so, the DOM crawler can successfully crawl most live streaming platforms.
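The DOM crawler's step of identifying repeated live streaming blocks can be illustrated with a small sketch. It is not the thesis's algorithm: it assumes the page has already been rendered to an HTML string (the thesis drives a real browser for that), builds a simple tree with the standard-library `html.parser`, and treats the largest group of sibling elements sharing the same tag structure as the channel blocks. The sample markup, class names, and the `min_repeat` threshold are hypothetical.

```python
from collections import Counter
from html.parser import HTMLParser

VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, parent=None):
        self.tag = tag
        self.parent = parent
        self.children = []
        self.text = []

    def signature(self):
        """Structural fingerprint: tag name plus the fingerprints of the children."""
        return self.tag + "(" + ",".join(c.signature() for c in self.children) + ")"

    def all_text(self):
        """This node's own text first, then the text of its descendants."""
        parts = list(self.text) + [c.all_text() for c in self.children]
        return " ".join(p for p in parts if p)


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:  # void elements never receive children
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag, if there is one.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        if data.strip():
            self.current.text.append(data.strip())


def find_repeated_blocks(root, min_repeat=3):
    """Return the largest group of same-signature siblings that have inner structure."""
    best = []
    stack = [root]
    while stack:
        node = stack.pop()
        stack.extend(node.children)
        groups = Counter(c.signature() for c in node.children if c.children)
        if not groups:
            continue
        sig, count = groups.most_common(1)[0]
        if count >= min_repeat and count > len(best):
            best = [c for c in node.children if c.signature() == sig]
    return best


def extract_channels(html):
    builder = TreeBuilder()
    builder.feed(html)
    return [block.all_text() for block in find_repeated_blocks(builder.root)]


SAMPLE = """
<html><body>
<ul class="nav"><li><a href="/">Home</a></li><li><a href="/about">About</a></li></ul>
<div class="list">
  <div class="card"><a href="/c/1">Alice</a><span>120</span></div>
  <div class="card"><a href="/c/2">Bob</a><span>85</span></div>
  <div class="card"><a href="/c/3">Carol</a><span>7</span></div>
</div>
</body></html>
"""

CHANNELS = extract_channels(SAMPLE)
```

In this sample the two navigation items fall below the repeat threshold, so only the three card elements are selected. A real page would need a more tolerant similarity measure, since live blocks often differ slightly in structure (badges, missing thumbnails).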

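Keywords (translated from Chinese)

★ Dynamic Web Page Crawler
★ Live Streaming Crawler
★ DOM Crawler
★ AJAX Crawler
★ Live Streaming Platform Crawler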

Keywords (English)

★ Dynamic Web Crawler
★ Live Streaming Crawler
★ DOM Crawler
★ AJAX Crawler
★ Live Streaming Platform Crawler

Table of Contents

CHINESE ABSTRACT
ABSTRACT
CONTENTS
LIST OF FIGURES
LIST OF TABLES
1. INTRODUCTION
2. BACKGROUND
2-1 Deep Web
2-2 Web Crawler
2-3 Document Object Model
2-4 Web Browser Driver
2-5 Selenium
3. RELATED WORKS
4. DESIGN AND MECHANISM
4-1 API Crawler
4-2 AJAX Crawler
4-2-1 HTTP Request Capture
4-2-2 HTTP Request Filter
4-2-3 Web Crawling Action Recognition
4-2-4 AJAX Crawler
4-3 DOM Crawler
4-3-1 Generate Parameter File
4-3-2 Handling Updates to Live Streaming Websites
5. EVALUATION
5-1 AJAX Crawler Method Results
5-1-1 AJAX URL Filter Results
5-1-2 Web Crawling Action Recognition
5-2 DOM Crawler Method Results
6. PERFORMANCE
7. CONCLUSIONS AND FUTURE WORK
8. REFERENCES

References

[1] https://twitchtracker.com/statistics
[2] https://www.statista.com/chart/11151/streaming-is-mainstream-for-young-adults/
[3] https://www.cnbc.com/2018/03/19/tyler-ninja-blevins-explains-how-he-makes-more-than-500000-a-month-playing-video-game-fortnite.html
[4] https://en.wikipedia.org/wiki/Ninja_(gamer)
[5] https://en.wikipedia.org/wiki/Wayback_Machine
[6] https://www.alexa.com/topsites
[7] https://brightplanet.com/2012/06/04/deep-web-a-primer/
[8] https://en.wikipedia.org/wiki/Deep_web
[9] https://en.wikipedia.org/wiki/Web_crawler
[10] https://www.elliance.com/aha/infographics/robotstxt-file-explained.aspx
[11] Zhiyong Zhang and Olfa Nasraoui, “Profile-based focused Crawler for Social Media-Sharing Websites”, Proc. IEEE International Conference on Tools with Artificial Intelligence, p. 319, Nov. 2008.
[12] Pierre Laperdrix, Nataliia Bielova, Benoit Baudry, and Gildas Avoine, “Browser Fingerprinting: A survey”, ACM Transactions on the Web, p. 8:5, April 2020.
[13] https://hackr.io/blog/complete-guide-selenium-webdriver
[14] Wu Wei, Shengsheng Shi, Yulong Liu, Haitao Wang, Chunfeng Yuan, and Yihua Huang, “Extraction Rule Language for Web Information Extraction and Integration”, Proc. 10th Web Information System and Application Conference, p. 6570, Nov. 2013.
[15] Debina Laishram and Merin Sebastian, “Extraction of web news from web pages using a ternary tree approach”, Proc. 2nd International Conference on Advances in Computing and Communication Engineering (ICACCE 2015), pp. 628-633, 2015.
[16] Yan Guo, Huifeng Tang, Linhai Song, Yu Wang, and Guodong Ding, “ECON: An Approach to Extract Content from Web News Page”, Proc. 12th International Asia-Pacific Web Conference, pp. 314-320, 2010.
[17] Madhura R. Kaddu and R. B. Kulkarni, “To Extract Informative Content from online web pages by using Hybrid Approach”, Proc. International Conference on Electrical, Electronics, and Optimization Techniques (ICEEOT), pp. 972-977, 2016.
[18] https://dev.twitch.tv/docs/api/
[19] https://en.wikipedia.org/wiki/Media_type
[20] https://youtube.com

Advisor

Shiann-Tsong Sheu (許獻聰)

Approval Date

2020-7-30
