Parallel Information-Theoretic Co-Clustering Algorithm (平行化資訊理論共分群演算法)


Name

Shih-Hsien Chao (趙士賢)

Department

Graduate Institute of Software Engineering

Thesis Title

Parallel Information-Theoretic Co-Clustering based on MapReduce (平行化資訊理論共分群演算法)



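The author has agreed to make this electronic thesis openly available immediately.
Electronic full texts that have reached their open-access date are licensed to users solely for personal, non-profit searching, reading, and printing for the purpose of academic research.
Please comply with the Copyright Act of the Republic of China. Do not reproduce, distribute, adapt, repost, or broadcast this work without authorization.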

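Abstract (Chinese)

Data clustering is widely applied in many fields, such as data mining, document retrieval, image segmentation, and pattern classification. Traditional data clustering algorithms can usually be applied only to small-scale data analysis. Today, clustering tasks often involve several gigabytes of data, which an ordinary single computer can no longer process. To address this problem, many researchers have tried to design efficient parallel clustering algorithms for large-scale data.
This thesis focuses on the Information-Theoretic Co-clustering (ITCC) algorithm. ITCC is a co-clustering algorithm that clusters rows and columns simultaneously, and its objective function is based on the mutual information between the row and column vectors. ITCC is widely used in many fields, such as text mining, social recommendation systems, and bioinformatics.
In this thesis, we propose the Parallel Information-Theoretic Co-Clustering (PITCC) algorithm. Because the amount of data to be processed is very large, we use Hadoop, a parallel computing platform that has become popular in recent years, and carry out the computation with MapReduce. MapReduce is a simple yet very powerful programming model that has been widely accepted by both academia and industry. Besides being highly scalable, Hadoop is also easy to use. We use the dataset released for the CAMRa2011 contest. Finally, in the experiments we use three performance measures to evaluate our method, and the results show that the proposed algorithm is efficient and can handle large datasets.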
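The MapReduce model mentioned in this abstract can be illustrated with a minimal single-process sketch. This is not the PITCC implementation and does not use Hadoop. The records, `map_row_sum`, `reduce_sum`, and `run_mapreduce` are hypothetical names and data introduced only for illustration. The sketch computes row marginals of a small rating matrix, one kind of aggregate that a co-clustering job would plausibly need.

```python
from collections import defaultdict

# Hypothetical toy input: (user_id, movie_id, rating) records.
RECORDS = [
    ("u1", "m1", 4.0),
    ("u1", "m2", 2.0),
    ("u2", "m1", 1.0),
    ("u3", "m2", 3.0),
]


def map_row_sum(record):
    """Map: emit one (row key, value) pair per input record."""
    user, _movie, rating = record
    yield user, rating


def reduce_sum(key, values):
    """Reduce: combine all values that share the same key."""
    return key, sum(values)


def run_mapreduce(records, mapper, reducer):
    """Single-process stand-in for the map, shuffle, and reduce phases."""
    groups = defaultdict(list)
    for record in records:                 # map phase
        for key, value in mapper(record):
            groups[key].append(value)      # shuffle: group values by key
    # reduce phase: one reducer call per distinct key
    return dict(reducer(key, values) for key, values in groups.items())


row_sums = run_mapreduce(RECORDS, map_row_sum, reduce_sum)
total = sum(row_sums.values())
# Normalizing the per-row sums gives the row marginal distribution p(x).
row_marginals = {user: s / total for user, s in row_sums.items()}
```

On a real cluster the map and reduce calls would run on different machines and the grouping step would be handled by the framework. The sketch only shows the division of work between the two functions.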

Abstract (English)

Data clustering is widely used in many domains, for example data mining, document retrieval, image segmentation, and pattern classification. Traditional clustering algorithms are usually suited only to small-scale data analysis. Today we often have to deal with large datasets that cannot be processed on a single computer. To solve this problem, many researchers have attempted to design efficient parallel clustering algorithms for large-scale data.
In this thesis we focus on Information-Theoretic Co-clustering (ITCC), which clusters the rows and columns simultaneously based on the mutual information between the clustered random variables, subject to constraints on the number of row and column clusters. ITCC is widely used in many domains, such as text mining, social recommendation systems, and bioinformatics.
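As a rough illustration of the quantity ITCC works with, the sketch below computes the mutual information of a joint distribution and the information lost when rows and columns are merged into clusters. In the ITCC formulation of Dhillon et al., a good co-clustering keeps this loss small. The function names and the 4x4 example distribution are assumptions made for illustration and are not taken from the thesis.

```python
import math


def mutual_information(p):
    """Return I(X;Y) in nats for a joint distribution given as a list of rows."""
    row_marg = [sum(row) for row in p]
    col_marg = [sum(col) for col in zip(*p)]
    total = 0.0
    for i, row in enumerate(p):
        for j, pij in enumerate(row):
            if pij > 0:  # the term 0 * log(0) is taken to be 0
                total += pij * math.log(pij / (row_marg[i] * col_marg[j]))
    return total


def cluster_joint(p, row_labels, col_labels, k, l):
    """Aggregate p(x, y) into the clustered joint p(x_hat, y_hat)."""
    q = [[0.0] * l for _ in range(k)]
    for i, row in enumerate(p):
        for j, pij in enumerate(row):
            q[row_labels[i]][col_labels[j]] += pij
    return q


def information_loss(p, row_labels, col_labels, k, l):
    """Loss in mutual information caused by a given co-clustering."""
    clustered = cluster_joint(p, row_labels, col_labels, k, l)
    return mutual_information(p) - mutual_information(clustered)


# Hypothetical block-structured joint distribution (entries sum to 1).
P = [
    [0.125, 0.125, 0.0, 0.0],
    [0.125, 0.125, 0.0, 0.0],
    [0.0, 0.0, 0.125, 0.125],
    [0.0, 0.0, 0.125, 0.125],
]

# Clustering that matches the block structure versus one that mixes blocks.
loss_good = information_loss(P, [0, 0, 1, 1], [0, 0, 1, 1], 2, 2)
loss_bad = information_loss(P, [0, 1, 0, 1], [0, 0, 1, 1], 2, 2)
```

Here `loss_good` is zero because the clusters follow the blocks exactly, while `loss_bad` equals the full mutual information of `P` (ln 2) because the mixed row clusters carry no information about the column clusters.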
We propose a Parallel Information-Theoretic Co-Clustering (PITCC) algorithm based on MapReduce. Because we need to analyze very large datasets, we develop our algorithm on a cloud computing platform based on Hadoop. MapReduce is a programming model that has been widely embraced by both academia and industry because of its high scalability and ease of use. We use the dataset from the movie recommendation contest “CAMRa2011” for our experiments and evaluate the results in terms of speedup, sizeup, and scaleup. The experimental results demonstrate that the proposed algorithm is powerful and efficient and that it can process large datasets on commodity hardware.
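This record does not spell out how the three measures are defined, so the sketch below uses the definitions commonly found in the parallel clustering literature and should be read as an assumption. The function names are illustrative, and the timings in the example are made-up numbers, not results from the thesis.

```python
def speedup(t_one_node, t_m_nodes):
    """Same dataset, m times as many nodes: T(1 node) / T(m nodes).

    A value close to m indicates near-linear speedup.
    """
    return t_one_node / t_m_nodes


def sizeup(t_base_data, t_m_times_data):
    """Same cluster, m times as much data: T(m * data) / T(data).

    A value at or below m means run time grows no faster than the data.
    """
    return t_m_times_data / t_base_data


def scaleup(t_one_node_base_data, t_m_nodes_m_times_data):
    """m times the nodes and m times the data: T(1, data) / T(m, m * data).

    A value close to 1 means the system absorbs proportional growth well.
    """
    return t_one_node_base_data / t_m_nodes_m_times_data


# Hypothetical timings in seconds, for illustration only.
example_speedup = speedup(120.0, 40.0)    # 4 nodes versus 1 node
example_sizeup = sizeup(40.0, 100.0)      # 4x data on the same cluster
example_scaleup = scaleup(120.0, 150.0)   # 4 nodes with 4x data versus 1 node
```

With these made-up timings the three values are 3.0, 2.5, and 0.8.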

Keywords (Chinese)

★ co-clustering
★ cloud

Keywords (English)

★ co-clustering
★ cloud computing
★ Hadoop
★ MapReduce

Table of Contents

Chinese Abstract
1. Introduction
2. Related Work
2.1. MapReduce & Hadoop
2.2. Co-Clustering
3. Background: Information-Theoretic Co-Clustering (ITCC)
3.1. Notation
3.2. ITCC Framework and Algorithm
4. Parallel Co-Clustering Algorithm (PITCC Algorithm)
4.1. Parallel ITCC Framework
4.2. Parallel ITCC Algorithm
5. Experiments
5.1. Dataset and Settings
5.2. Evaluation Methods
5.3. Results
5.3.1. Speedup
5.3.2. Sizeup
5.3.3. Scaleup
6. Conclusion
7. References


Advisor

Chia-Hui Chang (張嘉惠)

Approval Date

2012-7-27
