聚类有效性研究综述

周开乐, 杨善林, 丁帅, 罗贺

系统工程理论与实践 ›› 2014, Vol. 34 ›› Issue (9) : 2417-2431.

PDF(826 KB)
PDF(826 KB)
系统工程理论与实践 ›› 2014, Vol. 34 ›› Issue (9) : 2417-2431. DOI: 10.12011/1000-6788(2014)9-2417
论文

聚类有效性研究综述

    周开乐1,2, 杨善林1,2, 丁帅1,2, 罗贺1,2
作者信息 +

On cluster validation

    ZHOU Kai-le1,2, YANG Shan-lin1,2, DING Shuai1,2, LUO He1,2
Author information +
文章历史 +

摘要

聚类是一个无监督学习过程,因此确定最佳聚类数是一项困难的工作. 聚类有效性研究是通过建立聚类有效性指标,评价聚类质量并确定最佳聚类数的过程. 首先,介绍了聚类的数学描述和聚类有效性指标的分类;然后,基于指标 构成成分的不同,分别评述了12 个仅考虑数据集几何结构信息的聚类有效性指标、6 个仅考虑隶属度的聚类有效性指标以及9 个同时考虑数据集几何结构信息和隶属度的聚类有效性指标,分析了不同类型指标的研究现状;接着,简要总结了外部性指标和稳定性指标等其他聚类有效性指标的研究现状;最后,总结并展望了聚类有效性研究面临的挑战和发展方向.

Abstract

Clustering is an unsupervised learning process, hence it is difficult to find the optimal cluster number. Cluster validation is a process in which a cluster validity index (CVI) is constructed to evaluate the quality of clustering results and determine the optimal cluster number. Firstly, the mathematical description of clustering and the classification of CVIs are introduced. Then, 12 CVIs only considering the geometry information of the data set, 6 CVIs only considering the degree of membership, and 9 CVIs considering both the geometry information of the data set and the degree of membership are reviewed respectively based on different components in the indices. And the status quo of each type of CVIs is analyzed. Afterwards, the studies of other types of CVIs, such as external and stability-based indices, are briefly summarized. Finally, we point out the main challenges and research directions in the area of cluster validation.

关键词

聚类 / 聚类有效性 / 聚类有效性指标 / 最佳聚类数

Key words

clustering / cluster validation / cluster validity index / optimal cluster number

引用本文

导出引用
周开乐 , 杨善林 , 丁帅 , 罗贺. 聚类有效性研究综述. 系统工程理论与实践, 2014, 34(9): 2417-2431 https://doi.org/10.12011/1000-6788(2014)9-2417
ZHOU Kai-le , YANG Shan-lin , DING Shuai , LUO He. On cluster validation. Systems Engineering - Theory & Practice, 2014, 34(9): 2417-2431 https://doi.org/10.12011/1000-6788(2014)9-2417
中图分类号: TP18   

参考文献

[1] Anderberg M R. Cluster analysis for application[M]. New York: Academic Press, 1973.
[2] Jain A K, Murty M N, Flynn P J. Data clustering: A review[J]. ACM Computing Survey, 1999, 31(3): 264-323.
[3] Xu R, Wunsch II D. Survey of clustering algorithms[J]. IEEE Transactions on Neural Networks, 2005, 16: 645-678.
[4] Omran M G H, Engelbrecht A P, Salman A. An overview of clustering methods[J]. Intelligent Data Analysis, 2007, 11(6): 583-605.
[5] Giancarlo R, Utro F. Algorithmic paradigms for stability-based cluster validity and model selection statistical methods, with applications to microarray data analysis[J]. Theoretical Computer Science, 2012, 428(4): 58-79.
[6] Bezdek J C. Cluster validity with fuzzy sets[J]. Journal of Cybernetics, 1974, 3(3): 58-74.
[7] Liang J Y, Zhao X W, Li D Y, et al. Determining the number of clusters using information entropy for mixed data[J]. Pattern Recognition, 2012, 45(6): 2251-2265.
[8] Pal N R, Biswas J. Cluster validation using graph theoretic concepts[J]. Pattern Recognition, 1997, 30(6): 847-857.
[9] Halkidi M, Batistakis Y, Vazirgiannis M. On clustering validation techniques[J]. Journal of Intelligent Information Systems, 2001, 17(2-3): 107-145.
[10] Rezaee M R, Lelieveldt B P F, Reiber J H C. A new cluster validity index for the fuzzy c-means[J]. Pattern Recognition Letters, 1998, 19(3-4): 237-246.
[11] Pal N R, Bezdek J C. On cluster validity for the fuzzy c-means model[J]. IEEE Transactions on Fuzzy Systems, 1995, 3(3): 370-379.
[12] Halkidi M, Batistakis Y, Vazirgiannis M. Cluster validity methods: Part I[J]. ACM SIGMOD Record, 2002, 31(2): 40-45.
[13] Halkidi M, Batistakis Y, Vazirgiannis M. Clustering validity checking methods: Part II[J]. ACM SIGMOD Record, 2002, 31(3): 19-27.
[14] Kim M, Ramakrishna R S. New indices for cluster validity assessment[J]. Pattern Recognition Letters, 2005, 26(15): 2353-2363.
[15] Pakhira M K, Bandyopadhyay S, Maulik U. A study of some fuzzy cluster validity indices, genetic clustering and application to pixel classification[J]. Fuzzy Sets and Systems, 2005, 155(2): 191-214.
[16] Wang W, Zhang Y. On fuzzy cluster validity indices[J]. Fuzzy Sets and Systems, 2007, 158(19): 2095-2117.
[17] Jegatha Deborah L, Baskaran R, Kannan A. A survey on internal validity measure for cluster validation[J]. International Journal of Computer Science & Engineering Survey, 2010, 1(2): 85-102.
[18] Gurrutxaga I, Muguerza J, Arbelaitz O, et al. Towards a standard methodology to evaluate internal cluster validity indices[J]. Pattern Recognition Letters, 2011, 32(3): 505-515.
[19] Xu R, Xu J, Wunsch D C. A comparison study of validity indices on swarm-intelligence-based clustering[J]. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 2012, 42(4): 1243-1256.
[20] Guerra L, Robles V, Bielza C, et al. A comparison of clustering quality indices using outliers and noise[J]. Intelligent Data Analysis, 2012, 16(4): 703-715.
[21] Arbelaitz O, Gurrutxaga I, Muguerza J, et al. An extensive comparative study of cluster validity indices[J]. Pattern Recognition, 2013, 46(1): 243-256.
[22] Howe D, Costanzo M, Fey P, et al. Big data: The future of biocuration[J]. Nature, 2008, 455: 47-50.
[23] Bezdek J C, Ehrlish R, Full W. FCM: The fuzzy c-means clustering algorithm[J]. Computers & Geosciences, 1984, 10(2-3): 191-203.
[24] Salem S A, Nandi A K. Development of assessment criteria for clustering algorithms[J]. Pattern Analysis and Applications, 2009, 12(1): 79-98.
[25] Ozkan I, Turksen I B. MiniMax epsilon-stable cluster validity index for Type-2 fuzziness[J]. Information Sciences, 2012, 184(1): 64-74.
[26] Havens T C, Bezdek J C, Palaniswami M. Cluster validity for kernel fuzzy clustering[C]// IEEE International Conference on Fuzzy Systems (FUZZ-IEEE): Brisbane, 2012: 1-8.
[27] Fukui K, Numao M. Neighborhood-based smoothing of external cluster validity measures[J]. Advances in Knowledge Discovery and Data Mining, 2012, 7301: 354-365.
[28] Bolshakova N, Azuaje F, Cunningham P. A knowledge-driven approach to cluster validity assessment[J]. Bioinformatics, 2005, 21(10): 2546-2547.
[29] Lange T, Roth V, Braun M L, et al. Stability based validation of clustering solutions[J]. Neural Computation, 2004, 16(6): 1299-1323.
[30] Falasconi M, Gutierrez A, Pardo M, et al. A stability based validity method for fuzzy clustering [J]. Pattern Recognition, 2010, 43(4): 1292-1305.
[31] Handl J, Knowles J, Kell D B. Computational cluster validation in post-genomic data analysis[J]. Bioinformatics, 2005, 21(15): 3201-3212.
[32] Popescu M, Keller J M, Bezdek J C, et al. Correlation cluster validity[C]// IEEE International Conference on Systems, Man, and Cybernetics, 2011: 2531-2536.
[33] Dunn J C. A fuzzy relative of the ISODATA process and its use in detecting compact well-separated clusters[J]. Journal of Cybernetics, 1973, 3: 32-57.
[34] Bezdek J C, Pal N R. Some new indexes of cluster validity[J]. IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, 1998, 28(3): 301-315.
[35] Hassar H, Bensaid A, Validation of fuzzy and crisp c-partitions[C]// 18th International Conference of the North American Fuzzy Information Processing Society, 1999: 342-346.
[36] Calinski R B, Harabasz J. A dendrite method for cluster analysis[J]. Communications in Statistics, 1974, 3(1): 1-27.
[37] Milligan G, Cooper M. An examination of procedures for determining the number of clusters in a data set[J]. Psychometrika, 1985, 50(2): 159-179.
[38] Baker F B, Hubert L J. Measuring the power of hierarchical cluster analysis[J]. Journal of the American Statistical, 1975, 70(349): 31-38.
[39] Hubert L, Schultz J. Quadratic assignment as a general data-analysis strategy[J]. British Journal of Mathematical and Statistical Psychology, 1976, 29(2): 190-241.
[40] Davies D L, Bouldin D W. A cluster separation measure[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1979, 2(PAMI-1): 224-227.
[41] Rousseeuw P. Silhouettes: A graphical aid to the interpretation and validation of cluster analysis[J]. Journal of Computational and Applied Mathematics, 1987, 20: 53-65.
[42] Campello R J G B, Hruschka E R. A fuzzy extension of silhouette width criterion for cluster analysis[J]. Fuzzy Sets and Systems, 2006, 157(21): 2858-2875.
[43] Bandyopadhyay S, Maulik U. Nonparametric genetic clustering: Comparison of validity indices[J]. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 2001, 31(1): 120-125.
[44] Chou H C, Su M C, Lai E. A new cluster validity measure and its application to image compression[J]. Pattern Analysis and Applications, 2004, 7(2): 205-220.
[45] Saitta S, Raphael B, Smith I. A bounded index for cluster validity[J]. Machine Learning and Data Mining in Pattern Recognition, Lecture Notes in Computer Science, 2007, 4571: 174-187.
[46] Gurrutxaga I, Albisua I, Arbelaitz O, et al. SEP/COP: An efficient method to find the best partition in hierarchical clustering based on a new cluster validity index[J]. Pattern Recognition, 2010, 43(10): 3364-3373.
[47] Zalik K R, Zalik B. Validity index for clusters of different sizes and densities[J]. Pattern Recognition Letters, 2011, 32(3): 221-234.
[48] Halkidi M, Vazirgiannis M. Clustering validity assessment: Finding the optimal partitioning of a data set[C]// First IEEE International Conference on Data Mining, California, 2001: 187-194.
[49] Lago-Fernández L F, Corbacho F. Normality-based validation for crisp clustering[J]. Pattern Recognition, 2010 43(3): 782-795.
[50] 杨善林, 李永森, 胡笑旋, 等. K-means算法中的k值优化问题研究[J]. 系统工程理论与实践, 2006, 26(2): 97-101.Yang Shanlin, Li Yongsen, Hu Xiaoxuan, et al. Optimization study on k value of K-means algorithm[J]. Systems Engineering——Theory & Practice, 2006, 26(2): 97-101.
[51] Bandyopadhyay S, Saha S. A point symmetry-based clustering technique for automatic evolution of clusters[J]. IEEE Transactions on Knowledge and Data Engineering, 2008, 20(11): 1441-1457.
[52] Saha S, Bandyopadhyay S. Performance evaluation of some symmetry-based cluster validity indexes[J]. IEEE Transactions on Systems, Man, and Cybernetics, Part C: Applications and Reviews, 2009, 39(4): 420-425.
[53] Bezdek J C. Numerical taxonomy with fuzzy sets[J]. Journal of Mathematical Biology, 1974, 7(1): 57-71.
[54] Roubens M. Pattern classification problems and fuzzy sets[J]. Fuzzy Sets and Systems, 1978, 1(4): 239-253.
[55] Dunn J C. Fuzzy automata and decision processes[M]. Elsevier, 1977.
[56] Kim Y I, Kim D W, Lee D, et al. A cluster validation index for GK cluster analysis based on relative degree of sharing[J]. Information Sciences, 2004, 168(1-4): 225-242.
[57] Chen M Y, Linkens D A. Rule-base self-generation and simplification for data-driven fuzzy models[J]. Fuzzy Sets and Systems, 2004, 142(2): 243-265.
[58] Kim D W, Lee K H, Lee D. On cluster validity index for estimation of the optimal number of fuzzy clusters[J]. Pattern Recognition, 2004, 37(10): 2009-2025.
[59] Zalik K R. Cluster validity index for estimation of fuzzy clusters of different sizes and densities[J]. Pattern Recognition, 2010, 43(10): 3374-3390.
[60] Fukuyama Y, Sugeno M. A new method of choosing the number of cluster for the fuzzy c-means method[C]// 5th Fuzzy Systems Symposium. Kobe, 1989: 247-250.
[61] Xie X L, Beni G. A validity measure for fuzzy clustering[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1991, 13(8): 841-847.
[62] Bensaid A M, Hall L O, Bezdek J C, et al. Validity-guided (re)clustering with applications to image segmentation[J]. IEEE Transactions on Fuzzy Systems, 1996, 4(2): 112-123.
[63] Kwon S H. Cluster validity index for fuzzy clustering[J]. Electronics Letters, 1998, 34(22): 2176-2177.
[64] Tang Y G, Sun F C, Sun Z Q. Improved validation index for fuzzy clustering[C]// American Control Conference, Portland, 2005: 1120-1125.
[65] Zahid N, Limouri M, Essaid A. A new cluster-validity for fuzzy clustering[J]. Pattern Recognition, 1999, 32(7): 1089-1097.
[66] Pakhira M K, Bandyopadhyay S, Maulik U. Validity index for crisp and fuzzy clusters[J]. Pattern Recognition, 2004, 37(3): 487-501.
[67] Tsekouras G E, Sarimveis H. A new approach for measuring the validity of the fuzzy c-means algorithm[J]. Advances in Engineering Software, 2004, 35(8-9): 567-575.
[68] Wu K L, Yang M S. A cluster validity index for fuzzy clustering[J]. Pattern Recognition Letters, 2005, 26(9): 1275-1291.
[69] Gath J, Geva A B. Unsupervised optimal fuzzy clustering[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1989, 11(7): 773-781.
[70] Wei W, Mendel J M. Optimality tests for the fuzzy c-means algorithm[J]. Pattern Recognition, 1994, 27(11): 1567-1573.
[71] Bouguessa M, Wang S R. A new efficient validity index for fuzzy clustering[C]// 3rd International Conference on Machine Learning and Cybernetics, Shanghai, 2004: 1914-1919.
[72] Xie Y, Raghavan V V, Dhatric P, et al. A new fuzzy clustering algorithm for optimally finding granular prototypes[J]. International Journal of Approximate Reasoning, 2005, 40(1-2): 109-124.
[73] Rand W M. Objective criteria for the evaluation of clustering methods[J]. Journal of the American Statistical Association, 1971, 66(336): 846-850.
[74] Batagelj V, Bren M. Comparing resemblance measures[J]. Journal of Classification, 1995, 12(1): 73-90.
[75] Campello R J G B. A fuzzy extension of the Rand index and other related indexes for clustering and classification assessment[J]. Pattern Recognition Letters, 2007, 28(7): 833-841.
[76] Steinley D. Properties of the Hubert-Arable adjusted rand index[J]. Psychological Methods, 2004, 9(3): 386-396.
[77] Jiang D, Tang C, Zhang A. Cluster analysis for gene-expression data: A survey[J]. IEEE Transactions on Knowledge and Data Engineering, 2004, 16(11): 1370-1386.
[78] Fowlkes E B, Mallows C L. A method for comparing two hierarchical clusterings[J]. Journal of the American Statistical Association, 1983, 78(383): 553-569.
[79] Jain A K, Dubes R C. Algorithms for clustering data[M]. Prentice Hall, 1988.
[80] Borgelt C, Kruse R. Finding the number of fuzzy clusters by resampling[C]// 2006 IEEE International Conference on Fuzzy Systems, Vancouver, BC, 2006: 48-54.
[81] Brouwer R K. Extending the rand, adjusted rand and jaccard indices to fuzzy partitions[J]. Journal of Intelligent Information Systems, 2009, 32(3): 213-235.
[82] Wu J J, Chen J, Xiong H, et al. External validation measures for K-means clustering: A data distribution perspective[J]. Expert Systems with Applications, 2009, 36(3): 6050-6061.
[83] Campello R J G B. Generalized external indexes for comparing data partitions with overlapping categories[J]. Pattern Recognition Letters, 2010, 31(9): 966-975.
[84] Rendón E, Abundez I, Arizmendi A, et al. Internal versus external cluster validation indexes[J]. International Journal of Computers and Communications, 2011, 5(1): 27-34.
[85] Rendón E, Abundez I, Gutierrez C, et al. A comparison of internal and external cluster validation indexes[C]// American Conference on Applied Mathematics and the 5th WSEAS International Conference on Computer Engineering and Applications, 2011: 158-163.
[86] Lange T, Roth V, Braun M L, et al. Stability-based validation of clustering solutions[J]. Neural Computation, 2004, 16(6): 1299-1323.
[87] Volkovich Z, Barzily Z, Morozensky L. A statistical model of cluster stability[J]. Pattern Recognition, 2008, 41(7): 2174-2188.
[88] Saha S, Bandyopadhyay S. Some connectivity based cluster validity indices[J]. Applied Soft Computing, 2012, 12(5): 1555-1565.
[89] Ahmadi A, Karray F, Kamel M S. Model order selection for multiple cooperative swarms clustering using stability analysis[J]. Information Sciences, 2012, 182(1): 169-183.
[90] Pascual D, Pla F, Sanchez J S. Cluster validation using information stability measures[J]. Pattern Recognition Letters, 2010, 31(6): 454-461.
[91] Popescu M, Bezdek J C, Keller J M, et al. A new cluster validity measure for bioinformatics relational datasets[C]// IEEE International Conference on Fuzzy Systems, 2008: 726-731.
[92] Bolshakova N, Azuaje F, Cunningham P. Incorporating biological domain knowledge into cluster validity assessment[J]. Applications of Evolutionary Computing, Lecture Notes in Computer Science, 2006, 3907: 13-22.
[93] Speer N, Spieth C, Zell A. Biological cluster validity indices based on the Gene Ontology[J]. Advances in Intelligent Data Analysis VI, Lecture Notes in Computer Science, 2005, 3646: 429-439.
[94] Asuncion A, Newman D J. UCI machine learning repository[EB/OL]. [2013-05-28]. http://archive.ics.uci.edu/ml/ index.html.
[95] Geva A B, Steinberg Y, Bruckmair S, et al. A comparison of cluster validity criteria for a mixture of normal distributed data[J]. Pattern Recognition Letters, 2000, 21(6-7): 511-529.
[96] Dimitriadou E, Dolňicar S, Weingessel A. An examination of indexes for determining the number of clusters in binary data sets[J]. Psychometrika, 2002, 67(1): 137-159.
[97] Maulik U, Bandyopadhyay S. Performance evaluation of some clustering algorithms and validity indices[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2002, 24(12): 1650-1654.
[98] Zhang Y J, Wang W N, Zhang X N, et al. A cluster validity index for fuzzy clustering[J]. Information Science, 2008, 178(4): 1205-1218.
[99] Pal N R, Bezdek J C. Correction to on cluster validity for the fuzzy c-means model[J]. IEEE Transactions on Fuzzy Systems, 1997, 5(1): 152-153.
[100] Kashef R, Kamel M S. Cooperative clustering[J]. Pattern Recognition, 2010, 43(6): 2315-2329.
[101] Hu X, Yoo I. Cluster ensemble and its applications in gene expression analysis[C]// 2nd Asia-Pacific Bioinformatics Conference, 2004: 297-302.
[102] Günter S, Bunke H. Validation indices for graph clustering[J]. Pattern Recognition Letters, 2003, 24(8): 1107-1113.
[103] Yu J, Cheng Q S, Huang H K. Analysis of the weighting exponent in the FCM[J]. IEEE Transactions on System, Man, and Cybernetics, Part B: Cybernetics, 2004, 34(1): 634-639.
[104] Wu K L. Analysis of parameter selections for fuzzy c-means[J]. Pattern Recognition, 2012, 45(1): 407-415.
[105] Zhou K L. Fuzziness parameter selection of fuzzy c-means algorithm used for load classification considering cluster validity[J]. Journal of Information & Computational Science, 2012, 9(17): 5181-5187.
[106] 杨善林, 罗贺, 丁帅. 基于云计算的多源信息服务系统研究综述[J]. 管理科学学报, 2012, 15(5): 83-96.Yang Shanlin, Luo He, Ding Shuai. Survey on multi-sources information service system based on cloud computing[J]. Journal of Management Sciences in China, 2012, 15(5): 83-96.
[107] 周开乐, 丁帅, 胡小建. 面向海量数据应用的物联网信息服务系统研究综述[J]. 计算机应用研究, 2012, 29(1): 8-11.Zhou Kaile, Ding Shuai, Hu Xiaojian. Survey of massive data applications oriented Internet of things information service system[J]. Application Research of Computers, 2012, 29(1): 8-11.
[108] Chen H, Chiang R H, Storey V C. Business intelligence and analytics: From big data to big impact[J]. MIS Quarterly, 2012, 36(4): 1165-1188.
[109] Yousri N A, Kamel M S, Ismail M A. A distance-relatedness dynamic model for clustering high dimensional data of arbitrary shapes and densities[J]. Pattern Recognition, 2009, 42(7): 1193-1209.
[110] Pei T, Jasra A, Hand D J, et al. DECODE: A new method for discovering clusters of different densities in spatial data[J]. Data Mining and Knowledge Discovery, 2009, 18(3): 337-369.
[111] 周开乐, 杨善林. 一种考虑数据类大小和密度差异的模糊聚类有效性指标[J]. 情报学报, 2013, 32(3): 306-313.Zhou Kaile, Yang Shanlin. A fuzzy cluster validity index in consideration of different size and density of data set[J]. Journal of the China Society for Scientific and Technical Information, 2013, 32(3): 306-313.
[112] Chen G, Guo X Y, Hu T. A noise insensitive cluster validity measure for pattern classification[C]// 8th IEEE/ACIS International Conference on Computer and Information Science, 2009: 574-578.
[113] Wu J J, Yuan H, Xiong H, et al. Validation of overlapping clustering: A random clustering perspective[J]. Information Sciences, 2010, 180(22): 4353-4369.
[114] Bouguessa M, Wang S, Sun H. An objective approach to cluster validation[J]. Pattern Recognition Letters, 2006, 27(13): 1419-1430. }

基金

国家高技术研究发展计划(863计划)(2011AA05A116);国家自然科学基金(71131002,71071045,71201042)
PDF(826 KB)

602

Accesses

0

Citation

Detail

段落导航
相关文章

/