CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling (lecture slides, .ppt, 33 pages)

CHAMELEON: A Hierarchical Clustering Algorithm Using Dynamic Modeling
Paper presentation in data mining class
Presenters: 許明壽, 蘇建仲
Date: 2001/12/18

About this paper
- Department of Computer Science and Engineering, University of Minnesota
- George Karypis, Eui-Hong (Sam) Han, Vipin Kumar
- IEEE Computer, August 1999

Outline
- Problem definition
- Main algorithm
- Key features of CHAMELEON
- Experiments and related work
- Conclusion and discussion

Problem definition
- Clustering: intracluster similarity is maximized, intercluster similarity is minimized
- Problems of existing clustering algorithms:
  - Static model constraints
  - Break down when clusters are of diverse shapes, densities, and sizes
  - Susceptible to noise, outliers, and artifacts

Static model constraints
- Data space constraint
  - K-means, PAM, etc.: suitable only for data in metric spaces
- Cluster shape constraint
  - K-means, PAM, CLARANS: assume clusters are ellipsoidal or globular and of similar sizes
- Cluster density constraint
  - DBSCAN: points within a genuine cluster are density-reachable and points across different clusters are not
- Similarity-determination constraint
  - CURE, ROCK: use a static model to determine the most similar cluster to merge

Partitioning techniques' problems (figure)
- (a) clusters of widely different sizes; (b) clusters with convex shapes

Hierarchical techniques' problems (figures)
- (c) and (d) will be chosen for merging when only closeness is considered
- (a) and (c) will be chosen for merging when only inter-connectivity is considered

Main algorithm: a two-phase algorithm
- Phase I: use a graph-partitioning algorithm to cluster the data items into a large number of relatively small sub-clusters
- Phase II: use an agglomerative hierarchical clustering algorithm to find the genuine clusters by repeatedly merging these sub-clusters

Framework: Data set -> construct sparse graph -> partition the graph -> merge partitions -> final clusters

Key features of CHAMELEON
- Modeling the data
- Modeling the cluster similarity
- Partition algorithm
- Merge schemes

Terms (arguments needed)
- k: for the k-nearest-neighbor graph
- MINSIZE: the minimum size of an initial sub-cluster
- T_RI: threshold on relative inter-connectivity
- T_RC: threshold on relative closeness
- alpha: coefficient weighting RI against RC

Modeling the data: the k-nearest-neighbor graph G_k
- Advantages:
  - Data points that are far apart are completely disconnected in G_k
  - G_k captures the concept of neighborhood dynamically
  - Edge weights in dense regions of G_k tend to be large; edge weights in sparse regions tend to be small

Example of a k-nearest-neighbor graph (figure)

Modeling the cluster similarity
- Relative interconnectivity:
  RI(Ci, Cj) = |EC_{Ci,Cj}| / ((|EC_{Ci}| + |EC_{Cj}|) / 2)
  where EC_{Ci,Cj} is the edge cut between Ci and Cj, and EC_{Ci} is the min-cut bisector of Ci.
- Relative closeness:
  RC(Ci, Cj) = S_EC(Ci,Cj) / ((|Ci|/(|Ci|+|Cj|)) S_EC(Ci) + (|Cj|/(|Ci|+|Cj|)) S_EC(Cj))
  where S_EC denotes the average weight of the corresponding cut edges.
- If relative similarity is considered, (c) and (d) will be merged.

Partition algorithm (Phase I)
- What: find the initial sub-clusters
- Why: RI and RC cannot be accurately calculated for clusters containing only a few data points
- How: utilize a multilevel graph-partitioning algorithm (hMETIS), with coarsening, partitioning, and uncoarsening phases
- Procedure:
  - Initially, all points belong to the same cluster
  - Repeat until no cluster is larger than MINSIZE: select the largest cluster and bisect it with hMETIS
  - Balance constraint: split Ci into CiA and CiB such that each sub-cluster contains at least 25% of the nodes of Ci

Merge schemes (Phase II)
- What: merge sub-clusters using a dynamic framework
- How: find and merge the pair of sub-clusters that are the most similar
- Scheme 1: merge the pair maximizing RI(Ci, Cj) * RC(Ci, Cj)^alpha
- Scheme 2: merge pairs with RI(Ci, Cj) >= T_RI and RC(Ci, Cj) >= T_RC

Experiments and related work
- Introduction of CURE
- Introduction of DBSCAN
- Experimental results
- Performance analysis

Introduction of CURE (Clustering Using REpresentatives)
1. Properties:
- Fits non-spherical shapes
- Shrinking the representative points helps dampen the effect of outliers
- Multiple representative points are chosen for non-spherical clusters
- In each iteration, the representative points, chosen from scattered points, are shrunk by a ratio tied to the merge procedure
- Random sampling of the data set makes it fit for large databases
2. Drawbacks:
- The partitioning method cannot guarantee that the chosen data points are good
- Clustering accuracy depends on the parameters:
  (1) shrink factor s: CURE always finds the right clusters for s in the range 0.2 to 0.7
  (2) number of representative points c: CURE always finds the right clusters for c greater than 10
  (3) number of partitions p: with as many as 50 partitions, CURE always discovers the desired clusters
  (4) random sample size r: for sample sizes up to 2000 the clusters found are of poor quality; from 2500 sample points and above (about 2.5% of the data set size), CURE always finds the correct clusters
3. Clustering algorithm: representative points and merge procedure (figures)

Introduction of DBSCAN (Density-Based Spatial Clustering of Applications with Noise)
1. Properties:
- Can discover clusters of arbitrary shape
- Each cluster has a typical density of points that is higher than outside the cluster; the density within areas of noise is lower than the density in any cluster
- Only the parameter MinPts needs to be input
- Easy to implement in C++ using an R*-tree
- Runtime scales with the number of points; time complexity is O(n log n)
2. Drawbacks:
- Cannot be applied to polygons
- Cannot be applied to high-dimensional feature spaces
- Cannot process the shape of the k-dist graph with multiple features
- Not fit for large databases, since no method is applied to reduce the spatial database
3. Definitions:
- Eps-neighborhood of a point p: N_Eps(p) = { q in D | dist(p, q) <= Eps }
- Core point condition: |N_Eps(p)| >= MinPts
4. Directly density-reachable: p is directly density-reachable from q if p is in N_Eps(q) and q is a core point. The relation is symmetric when p and q are both core points, and asymmetric when one is a core point and the other a border point.
5. Density-reachable: p is density-reachable from q if there is a chain of points between p and q, each directly density-reachable from the previous one. Density-reachability is transitive but not symmetric in general; it is symmetric for core points.
6. Density-connected: a point p is density-connected to a point q if there is a point s such that both p and q are density-reachable from s.
- Density-connectedness is a symmetric and reflexive relation
- A cluster is defined as a maximal set of density-connected points (maximal with respect to density-reachability)
- Noise is the set of points not belonging to any cluster
7. How to find a cluster C?
- Maximality: for all p, q: if p is in C and q is density-reachable from p, then q is in C
- Connectivity: for all p, q in C: p is density-connected to q
8. How to find noise?
- For any p: if p does not belong to any cluster, then p is a noise point

Results of experiment (figures)

Performance analysis
- Constructing the k-nearest-neighbor graph:
  - Low-dimensional data sets, using k-d trees: overall complexity O(n log n)
  - High-dimensional data sets, where k-d trees are not applicable: overall complexity O(n^2)
- Finding the initial sub-clusters:
  - Obtaining m clusters by repeatedly partitioning successively smaller graphs has overall computational complexity O(n log(n/m)), which is bounded by O(n log n)
  - A faster partitioning algorithm obtains the initial m clusters in time O(n + m log m) using a multilevel m-way partitioning algorithm
- Merging sub-clusters using a dynamic framework:
  - Computing the internal inter-connectivity and internal closeness for each initial cluster takes O(nm)
  - Finding the most similar pair of clusters to merge takes O(m^2 log m) using a heap-based priority queue
- Overall complexity of CHAMELEON: O(n log n + nm + m^2 log m)

Conclusion and discussion
- A dynamic model with relative interconnectivity and relative closeness
- The paper ignores the issue of scaling to large data sets
- Other graph representation methodologies?
- Other partition algorithms?
