Data Mining
Cluster Analysis: Basic Concepts and Algorithms
Lecture Notes for Chapter 8
Introduction to Data Mining
by Tan, Steinbach, Kumar

What is Cluster Analysis?
- Finding groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups
  - Inter-cluster distances are maximized
  - Intra-cluster distances are minimized (see the sketch below)
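To make this objective concrete, here is a minimal Python sketch (not from the slides; the helper names intra_distance and inter_distance are illustrative) that measures cohesion within a cluster and separation between two clusters:

```python
# Illustrative sketch of the two quantities a clustering trades off.
import numpy as np

def intra_distance(points):
    """Average pairwise distance within one cluster (cohesion)."""
    n = len(points)
    dists = [np.linalg.norm(points[i] - points[j])
             for i in range(n) for j in range(i + 1, n)]
    return np.mean(dists)

def inter_distance(a, b):
    """Distance between the centroids of two clusters (separation)."""
    return np.linalg.norm(a.mean(axis=0) - b.mean(axis=0))

rng = np.random.default_rng(0)
c1 = rng.normal(loc=0.0, scale=0.3, size=(20, 2))  # tight cluster near the origin
c2 = rng.normal(loc=3.0, scale=0.3, size=(20, 2))  # tight cluster far away
print(intra_distance(c1), inter_distance(c1, c2))  # small vs. large
```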
Applications of Cluster Analysis
- Understanding
  - Group related documents for browsing, group genes and proteins that have similar functionality, or group stocks with similar price fluctuations
- Summarization
  - Reduce the size of large data sets
[Figure: clustering of precipitation in Australia]

What is not Cluster Analysis?
- Supervised classification
  - Has class label information
- Simple segmentation
  - Dividing students into different registration groups alphabetically, by last name
- Results of a query
  - Groupings are the result of an external specification
- Graph partitioning
  - Some mutual relevance and synergy, but the areas are not identical

Notion of a Cluster can be Ambiguous
- How many clusters?
[Figure: the same set of points interpreted as two, four, or six clusters]
Types of Clusterings
- A clustering is a set of clusters
- Important distinction between hierarchical and partitional sets of clusters
- Partitional clustering
  - A division of data objects into non-overlapping subsets (clusters) such that each data object is in exactly one subset
- Hierarchical clustering
  - A set of nested clusters organized as a hierarchical tree

Partitional Clustering
[Figure: original points and a partitional clustering of them]

Hierarchical Clustering
[Figure: traditional and non-traditional hierarchical clusterings of points p1-p4, with the corresponding traditional and non-traditional dendrograms]

Other Distinctions Between Sets of Clusters
- Exclusive versus non-exclusive
  - In non-exclusive clusterings, points may belong to multiple clusters
  - Can represent multiple classes or border points
- Fuzzy versus non-fuzzy
  - In fuzzy clustering, a point belongs to every cluster with some weight between 0 and 1
  - Weights must sum to 1
  - Probabilistic clustering has similar characteristics
- Partial versus complete
  - In some cases, we only want to cluster some of the data
- Heterogeneous versus homogeneous
  - Clusters of widely different sizes, shapes, and densities
Types of Clusters
- Well-separated clusters
- Center-based clusters
- Contiguous clusters
- Density-based clusters
- Property or conceptual
- Described by an objective function

Types of Clusters: Well-Separated
- Well-separated clusters: a cluster is a set of points such that any point in a cluster is closer (or more similar) to every other point in the cluster than to any point not in the cluster
[Figure: 3 well-separated clusters]

Types of Clusters: Center-Based
- Center-based: a cluster is a set of objects such that an object in a cluster is closer (more similar) to the "center" of its cluster than to the center of any other cluster
- The center of a cluster is often a centroid, the average of all the points in the cluster, or a medoid, the most "representative" point of a cluster
[Figure: 4 center-based clusters]

Types of Clusters: Contiguity-Based
- Contiguous cluster (nearest neighbor or transitive): a cluster is a set of points such that a point in a cluster is closer (or more similar) to one or more other points in the cluster than to any point not in the cluster
[Figure: 8 contiguous clusters]

Types of Clusters: Density-Based
- Density-based: a cluster is a dense region of points, separated from other regions of high density by low-density regions
- Used when the clusters are irregular or intertwined, and when noise and outliers are present
[Figure: 6 density-based clusters]
Types of Clusters: Conceptual Clusters
- Shared property or conceptual clusters: finds clusters that share some common property or represent a particular concept
[Figure: 2 overlapping circles]

Types of Clusters: Objective Function
- Clusters defined by an objective function
  - Finds clusters that minimize or maximize an objective function
  - Enumerate all possible ways of dividing the points into clusters and evaluate the goodness of each potential set of clusters using the given objective function (NP-hard)
  - Can have global or local objectives
    - Hierarchical clustering algorithms typically have local objectives
    - Partitional algorithms typically have global objectives
  - A variation of the global objective function approach is to fit the data to a parameterized model
    - Parameters for the model are determined from the data
    - Mixture models assume that the data is a mixture of a number of statistical distributions (see the sketch below)
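As an aside on the mixture-model idea, a short sketch of fitting one, assuming scikit-learn is available (the data and parameters are illustrative):

```python
# Illustrative sketch: a mixture model as clustering with a global,
# parameterized objective (the data likelihood). scikit-learn is assumed.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic data drawn from two Gaussian distributions
data = np.vstack([rng.normal(0.0, 0.5, size=(100, 2)),
                  rng.normal(4.0, 0.5, size=(100, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(data)
labels = gmm.predict(data)         # hard cluster assignments
weights = gmm.predict_proba(data)  # soft memberships; each row sums to 1
print(gmm.means_)                  # estimated distribution parameters
```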
Types of Clusters: Objective Function
- Map the clustering problem to a different domain and solve a related problem in that domain
  - The proximity matrix defines a weighted graph, where the nodes are the points being clustered and the weighted edges represent the proximities between points
  - Clustering is then equivalent to breaking the graph into connected components, one for each cluster
  - Want to minimize the edge weight between clusters and maximize the edge weight within clusters (see the sketch below)
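A small sketch of this graph view (illustrative; assumes scipy and a hypothetical distance threshold of 1.0): threshold the proximity matrix into a graph and read clusters off as connected components:

```python
# Illustrative sketch: clustering as graph connected components.
# An edge connects two points whose distance is below a chosen threshold.
import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components
from scipy.spatial.distance import cdist

points = np.array([[0, 0], [0.1, 0.2], [5, 5], [5.1, 4.9]])
dist = cdist(points, points)        # pairwise distance (proximity) matrix
adjacency = csr_matrix(dist < 1.0)  # edge where points are within 1.0
n_clusters, labels = connected_components(adjacency, directed=False)
print(n_clusters, labels)           # 2 components: [0 0 1 1]
```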
Characteristics of the Input Data Are Important
- Type of proximity or density measure
  - This is a derived measure, but central to clustering
- Sparseness
  - Dictates type of similarity
  - Adds to efficiency
- Attribute type
  - Dictates type of similarity
- Type of data
  - Dictates type of similarity
  - Other characteristics, e.g., autocorrelation
- Dimensionality
- Noise and outliers
- Type of distribution

Clustering Algorithms
- K-means and its variants
- Hierarchical clustering
- Density-based clustering

K-means Clustering
- Partitional clustering approach
- Each cluster is associated with a centroid (center point)
- Each point is assigned to the cluster with the closest centroid
- The number of clusters, K, must be specified
- The basic algorithm is very simple:
  1. Select K points as the initial centroids
  2. Form K clusters by assigning each point to its closest centroid
  3. Recompute the centroid of each cluster
  4. Repeat steps 2-3 until the centroids do not change
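A compact NumPy sketch of this basic algorithm (illustrative, not the book's reference implementation; it assumes Euclidean distance, random initial centroids, and that no cluster becomes empty):

```python
import numpy as np

def kmeans(points, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # Step 1: pick K of the points as initial centroids
    centroids = points[rng.choice(len(points), size=k, replace=False)]
    for _ in range(n_iter):
        # Step 2: assign each point to its closest centroid
        dists = np.linalg.norm(points[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 3: recompute each centroid as the mean of its points
        # (empty clusters are not handled here; see the later slide)
        new_centroids = np.array([points[labels == j].mean(axis=0)
                                  for j in range(k)])
        if np.allclose(new_centroids, centroids):  # Step 4: stop when stable
            break
        centroids = new_centroids
    return labels, centroids
```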
K-means Clustering Details
- Initial centroids are often chosen randomly
  - Clusters produced vary from one run to another
- The centroid is (typically) the mean of the points in the cluster
- "Closeness" is measured by Euclidean distance, cosine similarity, correlation, etc.
- K-means will converge for the common similarity measures mentioned above
- Most of the convergence happens in the first few iterations
  - Often the stopping condition is changed to "until relatively few points change clusters"
- Complexity is O(n * K * I * d)
  - n = number of points, K = number of clusters, I = number of iterations, d = number of attributes

Two different K-means Clusterings
[Figure: the same original points clustered two ways in the x-y plane: an optimal clustering and a sub-optimal clustering]

Importance of Choosing Initial Centroids
[Figure: x-y scatter plots for iterations 1 through 6 of one K-means run]

Importance of Choosing Initial Centroids
[Figure: the same run, one panel per iteration]
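The run-to-run variation can be demonstrated directly; a hedged sketch assuming scikit-learn, where n_init=1 forces a single random initialization per seed (inertia_ is scikit-learn's name for the SSE defined on the next slide):

```python
# Illustrative sketch: different random initializations, different results.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

data, _ = make_blobs(n_samples=300, centers=5, random_state=42)
for seed in range(3):
    km = KMeans(n_clusters=5, init="random", n_init=1,
                random_state=seed).fit(data)
    print(f"seed={seed}  SSE={km.inertia_:.1f}")  # SSE varies across seeds
```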
Evaluating K-means Clusters
- The most common measure is the Sum of Squared Error (SSE)
  - For each point, the error is the distance to the nearest cluster center
  - To get SSE, we square these errors and sum them:

    SSE = \sum_{i=1}^{K} \sum_{x \in C_i} \mathrm{dist}^2(m_i, x)

  - x is a data point in cluster C_i and m_i is the representative point for cluster C_i
    - It can be shown that m_i corresponds to the center (mean) of the cluster
  - Given two clusterings, we can choose the one with the smallest error
  - One easy way to reduce SSE is to increase K, the number of clusters
    - However, a good clustering with smaller K can have a lower SSE than a poor clustering with higher K
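Computed directly, SSE is a few lines; an illustrative sketch compatible with the labels and centroids produced by the kmeans() sketch above:

```python
# Illustrative sketch: SSE = sum over clusters of squared distances of
# each point to its own cluster centroid.
import numpy as np

def sse(points, labels, centroids):
    diffs = points - centroids[labels]  # each point minus its own centroid
    return float(np.sum(diffs ** 2))    # sum of squared distances
```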
Importance of Choosing Initial Centroids
[Figure: x-y scatter plots for iterations 1 through 5 of another K-means run]

Importance of Choosing Initial Centroids
[Figure: the same run, one panel per iteration]

Problems with Selecting Initial Points
- If there are K "real" clusters, then the chance of selecting one centroid from each cluster is small
  - The chance is relatively small when K is large
  - If the clusters are the same size, n, then

    P = (number of ways to select one centroid from each cluster) / (number of ways to select K centroids) = K! n^K / (Kn)^K = K!/K^K

  - For example, if K = 10, then probability = 10!/10^10 = 0.00036
  - Sometimes the initial centroids will readjust themselves in the "right" way, and sometimes they don't
  - Consider the following example of five pairs of clusters
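A quick check of this arithmetic (illustrative):

```python
# Probability of drawing one initial centroid from each of K equal-size
# clusters, using the K!/K^K approximation from the slide.
from math import factorial

K = 10
print(factorial(K) / K**K)  # 0.00036288
```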
10 Clusters Example
[Figure: iterations 1-4, starting with two initial centroids in one cluster of each pair of clusters]

10 Clusters Example
[Figure: the same run, one panel per iteration]

10 Clusters Example
[Figure: iterations 1-4, starting with some pairs of clusters having three initial centroids, while others have only one]

10 Clusters Example
[Figure: the same run, one panel per iteration]
Solutions to Initial Centroids Problem
- Multiple runs
  - Helps, but probability is not on your side
- Sample and use hierarchical clustering to determine initial centroids
- Select more than K initial centroids and then select among these initial centroids
  - Select the most widely separated
- Postprocessing
- Bisecting K-means
  - Not as susceptible to initialization issues

Handling Empty Clusters
- The basic K-means algorithm can yield empty clusters
- Several strategies for choosing a replacement centroid (see the sketch below):
  - Choose the point that contributes most to SSE
  - Choose a point from the cluster with the highest SSE
  - If there are several empty clusters, the above can be repeated several times
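A sketch of the first strategy (illustrative names, reusing the conventions of the earlier kmeans() sketch): reseed an empty cluster with the point that contributes most to SSE:

```python
# Illustrative sketch: reseed an empty cluster with the point farthest
# from its assigned centroid, i.e., the largest SSE contributor.
import numpy as np

def fix_empty_cluster(points, labels, centroids, empty_j):
    dists = np.linalg.norm(points - centroids[labels], axis=1)
    worst = int(np.argmax(dists))       # biggest SSE contributor
    centroids[empty_j] = points[worst]  # becomes the new centroid
    labels[worst] = empty_j
    return labels, centroids
```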
Updating Centers Incrementally
- In the basic K-means algorithm, centroids are updated after all points are assigned to a centroid
- An alternative is to update the centroids after each assignment (incremental approach)
  - Each assignment updates zero or two centroids
  - More expensive
  - Introduces an order dependency
  - Never produces an empty cluster
  - Can use "weights" to change the impact (see the sketch below)
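An illustrative sketch of the incremental update: when a point moves between clusters, only the two affected centroids are recomputed, in O(d) time, from running counts (this sketch assumes the source cluster keeps at least one point):

```python
# Illustrative sketch: incremental centroid maintenance. When point x
# moves from cluster a to cluster b, exactly two centroids change
# (zero if a == b).
import numpy as np

def move_point(x, a, b, centroids, counts):
    if a == b:
        return  # zero centroids updated
    centroids[a] = (centroids[a] * counts[a] - x) / (counts[a] - 1)
    counts[a] -= 1
    centroids[b] = (centroids[b] * counts[b] + x) / (counts[b] + 1)
    counts[b] += 1
```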
Pre-processing and Post-processing
- Pre-processing
  - Normalize the data
  - Eliminate outliers
- Post-processing
  - Eliminate small clusters that may represent outliers
  - Split "loose" clusters, i.e., clusters with relatively high SSE
  - Merge clusters that are "close" and that have relatively low SSE
  - These steps can also be used during the clustering process
    - ISODATA

Bisecting K-means
- Bisecting K-means algorithm
  - A variant of K-means that can produce a partitional or a hierarchical clustering: repeatedly pick a cluster and split it into two with K-means (see the sketch below)
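A hedged sketch of the algorithm: always split the cluster with the largest SSE using 2-means until K clusters remain. It assumes scikit-learn; recent scikit-learn versions also ship a ready-made sklearn.cluster.BisectingKMeans:

```python
# Illustrative sketch of bisecting K-means: start with one cluster and
# repeatedly bisect the cluster with the largest SSE.
import numpy as np
from sklearn.cluster import KMeans

def bisecting_kmeans(points, k, seed=0):
    clusters = [points]
    while len(clusters) < k:
        # Pick the cluster with the largest SSE to split next
        sses = [np.sum((c - c.mean(axis=0)) ** 2) for c in clusters]
        target = clusters.pop(int(np.argmax(sses)))
        km = KMeans(n_clusters=2, n_init=10, random_state=seed).fit(target)
        clusters += [target[km.labels_ == 0], target[km.labels_ == 1]]
    return clusters
```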
Bisecting K-means Example
[Figure: a sequence of bisections producing a hierarchical clustering]

Limitations of K-means
- K-means has problems when clusters are of differing
  - Sizes
  - Densities
  - Non-globular shapes
- K-means has problems when the data contains outliers

Limitations of K-means: Differing Sizes
[Figure: original points versus K-means (3 clusters)]

Limitations of K-means: Differing Density
[Figure: original points versus K-means (3 clusters)]

Limitations of K-means: Non-globular Shapes
[Figure: original points versus K-means (2 clusters)]

Overcoming K-means Limitations
- One solution is to use many clusters: K-means then finds pieces of the natural clusters, which must be put back together afterwards
[Figure: original points versus K-means clusters]

Overcoming K-means Limitations
[Figure: original points versus K-means clusters]

Overcoming K-means Limitations
[Figure: original points versus K-means clusters]

Hierarchical Clustering
- Produces a set of nested clusters organized as a hierarchical tree
- Can be visualized as a dendrogram
  - A tree-like diagram that records the sequences of merges or splits
[Figure: a nested clustering of six points and the corresponding dendrogram, with merge heights from 0.05 to 0.2]
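To make the dendrogram concrete, a minimal sketch assuming scipy and matplotlib are available; linkage() records the sequence of merges that dendrogram() then draws:

```python
# Illustrative sketch: build and plot a dendrogram for six 2-D points.
import numpy as np
from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt

points = np.array([[0.0, 0.0], [0.1, 0.1], [0.9, 1.0],
                   [1.0, 1.1], [0.5, 0.5], [2.0, 2.0]])
merges = linkage(points, method="single")  # records the sequence of merges
dendrogram(merges, labels=[1, 2, 3, 4, 5, 6])
plt.show()
```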
Strengths of Hierarchical Clustering