Chapter 19  Clustering Analysis

Contents: similarity coefficients; hierarchical clustering analysis; dynamic clustering analysis; ordered sample clustering analysis.

Discriminant analysis: given individuals known with certainty to come from two or more populations, it is the method of building a discriminant model that allocates further individuals to the correct population. Clustering analysis: a statistical method for grouping objects of any kind into categories. It is used when there is no a priori hypothesis, the aim being to find the most appropriate classification by means of mathematical statistics and the information collected. It has become a method of first choice for mining the huge volume of genetic information now available. Both are multivariate statistical methods for studying classification; clustering analysis is an exploratory method of statistical analysis.

Clustering can be divided into two major types according to its aim. Let m be the number of variables (i.e., indexes) and n the number of cases (i.e., samples); then:
(1) R-type clustering, also called index clustering: sorting the m indexes, with the aim of lowering the dimension of the index set and choosing typical indexes.
(2) Q-type clustering, also called sample clustering: sorting the n samples to find what they have in common.

For both R-type and Q-type clustering the most important issue is the definition of similarity, that is, how to quantify similarity. The first step of clustering is therefore to define a quantitative measure of similarity between two indexes or two samples: the similarity coefficient.
1 Similarity coefficient

1. Similarity coefficient of R-type clustering. Suppose there are m variables X1, X2, ..., Xm. R-type clustering usually uses the absolute value of the simple correlation coefficient to define the similarity coefficient between two variables:

r_{ij} = \frac{\sum (X_i - \bar X_i)(X_j - \bar X_j)}{\sqrt{\sum (X_i - \bar X_i)^2 \sum (X_j - \bar X_j)^2}}    (19-1)

The two variables are the more similar the larger this absolute value is. Similarly, the Spearman rank correlation coefficient can be used to define the similarity coefficient of non-normal variables, and when the variables are all qualitative it is best to use the contingency coefficient.

2. Similarity coefficients commonly used in Q-type clustering. Regard the n cases as n points in an m-dimensional space; the distance between two points can then be used to define the similarity coefficient, and two samples are the more similar the smaller the distance between them.
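As an illustration of formula (19-1), the sketch below computes the pairwise similarity coefficients of a small data matrix in Python; the data values are hypothetical (rows as cases, columns as indexes), and the Spearman variant is included for the non-normal case.

```python
# A minimal sketch of R-type similarity coefficients (formula 19-1).
# The data matrix is hypothetical: 5 cases (rows) x 4 indexes (columns).
import numpy as np
from scipy import stats

X = np.array([[170.2, 90.1, 68.3, 82.4],
              [165.8, 86.9, 70.1, 84.0],
              [172.5, 92.3, 66.0, 80.9],
              [158.7, 81.5, 73.2, 86.1],
              [167.9, 88.0, 69.5, 83.2]])

# Pearson correlations between columns; |r| serves as the similarity coefficient.
r = np.corrcoef(X, rowvar=False)
print(np.abs(r).round(3))

# Spearman rank correlation, an option for non-normal variables.
rho, _ = stats.spearmanr(X)
print(np.abs(rho).round(3))
```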
(1) Euclidean distance:

d_{ij} = \sqrt{\sum_{k=1}^{m} (X_{ik} - X_{jk})^2}    (19-3)

(2) Manhattan (absolute) distance:

d_{ij} = \sum_{k=1}^{m} |X_{ik} - X_{jk}|    (19-4)

(3) Minkowski distance:

d_{ij} = \Big( \sum_{k=1}^{m} |X_{ik} - X_{jk}|^q \Big)^{1/q}    (19-5)

The absolute distance is the Minkowski distance with q = 1, and the Euclidean distance is the Minkowski distance with q = 2. The Euclidean distance is intuitive and simple to compute, but it takes no account of the correlations among the variables; that is why the Mahalanobis distance was introduced.

(4) Mahalanobis distance. Let S be the sample covariance matrix of the m variables. The distance is worked out as follows:

d_{ij}^2 = X' S^{-1} X    (19-6)

where X = (X_{i1} - X_{j1}, X_{i2} - X_{j2}, ..., X_{im} - X_{jm})'. When S is the identity matrix, the squared Mahalanobis distance equals the squared Euclidean distance.

All four distances are meant for quantitative variables; qualitative and ordinal variables must be quantified before these distances are used.
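The four distances are easy to write out directly; in the sketch below the two points reuse the standardized values of items G1 and G2 from Example 19-2 later in this chapter, so the Euclidean result should be about 1.289.

```python
# A minimal sketch of the distances (19-3)-(19-6) between two samples.
import numpy as np

def euclidean(x, y):
    return np.sqrt(np.sum((x - y) ** 2))            # (19-3)

def manhattan(x, y):
    return np.sum(np.abs(x - y))                    # (19-4)

def minkowski(x, y, q):
    return np.sum(np.abs(x - y) ** q) ** (1.0 / q)  # (19-5)

def mahalanobis_sq(x, y, S):
    d = x - y                                       # the difference vector X
    return d @ np.linalg.inv(S) @ d                 # (19-6), squared distance

x = np.array([1.315, 0.688])   # G1 of Example 19-2, standardized
y = np.array([0.174, 0.088])   # G2 of Example 19-2, standardized
print(euclidean(x, y))                    # ~1.289
print(minkowski(x, y, q=2))               # identical to the Euclidean distance
print(manhattan(x, y))                    # Minkowski with q = 1
print(mahalanobis_sq(x, y, np.eye(2)))    # identity S: squared Euclidean
```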
2 Hierarchical Clustering Analysis

Hierarchical clustering analysis is the most commonly used method of sorting similar samples or variables into clusters. The process is as follows:
1) At the beginning each sample (or variable) is regarded as a single cluster, that is, each cluster contains only one sample (or variable). Work out the similarity coefficient matrix among the clusters; it is made up of the similarity coefficients between samples (or variables) and is a symmetric matrix.
2) Merge the two clusters with the maximum similarity coefficient (minimum distance or maximum correlation coefficient) into a new cluster, then compute the similarity coefficients between the new cluster and each of the other clusters. Repeat step 2) until all of the samples (or variables) have been merged into one cluster.
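To make the loop concrete, here is a from-scratch sketch of the procedure; it is a didactic illustration only, using Euclidean distance with the minimum-distance (maximum-similarity) merging rule, run on the standardized data of Example 19-2 from later in this chapter.

```python
# A from-scratch sketch of agglomerative clustering: start with singleton
# clusters, repeatedly merge the closest pair, record the merge order.
import numpy as np

def hierarchical(points):
    clusters = {i: [i] for i in range(len(points))}   # singleton clusters
    merges = []
    while len(clusters) > 1:
        best = None
        for a in clusters:                 # find the closest pair of clusters
            for b in clusters:
                if a < b:
                    d = min(np.linalg.norm(points[i] - points[j])
                            for i in clusters[a] for j in clusters[b])
                    if best is None or d < best[0]:
                        best = (d, a, b)
        d, a, b = best
        new_id = max(clusters) + 1         # merge the pair into a new cluster
        clusters[new_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, round(d, 3)))
    return merges

pts = np.array([[1.315, 0.688], [0.174, 0.088],
                [-1.001, -1.441], [-0.488, 0.665]])
print(hierarchical(pts))   # first merge: clusters 1 and 3, distance 0.878
```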
The calculation of similarity coefficients between clusters

Each step of hierarchical clustering has to calculate the similarity coefficients among clusters. When each of two clusters contains only one sample or variable, the similarity coefficient between them is simply that of the two samples or the two variables, computed as in section 1. When a cluster contains more than one sample or variable, many methods are available for computing the between-cluster similarity coefficient; five of them are listed below. Let G_p and G_q denote two clusters containing n_p and n_q samples (or variables) respectively.

1. The maximum similarity coefficient method. If there are n_p and n_q samples (or variables) in clusters G_p and G_q, there are altogether n_p × n_q similarity coefficients between the two clusters, but only the maximum one is taken as the similarity coefficient of the two clusters. Note that the minimum distance corresponds to the maximum similarity coefficient:

D_{pq} = \min_{i \in G_p, j \in G_q} d_{ij}  (sample clustering),    r_{pq} = \max_{i \in G_p, j \in G_q} r_{ij}  (index clustering)    (19-7)

2. The minimum similarity coefficient method. The similarity coefficient between clusters is calculated as:

D_{pq} = \max_{i \in G_p, j \in G_q} d_{ij}  (sample clustering),    r_{pq} = \min_{i \in G_p, j \in G_q} r_{ij}  (index clustering)    (19-8)

3. The centre of gravity method (used only in sample clustering). The distance between two clusters is the distance between their centroids, whose coordinates are the means of the indexes within each cluster:

D_{pq} = d_{\bar X_p \bar X_q}    (19-9)

4. The cluster equilibration method (used only in sample clustering). Work out the average squared distance between pairs of samples drawn from the two clusters. Cluster equilibration is one of the better methods of hierarchical clustering, because it fully reflects the individual information within a cluster:

D_{pq}^2 = \frac{1}{n_p n_q} \sum_{i \in G_p, j \in G_q} d_{ij}^2    (19-10)

5. The sum of squares of deviations method, also called Ward's method (only for sample clustering). It imitates the basic idea of analysis of variance: a rational classification should make the sum of squares of deviations within clusters small and that between clusters large. Suppose the samples have been classified into g clusters, among them G_p and G_q. The sum of squares of deviations of the n_k samples in cluster G_k is

L_k = \sum_{i=1}^{n_k} \sum_{j=1}^{m} (X_{ij} - \bar X_j)^2

where \bar X_j is the mean of X_j within G_k. The pooled sum of squares of deviations of all g clusters is L_g = \sum_k L_k. If G_p and G_q are merged, there are g-1 clusters, and the increment of the pooled sum of squares of deviations

D_{pq}^2 = L_{g-1} - L_g

is defined as the squared distance between the two clusters. Obviously, when each of the n samples forms a single cluster, the pooled sum of squares of deviations is 0.
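In practice these five rules correspond, at least approximately, to the linkage methods offered by scipy; the mapping in the comments below is an approximation (for instance, scipy's 'average' method averages the distances d_ij themselves rather than the squared distances of formula (19-10)).

```python
# Between-cluster rules as scipy linkage methods (approximate mapping):
#   'single'   ~ maximum similarity coefficient method (min distance, 19-7)
#   'complete' ~ minimum similarity coefficient method (max distance, 19-8)
#   'centroid' ~ centre of gravity method (19-9)
#   'average'  ~ cluster equilibration / average linkage (cf. 19-10)
#   'ward'     ~ sum of squares of deviations (Ward) method
import numpy as np
from scipy.cluster.hierarchy import linkage

X = np.array([[1.315, 0.688], [0.174, 0.088],
              [-1.001, -1.441], [-0.488, 0.665]])   # Example 19-2 data

for method in ("single", "complete", "centroid", "average", "ward"):
    Z = linkage(X, method=method)  # rows: cluster a, cluster b, distance, size
    print(method)
    print(Z.round(3))
# scipy.cluster.hierarchy.dendrogram(Z) would draw the corresponding tree.
```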
Example 19-1. Four variables were measured on 3454 female adults: height (X1), length of legs (X2), waistline (X3) and chest circumference (X4). The correlation matrix R(0) has been worked out as follows:

        X1      X2      X3
X2    0.852
X3    0.099   0.055
X4    0.234   0.174   0.732

Try to cluster the 4 indexes using hierarchical clustering.

This is a case of R-type (index) clustering. We take the simple correlation coefficient as the similarity coefficient, and use the maximum similarity coefficient method to calculate the similarity coefficients among clusters. The clustering procedure is as follows:
(1) Each index is regarded as a single cluster: G1 = {X1}, G2 = {X2}, G3 = {X3}, G4 = {X4}. There are 4 clusters altogether.
(2) Merge the two clusters with the maximum similarity coefficient into a new cluster. In this case we merge G1 and G2 (similarity coefficient 0.852) into G5 = {X1, X2}, then calculate the similarity coefficients among G5, G3 and G4:

r_{35} = \max(r_{13}, r_{23}) = \max(0.099, 0.055) = 0.099
r_{45} = \max(r_{14}, r_{24}) = \max(0.234, 0.174) = 0.234

The similarity matrix R(1) among G3, G4 and G5 is:

        G3      G4
G4    0.732
G5    0.099   0.234

(3) Merge G3 and G4 into G6 = {G3, G4}, since the similarity coefficient between G3 and G4 is now the largest (0.732). Compute the similarity coefficient between G6 and G5:

r_{56} = \max(r_{35}, r_{45}) = \max(0.099, 0.234) = 0.234

(4) Lastly G5 and G6 are merged into one cluster G7 = {G5, G6}, which includes all the original indexes.

Draw the hierarchical dendrogram (Picture 19-1) according to the process of clustering. As the picture indicates, it is better to classify the indexes into two clusters, {X1, X2} and {X3, X4}: the length indexes form one cluster and the circumference indexes the other.

[Picture 19-1  Hierarchical dendrogram of the 4 indexes: height, length of legs, waistline, chest circumference]
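The merging sequence of Example 19-1 can be reproduced in code by converting the correlations into distances and applying single linkage, since the minimum distance corresponds to the maximum similarity coefficient; the transform d = 1 - |r| used here is an assumption of this sketch, not part of the textbook's calculation.

```python
# Reproducing Example 19-1: single linkage on d = 1 - |r| merges in the
# order {X1,X2} at r = 0.852, {X3,X4} at r = 0.732, then all at r = 0.234.
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

R = np.array([[1.000, 0.852, 0.099, 0.234],
              [0.852, 1.000, 0.055, 0.174],
              [0.099, 0.055, 1.000, 0.732],
              [0.234, 0.174, 0.732, 1.000]])   # correlations from the text

D = 1 - np.abs(R)            # high correlation -> small distance
Z = linkage(squareform(D), method="single")
print(Z.round(3))            # merge heights 0.148, 0.268, 0.766
```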
Example 19-2. Table 19-1 lists the mean energy expenditure and sugar expenditure of four athletic items measured on six athletes. In order to set appropriate dietary standards for improving performance, cluster the athletic items using hierarchical clustering.

Table 19-1  Measured values of the 4 athletic items

Athletic item                  X1 (joule/(minute·m²))   X2 (%)     X1'      X2'
Weight-loading crouching G1    27.892                   61.42      1.315    0.688
Pull-up G2                     23.475                   56.83      0.174    0.088
Push-up G3                     18.924                   45.13     -1.001   -1.441
Sit-up G4                      20.913                   61.25     -0.488    0.665

(X1 = energy expenditure, X2 = sugar expenditure; X1', X2' are the standardized values.)

We choose the Euclidean distance (the Minkowski distance with q = 2) in this example, and use the minimum similarity coefficient method to calculate the distances among clusters. To reduce the effect of the variables' differing dimensions, the variables are standardized before analysis:

X_i' = (X_i - \bar X_i) / S_i

where \bar X_i and S_i are the sample mean and standard deviation of X_i. The transformed data X1', X2' are listed in Table 19-1.
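The standardization step can be checked directly; using the sample standard deviation (ddof=1 in numpy) reproduces the transformed values X1', X2' shown in Table 19-1.

```python
# Verifying the standardization X' = (X - mean) / SD for Table 19-1.
import numpy as np

raw = np.array([[27.892, 61.42],    # G1 weight-loading crouching
                [23.475, 56.83],    # G2 pull-up
                [18.924, 45.13],    # G3 push-up
                [20.913, 61.25]])   # G4 sit-up

z = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)  # sample SD (n - 1)
print(z.round(3))   # first row ~ [ 1.315  0.688]
```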
The clustering process:
(1) Compute the similarity coefficient matrix (i.e., the distance matrix) of the 4 samples. The distance between weight-loading crouching and pull-ups is worked out from formula (19-3):

d_{12} = \sqrt{(X_{11}' - X_{21}')^2 + (X_{12}' - X_{22}')^2} = \sqrt{(1.315 - 0.174)^2 + (0.688 - 0.088)^2} = 1.289

Likewise, the distance between weight-loading crouching and push-ups is:

d_{13} = \sqrt{(1.315 - (-1.001))^2 + (0.688 - (-1.441))^2} = 3.145

Lastly, work out the whole distance matrix D(0):

        G1      G2      G3
G2    1.289
G3    3.145   1.928
G4    1.803   0.878   2.168

(2) The distance between G2 and G4 is the minimum, so G2 and G4 are merged into a new cluster G5 = {G2, G4}. Compute the distances between G5 and the other clusters using the minimum similarity coefficient method, formula (19-8):

d_{15} = \max(d_{12}, d_{14}) = \max(1.289, 1.803) = 1.803
d_{35} = \max(d_{23}, d_{34}) = \max(1.928, 2.168) = 2.168

The distance matrix D(1) of G1, G3 and G5 is:

        G1      G3
G3    3.145
G5    1.803   2.168

(3) Merge G1 and G5 into a new cluster G6 = {G1, G5}, and compute the distance between G6 and G3:

d_{36} = \max(d_{13}, d_{35}) = \max(3.145, 2.168) = 3.145

(4) Lastly merge G3 and G6 into G7 = {G3, G6}; all the samples have now been merged into one large cluster.

According to the process of clustering, draw the hierarchical dendrogram (Chart 19-2). As the dendrogram and subject-matter knowledge suggest, the items should be sorted into two clusters: {G1, G2, G4} and {G3}. Physical energy expenditure in weight-loading crouching, pull-ups and sit-ups is much higher, so dietary standards might need to be improved for those items during training.
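The whole of Example 19-2 can be reproduced with scipy, where complete linkage plays the role of the minimum similarity coefficient method; the merge heights should come out as 0.878, 1.803 and 3.145, matching the hand calculation.

```python
# Example 19-2 end to end: standardize, cluster with complete linkage,
# then cut the tree into two clusters.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

raw = np.array([[27.892, 61.42],    # G1 weight-loading crouching
                [23.475, 56.83],    # G2 pull-up
                [18.924, 45.13],    # G3 push-up
                [20.913, 61.25]])   # G4 sit-up

z = (raw - raw.mean(axis=0)) / raw.std(axis=0, ddof=1)
Z = linkage(z, method="complete")   # merge heights: 0.878, 1.803, 3.145
print(Z.round(3))

labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)   # G1, G2, G4 fall in one cluster, G3 in the other
```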
Analysis of clustering examples

Different definitions of the similarity coefficient between samples and between clusters lead to different clustering results, so expertise as well as the choice of clustering method matters for the interpretation of a clustering analysis.

Example 19-3. Twenty-seven petroleum pitch workers and pyro-furnacemen were surveyed about their age, length of service and smoking habits. In addition, serum P21, serum P53, peripheral blood lymphocyte SCE, the number of chromosomal aberrations and the number of cells with chromosomal aberrations were measured for these workers (Table 19-3). (P21 multiple = P21 detection value / mean P21 value of the control group.) Sort the 27 workers using a suitable hierarchical clustering method.

Table 19-3  Results of biomarker detection and clustering analysis of the petroleum pitch workers and pyro-furnacemen. Columns: sample number; age; length of service; smoking (cigarettes/day); serum P21; P21 multiple; P53; SCE; number of chromosomal aberrations; number of cells with chromosomal aberrations; clustering result.
[Table 19-3 data rows not legible in this copy: each row gives one worker's values for the columns above, with the clustering result in the last column.]

This example applies the minimum similarity coefficient method based on Euclidean distance, the cluster equilibration method, and the sum of squares of deviations method to cluster the data.
The results are shown in Charts 19-3, 19-4 and 19-5. All variables were standardized before the analysis.

[Chart 19-3  Hierarchical dendrogram of the 27 petroleum pitch workers and pyro-furnacemen, minimum similarity coefficient method]
[Chart 19-4  Hierarchical dendrogram of the 27 workers, cluster equilibration method]
[Chart 19-5  Hierarchical dendrogram of the 27 workers, sum of squares of deviations method]

The outcomes of the three clusterings are not the same, which shows that different methods have different efficiency, and the differences become more marked as the number of variables grows. It is therefore better to select informative variables, such as P21 and P53 in this example, before the clustering analysis. More information can be obtained by reading the dendrograms. Judged by expertise, the outcome of the cluster equilibration method is the most reasonable; its classification is filled in the last column of Table 19-3. Workers numbered 10, 20 and 23 form one class and the others form another, and researchers found that workers 10, 20 and 23 are at high risk of cancer. In the sum of squares of deviations dendrogram, numbers 10, 20, 23, 8, 16 and 26 are clustered together, a reminder that workers 8, 16 and 26 may be at high risk too.
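Comparing the three methods on data like these takes only a loop; in the sketch below X is a random placeholder standing in for the standardized 27 × 8 data matrix of Table 19-3, which is not reproduced here.

```python
# Comparing the three linkage methods of Example 19-3 on a placeholder
# matrix; with the real Table 19-3 data the resulting partitions would be
# compared against the last column of the table.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

rng = np.random.default_rng(0)
X = rng.standard_normal((27, 8))    # placeholder, NOT the real worker data

for method in ("complete", "average", "ward"):
    Z = linkage(X, method=method)
    labels = fcluster(Z, t=2, criterion="maxclust")
    print(method, labels)           # the 2-cluster partitions often differ
```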
Dynamic clustering

If there are many samples to classify, hierarchical clustering analysis demands a great deal of space to store the similarity coefficient matrix and is quite inefficient; moreover, samples cannot be reassigned once they have been classified. Because of these shortcomings, statisticians put forward dynamic clustering, which overcomes the inefficiency and adjusts the classification as the clustering proceeds. The principle of dynamic clustering analysis is: first, select several representative samples, called cohesion points, as the core of each class; second, classify the remaining samples and adjust the core of each class until the classification is reasonable. The most common dynamic clustering method is k-means, whose principle is simple and which is quite efficient: results can be obtained even when the number of samples is very large. However, the number of classes must be known before the analysis; expertise may tell us this number in some circumstances, but not in others. A minimal sketch of k-means follows.
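The sketch below implements the idea just described from scratch on synthetic two-group data: choose k cohesion points, assign every sample to the nearest core, move each core to the mean of its class, and repeat until the assignment stabilizes (a real implementation would also need to handle empty clusters).

```python
# A minimal k-means sketch (no empty-cluster guard).
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    cores = X[rng.choice(len(X), size=k, replace=False)]  # initial cohesion points
    labels = np.zeros(len(X), dtype=int)
    for _ in range(n_iter):
        # assign every sample to its nearest core
        d = np.linalg.norm(X[:, None, :] - cores[None, :, :], axis=2)
        labels = d.argmin(axis=1)
        # move each core to the mean of its class
        new_cores = np.array([X[labels == j].mean(axis=0) for j in range(k)])
        if np.allclose(new_cores, cores):
            break
        cores = new_cores
    return labels, cores

rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.3, (10, 2)),    # synthetic group 1
               rng.normal(3.0, 0.3, (10, 2))])   # synthetic group 2
labels, cores = kmeans(X, k=2)
print(labels)   # the two synthetic groups are recovered
```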
Ordered sample clustering analysis