FAFU机器学习 09-2lusteringethods中文.pptx_163文库

资源描述

1、机器学习和聚类方法基础（）2020/12/3聚类方法第9-1课聚类方法分区方法（），基于原型的聚类（）分区方法（），基于原型的聚类（）:K-均值均值层次聚类（）基于密度的聚类（）2020/12/3聚类方法第9-2课K均值算法(K)以中心为基础集群是一组对象，使得集群中的一个对象比任何其他集群的中心更接近（更相似）集群的“中心”星团的中心叫做质心每个点被分配到具有最接近质心的聚类中通常应该指定群集的数量2020/12/3聚类方法第9-3课K均值算法将x1，xn划分为K个簇（K是预定义的）初始化指定初始群集中心（质心）迭代直到没有变化对于每个对象xi（群集分配）计算xi和K个质心之间的距离（重新）

2、将xi分配给质心最接近xi的群集根据当前分配更新群集质心（更新群集质心）2020/12/3聚类方法第9-4课K均值:初始化2020/12/3聚类方法第9-5课K均值聚类:聚类分配2020/12/3聚类方法第9-6课K-means聚类:更新聚类质心2020/12/3聚类方法第9-7课误差平方和(SSE)K-均值:假设集群Ci的质心是UI对于Ci中的每个对象x，计算x和质心UI之间的平方误差求出所有对象的误差2020/12/3聚类方法第9-8课K均值2020/12/3聚类方法第9-9课关于K-均值方法的评述力量效率:O(tkn)，其中n是#个对象，k是#个簇，t是#个迭代。正常情况下，k，tn易于

3、实现问题需要指定K，簇的数量局部最小值-初始化问题可能会出现空簇2020/12/3聚类方法第9-10课预处理和后处理预处理归一化数据消除异常值后处理消除可能代表离群值的小聚类拆分“松散”群集，即具有相对较高SSE的群集合并“接近”且SSE相对较低的群集2020/12/3聚类方法第9-11课K-均值法的变体K-均值的大多数变体在相异计算计算聚类均值的策略k-均值的两个重要问题对噪声数据和异常值敏感k-中间点算法只适用于连续多维空间中的物体对分类数据采用K-模式方法2020/12/3聚类方法第9-12课K-均值的局限性K-means在聚类不同时存在问题SizesDensitiesIrregular

4、 shapes2023-11-4Clustering MethodsLesson 9-13Limitations of K-meansK-means has problems when clusters are of differingSizesDensitiesIrregular shapes2023-11-4Clustering MethodsLesson 9-14Limitations of K-meansK-means has problems when clusters are of differingSizesDensitiesIrregular shapes2023-11-4Cl

5、ustering MethodsLesson 9-15Clustering BasicsPartitional Methods（基于划分的方法），prototype-based clustering（基于原型的聚类）：K-meansHierarchical Clustering（层次聚类）（层次聚类）Density-based Clustering（密度聚类）2023-11-4Clustering MethodsLesson 9-16Hierarchical ClusteringAgglomerative approach(聚合策略)Divisive Approaches(分拆策略)2023-

6、11-4Clustering MethodsLesson 9-17Dendrogram（树状图）A tree that shows how clusters are merged/split hierarchicallyEach node on the tree is a cluster;each leaf node is a singleton cluster2023-11-4Clustering MethodsLesson 9-18Dendrogram（树状图）A tree that shows how clusters are merged/split hierarchicallyEac

7、h node on the tree is a cluster;each leaf node is a singleton clusterA clustering of the data objects is obtained by cutting the dendrogram at the desired level,then each connected component forms a cluster2023-11-4Clustering MethodsLesson 9-19Hierarchical ClusteringAgglomerative approach(聚合策略)AGNES

8、(AGglomerative NESting)是一种采用自底向上聚合策略的层次聚类算法AGNES工作过程：先将数据集中的每个样本看作一个初始聚类簇；然后在算法运行的每一步中找出距离最近的两个聚类簇进行合并；步骤（2）不断重复，直至达到预设的聚类簇的个数。关键：如何计算聚类簇之间的距离最小距离，最大距离，平均距离2023-11-4Clustering MethodsLesson 9-20Hierarchical ClusteringAgglomerative approach(聚合策略)AGNES(AGglomerative NESting)是一种采用自底向上聚合策略的层次聚类算法计算聚类簇之间

9、的距离最小距离最大距离平均距离2023-11-4Clustering MethodsLesson 9-21Hierarchical ClusteringAgglomerative approach(聚合策略)AGNES(AGglomerative NESting)是一种采用自底向上聚合策略的层次聚类算法计算聚类簇之间的距离当聚类簇距离为dmin：AGNES 算法被称为单链接单链接（single-linkage）；dmax：AGNES 算法被称为全链接全链接（complete-linkage）；davg：AGNES 算法被称为均链接均链接（average-linkage）2023-11-4Clu

10、stering MethodsLesson 9-22AGNES算法描述输入：样本集 D;聚类簇距离度量函数 dist；聚类簇数 k。输出：簇划分 C=C1,C2,.,Ck1.为每个样本创建一个簇；2.计算距离矩阵；3.开始合并簇过程，初始化聚类簇个数 q=n：A.从距离矩阵中找出距离最近的两个聚类簇 Ci 和Cj，i j；B.合并这两个簇（优先合并到下标较小的簇 Ci=Ci Cj)；C.将聚类簇重新编号（合并到下标较小的簇可以减少重编号的次数）；D.删除距离矩阵的第 j 行与第 j 列；E.计算新 Ci与剩余其他簇之间的距离，更新距离矩阵。F.q=q-1。G.直到 q=k 时，退出循环。4.返

11、回簇划分。2023-11-4Clustering MethodsLesson 9-23Clustering BasicsPartitional Methods（基于划分的方法），prototype-based clustering（基于原型的聚类）：K-meansHierarchical Clustering（层次聚类）Density-based Clustering（密度聚类）（密度聚类）2023-11-4Clustering MethodsLesson 9-24Density-based Clustering Basic ideaClusters are dense regions in

12、the data space,separated by regions of lower object densityA cluster is defined as a maximal set of density-connected pointsDiscovers clusters of arbitrary shapeMethodDBSCAN2023-11-4Clustering MethodsLesson 9-25DBSCANDBSCAN 是一种著君的密度粟类算法?它基于一组“邻域”(neighborhood)参数（,MinPts）来刻画样本分布的紧密程度。给定数据集D=xl,x2,.,x

13、m，定义下面这几个概念:-Neighborhood(-邻域)Objects within a radius of from an object.Given and MinPts,可以把样本中的点分成三类核心点（core point）:A point is a core point if it has more than a specified number of points(MinPts)within its-Neighborhood.边缘点（border point）：A border point has fewer than MinPts within its-Neighborhood,

14、but is in the neighborhood of a core point.离群点（Outlier）：既不是核心点也不是边缘点，则是不属于这一类的点2023-11-4Clustering MethodsLesson 9-26DBSCANDBSCAN 是一种著君的密度粟类算法?它基于一组“邻域”(neighborhood)参数（,MinPts）来刻画样本分布的紧密程度。给定数据集D=xl,x2,.,xm，定义下面这几个概念:-Neighborhood(-邻域)Objects within a radius of from an object.Given and MinPts,可以把样本

15、中的点分成核心点（core point）、边缘点（border point）和离群点（Outlier）密度直达(directly density-reachable)密度可达(density-reachable)密度相连(density-connected)2023-11-4Clustering MethodsLesson 9-27Density-reachabilityDirectly density-reachable（密度直达）An object q is directly density-reachable from object p if p is a core object and

16、 q is in ps-neighborhood.2023-11-4Clustering MethodsLesson 9-28Density-reachabilityDensity-Reachable（密度可达）(directly and indirectly):假设存在样本序列q,p1,p2,pn,pA point p is directly density-reachable from pn pi+1 is directly density-reachable from pip1 is directly density-reachable from qqp1 p2 pn p form a

17、chain2023-11-4Clustering MethodsLesson 9-29Density-reachability密度相连（density-connected）：对xi与xj，若存在xk使得xi与xj均由xk密度可达，则称xi与xj密度相连。密度相连关系满足对称性。2023-11-4Clustering MethodsLesson 9-30DBSCAN算法伪代码输入：数据集，邻域半径，邻域中数据对象数目阈值 MinPts;输出：密度联通簇。1.从数据集中任意选取一个数据对象点 p；2.如果对于参数和 MinPts，所选取的数据对象点 p 为核心点，则找出所有从 p 密度可达的数

18、据对象点，形成一个簇；3.如果选取的数据对象点 p 是边缘点，选取另一个数据对象点；4.重复2、3步，直到所有点被处理。DBSCAN 算法的计算复杂的度为 O(n)，n 为数据对象的数目。这种算法对于输入参数和 MinPts 是敏感的。2023-11-4Clustering MethodsLesson 9-31sklearn.cluster:Clusteringcluster.AgglomerativeClusteringAgglomerative Clusteringcluster.DBSCANPerform DBSCAN clustering from vector array or d

19、istance matrix.cluster.KMeansK-Means clusteringcluster.MiniBatchKMeansMini-Batch K-Means clusteringcluster.SpectralClusteringApply clustering to a projection to the normalized laplacian.cluster.WardWard hierarchical clustering:constructs a tree and cuts it.2023-11-4Clustering MethodsLesson 9-32class

20、 sklearn.cluster.KMeans(n_clusters=8,init=k-means+,n_init=10,max_iter=300,tol=0.0001,precompute_distances=True,verbose=0,random_state=None,copy_x=True,n_jobs=1)n_clusters:int,optional,default:8The number of clusters to form as well as the number of centroids to generate.n_init:int,default:10Number o

21、f time the k-means algorithm will be run with different centroid seeds.The final results will be the best output of n_init consecutive runs in terms of inertia.init:k-means+,random or an ndarrayMethod for initialization,defaults to k-means+:2023-11-4Clustering MethodsLesson 9-33class sklearn.cluster

22、.AgglomerativeClustering(n_clusters=2,affinity=euclidean,memory=Memory(cachedir=None),connectivity=None,n_components=None,compute_full_tree=auto,linkage=ward,pooling_func=)linkage:“ward”,“complete”,“average”,optional,default:“ward”Which linkage criterion to use.The linkage criterion determines which

23、 distance to use between sets of observation.The algorithm will merge the pairs of cluster that minimize this criterion.ward minimizes the variance of the clusters being merged.average uses the average of the distances of each observation of the two plete or maximum linkage uses the maximum distance

24、s between all observations of the two sets.2023-11-4Clustering MethodsLesson 9-34class sklearn.cluster.DBSCAN(eps=0.5,min_samples=5,metric=euclidean,algorithm=auto,leaf_size=30,p=None,random_state=None)eps:float,optionalThe maximum distance between two samples for them to be considered as in the sam

25、e neighborhood.min_samples:int,optionalThe number of samples in a neighborhood for a point to be considered as a core point.metric:string,or callableThe metric to use when calculating distance between instances in a feature array.If metric is a string or callable,it must be one of the options allowed by metrics.pairwise.calculate_distance for its metric parameter.2023-11-4Clustering MethodsLesson 9-35Summary2023-11-4Clustering MethodsLesson 9-36

展开阅读全文