FAFU Machine Learning 06: Decision Trees (FAFU机器学习 06-DecisionTree中文.pptx)


Machine Learning: Decision Tree Basics

The top 10 data mining algorithms: C4.5, k-means, support vector machines, Apriori, EM (maximum likelihood), PageRank, AdaBoost, kNN, Naive Bayes, CART.

Main classification methods: logistic regression, linear discriminant analysis, decision tree induction, nearest neighbor, Bayesian classification methods, back-propagation classification, support vector machines, ensemble methods.

(Slides: illustrating the classification task, two example decision trees, and applying the model to test data; figures omitted.)

Decision tree induction algorithms
Many algorithms exist: Hunt's algorithm (one of the earliest), ID3 (Iterative Dichotomiser), C4.5, CART (Classification and Regression Trees), and scalable, parallelizable induction methods for supervised learning on large tasks.

Basic algorithm (a greedy algorithm):
- The tree is constructed in a top-down, recursive, divide-and-conquer manner.
- At the start, all training examples are at the root.
- Attributes are categorical (if continuous-valued, they are discretized in advance).
- Examples are partitioned recursively based on selected attributes.
- Test attributes are selected on the basis of a heuristic or statistical measure (e.g., information gain).

Conditions for stopping the partitioning:
- All samples for a given node belong to the same class.
- There are no remaining attributes for further partitioning; majority voting is employed to classify the leaf.
- There are no samples left.

Greedy strategy: split the records based on an attribute test that optimizes a certain criterion. Issues:
- How to select the best attribute?
- How to split the records?
- How to determine the best split?
- When to stop splitting?

How to split the records? It depends on the attribute type (nominal, ordinal, continuous) and on the number of ways to split (2-way split or multi-way split).

Splitting based on nominal attributes:
- Multi-way split: use as many partitions as there are distinct values.
- Binary split: divide the values into two subsets; need to find the optimal partitioning.

Splitting based on ordinal attributes:
- Multi-way split: use as many partitions as there are distinct values.
- Binary split: divide the values into two subsets (respecting the order); need to find the optimal partitioning.

Splitting based on continuous attributes:
- Discretization to form an ordinal categorical attribute, or
- Binary decision: (A < v) or (A >= v); consider all possible splits and find the best cut. This can be more computation-intensive.

How to determine the best split?
Before splitting: 10 records of class C0 and 10 records of class C1.
Greedy approach: nodes with a homogeneous class distribution are preferred, so we need a measure of node impurity (purity), as sketched below.
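As a concrete illustration (my own addition, not from the slides), here is a minimal Python sketch of the two impurity measures introduced next, entropy and the Gini index, computed from an array of class labels; the function names are hypothetical.

import numpy as np

def entropy(labels):
    # Entropy: -sum(p_i * log2(p_i)) over the class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def gini(labels):
    # Gini impurity: 1 - sum(p_i^2) over the class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(1.0 - np.sum(p ** 2))

# The node above, before splitting: 10 records of C0 and 10 of C1
labels = np.array(["C0"] * 10 + ["C1"] * 10)
print(entropy(labels))   # 1.0, the maximum for two classes
print(gini(labels))      # 0.5, the maximum for two classes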

Measures of node impurity: entropy and the Gini index.
(Slide: brief review of entropy, with the binary-entropy curve for m = 2 classes; figure omitted.)

Attribute selection measure: information gain (ID3/C4.5)
- Select the attribute with the highest information gain.
- Let pi be the probability that an arbitrary tuple in D belongs to class Ci, estimated by |Ci,D|/|D|.
- Expected information (entropy) needed to classify a tuple in D:
    Info(D) = -Σ_{i=1..m} pi log2(pi)
- Information needed (after using A to split D into v partitions) to classify D:
    InfoA(D) = Σ_{j=1..v} (|Dj|/|D|) × Info(Dj)
- Information gained by branching on attribute A:
    Gain(A) = Info(D) - InfoA(D)
  i.e., information gain is the difference between the entropy of the parent node and the weighted average of the child nodes' entropies.

Attribute selection example: information gain
- Class P: buys_computer = "yes" (9 tuples); class N: buys_computer = "no" (5 tuples).
    Info(D) = I(9,5) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
- "age <= 30" covers 5 of the 14 samples, with 2 yeses and 3 noes, contributing (5/14) I(2,3). Hence
    Infoage(D) = (5/14) I(2,3) + (4/14) I(4,0) + (5/14) I(3,2) = 0.694
    Gain(age) = Info(D) - Infoage(D) = 0.246
- Similarly, Gain(income) = 0.029, Gain(student) = 0.151, Gain(credit_rating) = 0.048, so age is chosen first.

Training data (the 14-tuple buys_computer set):

age      income  student credit_rating buys_computer
<=30     high    no      fair          no
<=30     high    no      excellent     no
31...40  high    no      fair          yes
>40      medium  no      fair          yes
>40      low     yes     fair          yes
>40      low     yes     excellent     no
31...40  low     yes     excellent     yes
<=30     medium  no      fair          no
<=30     low     yes     fair          yes
>40      medium  yes     fair          yes
<=30     medium  yes     excellent     yes
31...40  medium  no      excellent     yes
31...40  high    yes     fair          yes
>40      medium  no      excellent     no
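As a check on these numbers (my own sketch, not part of the slides), the following Python snippet recomputes Info(D), Infoage(D), and Gain(age) from the table above:

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

# (age, buys_computer) columns of the 14-tuple table above
rows = [
    ("<=30", "no"), ("<=30", "no"), ("31...40", "yes"), (">40", "yes"),
    (">40", "yes"), (">40", "no"), ("31...40", "yes"), ("<=30", "no"),
    ("<=30", "yes"), (">40", "yes"), ("<=30", "yes"), ("31...40", "yes"),
    ("31...40", "yes"), (">40", "no"),
]
age = np.array([r[0] for r in rows])
y = np.array([r[1] for r in rows])

info_D = entropy(y)
info_age = sum((np.sum(age == v) / len(y)) * entropy(y[age == v])
               for v in np.unique(age))
print(info_D, info_age, info_D - info_age)
# about 0.940, 0.694 and 0.247 (the slide's 0.246 comes from subtracting the rounded values)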

Computing information gain for continuous-valued attributes
- Let attribute A be a continuous-valued attribute.
- We must determine the best split point for A.
- Sort the values of A in increasing order.
- Typically, the midpoint between each pair of adjacent values is considered as a possible split point: (ai + ai+1)/2 is the midpoint between the values of ai and ai+1.
- The point with the minimum expected information requirement for A is selected as the split point for A.
- Split: D1 is the set of tuples in D satisfying A <= split-point, and D2 is the set of tuples in D satisfying A > split-point.
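A minimal sketch of this midpoint search (my own illustration; best_split_point is a hypothetical helper, and the numeric values and labels are made up):

import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-np.sum(p * np.log2(p)))

def best_split_point(values, labels):
    # Try the midpoint between each pair of adjacent sorted values and
    # return the one with the minimum expected information InfoA(D).
    order = np.argsort(values)
    v = np.asarray(values, dtype=float)[order]
    y = np.asarray(labels)[order]
    best_split, best_info = None, np.inf
    for a, b in zip(v[:-1], v[1:]):
        if a == b:
            continue                      # no midpoint between equal values
        split = (a + b) / 2.0
        left, right = y[v <= split], y[v > split]
        info = (len(left) * entropy(left) + len(right) * entropy(right)) / len(y)
        if info < best_info:
            best_split, best_info = split, info
    return best_split, best_info

values = [65, 70, 75, 78, 80, 85, 90, 95, 96]           # made-up numeric attribute
labels = ["n", "n", "y", "y", "y", "y", "n", "y", "n"]  # made-up class labels
print(best_split_point(values, labels))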

Gain ratio for attribute selection (C4.5)
- The information gain measure is biased towards attributes with a large number of values.
- C4.5 (a successor of ID3) uses the gain ratio to overcome the problem (a normalization of information gain):
    SplitInfoA(D) = -Σ_{j=1..v} (|Dj|/|D|) log2(|Dj|/|D|)
    GainRatio(A) = Gain(A) / SplitInfoA(D)
- Example: for income, SplitInfo(D) = 1.557, so gain_ratio(income) = 0.029 / 1.557 = 0.019.
- The attribute with the maximum gain ratio is selected as the splitting attribute.
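A quick numeric check of SplitInfo for income (my own sketch; the partition sizes 4 low / 6 medium / 4 high come from the table above):

import numpy as np

def split_info(sizes):
    # SplitInfoA(D) = -sum(|Dj|/|D| * log2(|Dj|/|D|))
    p = np.asarray(sizes, dtype=float)
    p = p / p.sum()
    return float(-np.sum(p * np.log2(p)))

si = split_info([4, 6, 4])      # income: low, medium, high
print(round(si, 3))             # 1.557
print(round(0.029 / si, 3))     # 0.019, the gain ratio of income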

Gini index (CART, IBM IntelligentMiner)
- If a data set D contains examples from n classes, the Gini index gini(D) is defined as
    gini(D) = 1 - Σ_{j=1..n} pj²
  where pj is the relative frequency of class j in D.
- If a data set D is split on A into two subsets D1 and D2, the Gini index giniA(D) is defined as
    giniA(D) = (|D1|/|D|) gini(D1) + (|D2|/|D|) gini(D2)
- Reduction in impurity:
    Δgini(A) = gini(D) - giniA(D)
- The attribute that provides the smallest ginisplit(D) (or, equivalently, the largest reduction in impurity) is chosen to split the node (CART needs to enumerate all possible splitting points for each attribute).

Computation of the Gini index
- Example: D has 9 tuples with buys_computer = "yes" and 5 with "no":
    gini(D) = 1 - (9/14)² - (5/14)² = 0.459
- Suppose the attribute income partitions D into 10 tuples in D1: {low, medium} and 4 tuples in D2: {high}:
    gini_income∈{low,medium}(D) = (10/14) gini(D1) + (4/14) gini(D2) = 0.443
- Gini{low,high} is 0.458 and Gini{medium,high} is 0.450, so we split on {low, medium} (and {high}) since it has the lowest Gini index.
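A small Python check of these values (my own sketch; the class counts 7 yes / 3 no in D1 and 2 yes / 2 no in D2 are read off the table above):

def gini_from_counts(counts):
    # Gini index 1 - sum(pj^2) from per-class counts
    total = float(sum(counts))
    return 1.0 - sum((c / total) ** 2 for c in counts)

gini_D = gini_from_counts([9, 5])                       # 0.459
gini_D1 = gini_from_counts([7, 3])                      # income in {low, medium}
gini_D2 = gini_from_counts([2, 2])                      # income = high
gini_split = (10 / 14) * gini_D1 + (4 / 14) * gini_D2   # 0.443
print(round(gini_D, 3), round(gini_split, 3))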

Comparing attribute selection measures
The three measures, in general, return good results, but:
- Information gain is biased towards multivalued attributes.
- Gain ratio tends to prefer unbalanced splits in which one partition is much smaller than the others.
- The Gini index is biased towards multivalued attributes, has difficulty when the number of classes is large, and tends to favor tests that result in equal-sized partitions with purity in both partitions.

Overfitting and tree pruning
Overfitting: an induced tree may overfit the training data.
- Too many branches, some of which may reflect anomalies due to noise or outliers.
- Poor accuracy for unseen samples.
Two approaches to avoid overfitting:
- Prepruning: halt tree construction early; do not split a node if doing so would make the goodness measure fall below a threshold. It is difficult to choose an appropriate threshold.
- Postpruning: remove branches from a "fully grown" tree, obtaining a sequence of progressively pruned trees, and use a data set different from the training data to decide which is the "best pruned tree". (A scikit-learn sketch of both styles follows.)
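As an illustration (my own addition), scikit-learn exposes both styles: prepruning through constructor limits such as max_depth or min_samples_leaf, and postpruning through minimal cost-complexity pruning (ccp_alpha), with a held-out set used to pick the best pruned tree:

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

# Prepruning: stop growing early with depth / leaf-size limits
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5, random_state=0)
pre.fit(X_train, y_train)
print("prepruned validation accuracy:", pre.score(X_val, y_val))

# Postpruning: grow fully, then pick the ccp_alpha that scores best on held-out data
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
best = max(
    (DecisionTreeClassifier(random_state=0, ccp_alpha=a).fit(X_train, y_train)
     for a in path.ccp_alphas),
    key=lambda t: t.score(X_val, y_val),
)
print("postpruned validation accuracy:", best.score(X_val, y_val))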

Enhancements to basic decision tree induction
- Allow for continuous-valued attributes: dynamically define new discrete-valued attributes that partition the continuous attribute values into a discrete set of intervals.
- Handle missing attribute values: assign the most common value of the attribute, or assign a probability to each of the possible values.
- Attribute construction: create new attributes based on existing ones that are sparsely represented; this reduces fragmentation, repetition, and replication.

Classification in large databases
- Classification is a classical problem extensively studied by statisticians and machine learning researchers.
- Scalability: classifying data sets with millions of examples and hundreds of attributes at reasonable speed.
- Why is decision tree induction popular?
  - Relatively fast learning speed (compared with other classification methods).
  - Convertible to simple and easy-to-understand classification rules.
  - Can use SQL queries for accessing databases.
  - Comparable classification accuracy with other methods.

Decision tree classifier in sklearn

class sklearn.tree.DecisionTreeClassifier(criterion='gini', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, class_weight=None, presort=False)

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

The Iris dataset
The Iris dataset is a classic dataset from the 1930s; it is one of the first modern examples of statistical classification. The setting is that of Iris flowers, of which there are multiple species that can be identified by their morphology. Today, the species would be defined by their genomic signatures, but in the 1930s, DNA had not even been identified as the carrier of genetic information.
The dataset uses measurements of iris flowers as its data: 150 samples in 3 classes of 50 samples each, with 4 attributes per sample; it is a very commonly used training and test set in data mining and classification. The three classes are setosa, versicolor, and virginica.
The following four attributes of each plant were measured: sepal length, sepal width, petal length, petal width.

The first step is visualization:

from matplotlib import pyplot as plt
from sklearn.datasets import load_iris
import numpy as np

# We load the data with load_iris from sklearn
data = load_iris()
features = data['data']
feature_names = data['feature_names']
target = data['target']

for t, marker, c in zip(range(3), '>ox', 'rgb'):
    # We plot each class on its own to get different colored markers
    plt.scatter(features[target == t, 0], features[target == t, 1], marker=marker, c=c)

sklearn.tree.DecisionTreeClassifier:

>>> from sklearn import tree
>>> X = [[0, 0], [1, 1]]
>>> Y = [0, 1]
>>> clf = tree.DecisionTreeClassifier()
>>> clf = clf.fit(X, Y)
>>> clf.predict([[2., 2.]])
array([1])
>>> clf.predict_proba([[2., 2.]])
array([[0., 1.]])

Cross-validation on Iris (sklearn.cross_validation has since been renamed sklearn.model_selection):

>>> from sklearn.datasets import load_iris
>>> from sklearn.model_selection import cross_val_score
>>> from sklearn.tree import DecisionTreeClassifier
>>> clf = DecisionTreeClassifier(random_state=0)
>>> iris = load_iris()
>>> cross_val_score(clf, iris.data, iris.target, cv=10)
array([1.  , 0.93..., 0.86..., 0.93..., 0.93..., 0.93..., 0.93..., 1.  , 0.93..., 1.  ])
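Once a tree is fitted it can also be inspected directly; for example (my own addition, using scikit-learn's export_text and plot_tree helpers):

from matplotlib import pyplot as plt
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree

iris = load_iris()
clf = DecisionTreeClassifier(max_depth=3, random_state=0).fit(iris.data, iris.target)

# Text view of the learned splits
print(export_text(clf, feature_names=list(iris.feature_names)))

# Graphical view of the same tree
plot_tree(clf, feature_names=iris.feature_names, class_names=list(iris.target_names), filled=True)
plt.show()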

Example: the Internet Advertisements data set
http://archive.ics.uci.edu/ml/datasets/Internet+Advertisements

Data set characteristics: multivariate
Number of instances: 3279
Area: computer
Attribute characteristics: categorical, integer, real
Number of attributes: 1558
Date donated: 1998-07-01
Associated tasks: classification
Missing values? Yes
Number of web hits: 307075

Decision tree regressor
A regression tree, as the name suggests, uses a tree model for a regression problem: every leaf outputs a predicted value. The predicted value is generally the mean of the outputs of the training examples contained in that leaf, i.e. cm = ave(yi | xi ∈ leafm).

CART in classification vs. regression:
- Similarities: in both classification and regression, CART is a binary tree; every node other than the leaves has exactly two children; all inputs that fall into the same leaf produce the same output.
- Differences: in classification, CART uses the Gini index as the criterion for choosing the feature and the split; in regression it uses MSE (mean squared error) or MAE (mean absolute error). In classification each leaf represents a class; in regression each leaf represents a predicted value, which is continuous.

Given a data set D = {(x1, y1), (x2, y2), ..., (xi, yi), ..., (xn, yn)}, where each xi is an m-dimensional vector (i.e. xi has m features), the goal of regression is to construct a function f(x) that fits the elements of D with minimal MSE, i.e.

    min (1/n) Σ_{i=1..n} (f(xi) - yi)²

Suppose a fully built CART regression tree has M leaves. This means CART partitions the input space into M cells R1, R2, ..., RM and produces at most M distinct predicted values. The MSE that CART minimizes is

    (1/n) Σ_{m=1..M} Σ_{xi ∈ Rm} (yi - cm)²,  with cm = ave(yi | xi ∈ Rm).

At every split, the splitting variable and the splitting point (that is, the feature and the value that cut the feature space in two) are chosen so that the MSE of the model on the training set, i.e. the sum of the MSEs over the leaves, is minimal. A heuristic method is used: enumerate all candidate splitting variables and splitting points and pick the combination with the smallest sum of leaf MSEs. Choosing the j-th feature x(j) and one of its values s as the splitting variable and splitting point divides the parent node's input space in two:

    R1(j, s) = { x | x(j) <= s },   R2(j, s) = { x | x(j) > s }

and CART selects j and s by

    min_{j,s} [ min_{c1} Σ_{xi ∈ R1(j,s)} (yi - c1)²  +  min_{c2} Σ_{xi ∈ R2(j,s)} (yi - c2)² ]
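A minimal sketch (my own) of this exhaustive search for a single split, scoring each candidate by the sum of squared errors of the two children; best_regression_split is a hypothetical helper and the data are made up:

import numpy as np

def sse(y):
    # Sum of squared errors around the leaf prediction c = mean(y)
    return float(np.sum((y - y.mean()) ** 2)) if len(y) else 0.0

def best_regression_split(X, y):
    # Enumerate every feature j and every observed threshold s and
    # return the (j, s) minimizing SSE(R1) + SSE(R2)
    X, y = np.asarray(X, dtype=float), np.asarray(y, dtype=float)
    best_j, best_s, best_err = None, None, np.inf
    for j in range(X.shape[1]):
        for s in np.unique(X[:, j])[:-1]:      # splitting at the largest value is useless
            left, right = y[X[:, j] <= s], y[X[:, j] > s]
            err = sse(left) + sse(right)
            if err < best_err:
                best_j, best_s, best_err = j, s, err
    return best_j, best_s, best_err

# Tiny made-up example: one feature, y jumps between x = 3 and x = 4
X = [[1.0], [2.0], [3.0], [4.0], [5.0], [6.0]]
y = [1.1, 0.9, 1.0, 5.2, 4.8, 5.0]
print(best_regression_split(X, y))   # picks the split x(0) <= 3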

Decision tree regressor in sklearn

class sklearn.tree.DecisionTreeRegressor(criterion='mse', splitter='best', max_depth=None, min_samples_split=2, min_samples_leaf=1, min_weight_fraction_leaf=0.0, max_features=None, random_state=None, max_leaf_nodes=None, min_impurity_decrease=0.0, min_impurity_split=None, presort=False)

https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeRegressor.html#sklearn.tree.DecisionTreeRegressor

sklearn.tree.DecisionTreeRegressor:

>>> from sklearn import tree
>>> X = [[0, 0], [2, 2]]
>>> y = [0.5, 2.5]
>>> clf = tree.DecisionTreeRegressor()
>>> clf = clf.fit(X, y)
>>> clf.predict([[1, 1]])
array([0.5])

Cross-validation on the Boston housing data (sklearn.cross_validation is now sklearn.model_selection; load_boston has been removed from recent scikit-learn releases):

>>> from sklearn.datasets import load_boston
>>> from sklearn.model_selection import cross_val_score
>>> from sklearn.tree import DecisionTreeRegressor
>>> boston = load_boston()
>>> regressor = DecisionTreeRegressor(random_state=0)
>>> cross_val_score(regressor, boston.data, boston.target, cv=10)
array([ 0.61...,  0.57..., -0.34...,  0.41...,  0.75...,  0.07...,  0.29...,  0.33..., -1.42..., -1.77...])
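To see how tree depth trades off under- and over-fitting in regression, here is a short sketch (my own, loosely following the standard scikit-learn regression-tree example) fitting trees of two different depths to a noisy sine curve:

import numpy as np
from matplotlib import pyplot as plt
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = np.sort(5 * rng.rand(80, 1), axis=0)           # 80 points in [0, 5)
y = np.sin(X).ravel() + 0.1 * rng.randn(80)        # noisy sine targets

X_test = np.arange(0.0, 5.0, 0.01)[:, np.newaxis]
for depth, color in [(2, "green"), (5, "red")]:
    pred = DecisionTreeRegressor(max_depth=depth).fit(X, y).predict(X_test)
    plt.plot(X_test, pred, color=color, label="max_depth=%d" % depth)

plt.scatter(X, y, s=15, c="darkorange", label="training data")
plt.legend()
plt.show()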
