商务智能课件：第4章分类.ppt_163文库

资源描述

1、Principles and Applications of Business IntelligenceChap 4 ：分类 1Introduction to商务智能方法与应用Liu HongyanPrinciples and Applications of Business Intelligence第4章分类Chapter 4: ClassificationPrinciples and Applications of Business IntelligenceChap 4 ：分类 2信息管理学院数据挖掘十大算法v The k-means algorithm v The Apriori

2、algorithm v ExpectationMaximization v PageRank v AdaBoost Principles and Applications of Business IntelligenceChap 4 ：分类 3主要内容 4.1 概念 4.2 决策树分类方法 4.3 朴素贝叶斯分类方法 4.4 k近邻分类方法 4.5 分类性能的度量Principles and Applications of Business IntelligenceChap 4 ：分类 44.1 基本概念Principles and Applications of Business Int

3、elligenceChap 4 ：分类 5信息管理学院分类（classification）：总结已有类别的对象的特点并进而进行未知类别对象的类别预测的过程用给定的训练集用来建立一个分类模型（或称分类器），所建立的分类模型用来预测数据库中类标号未知的数据元组的类别。训练数据集由一组数据库元组（称为训练样本、实例或对象）构成样本形式为（v1，v2，vn；c），其中vi表示属性值，c表示类标号。分类及其相关的基本概念Principles and Applications of Business IntelligenceChap 4 ：分类 6分类及其相关的基本概念分类器（classifier

4、) 训练数据集（training dataset）- 分类属性（class label attribute），每个取值称为一个类别（class label）- 属性，用于描述一个对象的某个特性或性质测试数据集（testing dataset）Principles and Applications of Business IntelligenceChap 4 ：分类 7信息管理学院分类属于有监督学习还是无监督学习？分类属于有监督学习还是无监督学习？? 有监督学习有监督学习（classification）训练集是带有类标签的；新的数据是基于训练集进行分类的? 无监督学习无监督学习（c

5、lustering）训练集是没有类标签的；提供一组属性，然后寻找出训练集中存在的类别或者聚集Principles and Applications of Business IntelligenceChap 4 ：分类 8信息管理学院Principles and Applications of Business IntelligenceChap 4 ：分类 9分类及其相关的基本概念客户编号年龄性别年收入（万）婚姻豪华车130女86已婚否230男65单身否330男90离异否450女96离异否1150女80单身否1250男50单身是1350女80离异否1450男92离异是分类属性类别训练数据集

6、属性Principles and Applications of Business IntelligenceChap 4 ：分类 10分类方法 Lazy Eager- 构建模型- 测试、使用模型Principles and Applications of Business IntelligenceChap 4 ：分类 11分类：构建模型TrainingDataClassificationAlgorithmsIF rank = professorOR years 6THEN tenured = yes Classifier(Model)Principles and Applications

7、of Business IntelligenceChap 4 ：分类 12NAMERANKYEARS TENUREDTomAssistant Prof2noMerlisa Associate Prof7noGeorge Professor5yesJoseph Assistant Prof7yesTestingDataUnseen Data(Jeff, Professor, 4)ClassifierTenured?分类：测试分类模型并预测Principles and Applications of Business IntelligenceChap 4 ：分类 13If age=“30-40

8、” and income=High then credit_rating=excellent分类规则分类规则未知数据未知数据incomeincomeage?exfexfex40 30-40highlow，medlow，medhigh决策树决策树检验集检验集训练集训练集训练集训练集检验集检验集模型模型未知数据未知数据分类的概念与过程 Principles and Applications of Business IntelligenceChap 4 ：分类 14分类技术决策树（decision tree）朴素贝叶斯（Nave Bayes） K近邻（K nearest Neighbors）

9、基于关联的分类支持向量机（Support Vector Machines ）人工神经网络 Logistic Regression Principles and Applications of Business IntelligenceChap 4 ：分类 154.2 决策树分类方法Principles and Applications of Business IntelligenceChap 4 ：分类 164.2 决策树分类方法 4.2.1 决策树的构建过程 4.2.2 属性的类型及分裂条件 4.2.3 决策树的剪枝Principles and Applications of Bu

10、siness IntelligenceChap 4 ：分类 17决策树的概念决策树- 叶子节点：类别- 其余节点：测试属性- 树的层次根结点的层次为1 根结点的子女结点的层次为2 - 边：一种基于此结点属性的判断（分裂）条件根节点根节点叶子节点叶子节点双亲节点双亲节点子女节点子女节点决策树（decision tree）是一个类似于流程图的树结构。树的最顶层节点是根节点，根节点与每个内部节点表示数据集合在某个属性上的测试，每个分枝代表一个数据子集的输出，而每个树叶节点代表类或类分布。Principles and Applications of Business IntelligenceCh

11、ap 4 ：分类 18信息管理学院40yesnoexcellentfairstudentcredit_ratingbuys_computernofairnonoexcellentyesyesage？ nonoyesyescredit-rating？ student？ Principles and Applications of Business IntelligenceChap 4 ：分类 19信息管理学院categoricalcategoricalcontinuousclassRefundMarStTaxIncYESNONONOYesNoMarried Single, Divorced

12、80KSplitting Attributes训练数据训练数据模型模型: 决策树决策树决策树分类实例Principles and Applications of Business IntelligenceChap 4 ：分类 20信息管理学院应用决策树进行分类RefundMarStTaxIncYESNONONOYesNoMarried Single, Divorced 80KRefund Marital Status Taxable Income Cheat No Married 80K ? 10 测试数据测试数据Start from the root of tree.Principles

13、and Applications of Business IntelligenceChap 4 ：分类 21信息管理学院应用决策树进行分类RefundMarStTaxIncYESNONONOYesNoMarried Single, Divorced 80KRefund Marital Status Taxable Income Cheat No Married 80K ? 10 测试数据测试数据Principles and Applications of Business IntelligenceChap 4 ：分类 22信息管理学院应用决策树进行分类RefundMarStTaxIncYE

14、SNONONOYesNoMarried Single, Divorced 80KRefund Marital Status Taxable Income Cheat No Married 80K ? 10 测试数据测试数据Principles and Applications of Business IntelligenceChap 4 ：分类 23信息管理学院应用决策树进行分类RefundMarStTaxIncYESNONONOYesNoMarried Single, Divorced 80KRefund Marital Status Taxable Income Cheat No Mar

15、ried 80K ? 10 测试数据测试数据Principles and Applications of Business IntelligenceChap 4 ：分类 24信息管理学院应用决策树进行分类RefundMarStTaxIncYESNONONOYesNoMarried Single, Divorced 80KRefund Marital Status Taxable Income Cheat No Married 80K ? 10 测试数据测试数据Principles and Applications of Business IntelligenceChap 4 ：分类 25信

16、息管理学院应用决策树进行分类RefundMarStTaxIncYESNONONOYesNoMarried Single, Divorced 80KRefund Marital Status Taxable Income Cheat No Married 80K ? 10 测试数据测试数据Assign Cheat to “No”Principles and Applications of Business IntelligenceChap 4 ：分类 26信息管理学院构造决策树的方法采用自上而下递归的方式，如果：一个节点（训练集的子集合）上的数据都是属于同一个类别没有属性可以再用于对数据

17、进行分割就将其作为一个叶子节点。否则，根据某种策略选择一个分裂属性，并按该属性的取值把实例集合划分为若干个子集合。并继续递归处理各子集。可基于启发式规则或者统计的度量， ID3算法选用最大信息增益法选择分裂属性决策树的构建过程决策树生成算法分成两个步骤：树的生成起始时,数据都在根节点；采用递归方式进行数据分片树的修剪去掉一些可能是噪音或者异常的数据Principles and Applications of Business IntelligenceChap 4 ：分类 27决策树的构建过程主要步骤：训练数据集D，类别集合C=c1, c2, , ck1. 创建一个结点t，初始

18、情况下训练数据集中的所有样本与根结点关联，记为Dt。将t设为当前结点。2. 如果当前结点t所关联的数据集Dt中所有样本的类别相同（假设为ci）, 则将该结点标记为叶子节点，记录类别为ci，停止对该结点所关联的数据集的进一步分裂。接着处理其他非叶子节点。否则，进入下一步。3. 为数据集Dt选择分裂属性和分裂条件。根据分裂条件将数据集Dt分裂为m个子集，为结点t创建m个子女结点，将这m个数据集分别与之关联。依次将每个结点设为当前结点，转至步骤2进行处理，直至所有结点都标记为叶子结点。Principles and Applications of Business IntelligenceChap 4

19、：分类 28决策树的构建奥卡姆剃刀（Occams Razor）原理： “如无必要，勿增实体”（Entities should not be multiplied unnecessarily）- 一棵小的树的预测能力更好采用分而治之的思想，利用贪心策略从局部出发来构造一棵大小紧凑的决策树。 Hunt、ID3、C4.5、CARTPrinciples and Applications of Business IntelligenceChap 4 ：分类 29信息管理学院决策树的构建过程算法：Generate_decision_tree由给定的训练数据产生一棵决策树输入：训练样本sampl

20、es，由离散值属性表示；侯选属性的集合attribute_list输出：一棵决策树方法：创建节点N； if samples都在同一个类 C then 返回N作为叶节点，以类C标记； if attribute_list 为空 then 返回N作为叶节点，标记为samples中最普通的类：/多数表决Principles and Applications of Business IntelligenceChap 4 ：分类 30信息管理学院选择attribute_list 中具有最高信息增益的属性test_attribute；标记节点N为test_attribute ； for each t

21、est_attribute中的已知值ai /划分samples 由节点N长出一个条件为test_attribute =ai的分枝；设si是samples中的test_attribute =ai的样本集合;/划分 if si为空 then 加上一个树叶，标记为samples中最普通的类： else 加上一个由Generate_decision_tree（si，attribute - list -test - attribute）返回的节点决策树的构建过程Principles and Applications of Business IntelligenceChap 4 ：分类 31信息管理

22、学院决策树分类原理Procedure BuildTree(S，A) (S：训练样本集，A：分类属性集合）用样本集S创建节点N if A为空 then 返回N，标记为S中最普遍的类 if N pure then 返回N else for 每一个属性 A 估计该节点在A上的信息增益选出最佳的属性A*，将S分裂为Si ，N长出分支 for each Si if Si 为空 then 返回叶节点，标记为S中最普遍的类 else BuildTree(Si,A-A*)决策树的构建过程Principles and Applications of Business IntelligenceChap 4 ：

23、分类 32信息管理学院分裂属性选择在树的每个节点上使用信息增益度量选择测试属性。这种度量称作属性选择度量。选择具有最高信息增益的属性作为当前节点的测试属性。该属性使得对结果划分中的样本分类所需的信息量最小，并反映划分的最小随机性或不纯性。这种信息论的方法使得对一个对象分类所需的期望测试数目达到最小，并确保找到一棵简单（不必最简单）的树第二节决策树分类Principles and Applications of Business IntelligenceChap 4 ：分类 33信息管理学院gpatypehand_in paperslowfull-timenohighpart-timeye

24、sPrinciples and Applications of Business IntelligenceChap 4 ：分类 34信息管理学院highlowfemalemale属性属性 gender?属性属性 gpa?Principles and Applications of Business IntelligenceChap 4 ：分类 35决策树的构建过程分裂属性和分裂条件的选择分裂属性的选择通常利用类别纯度的衡量作为标准，信息熵和gini指数两种ABclasscount00-5001-5010+011+100A+-01B-01A+-01Principles and Appl

25、ications of Business IntelligenceChap 4 ：分类 36信息管理学院D是训练样本集合。假定类标号属性具有m个不同值ci (i=1，m)。设pi是D中任意元组属于类ci中的概率。对一个D中的元组分类所需的期望信息为（实现完全划分的信息需求、不纯性度量）：?信息熵:消除不确定性所需的信息量 (bit), 1比特的信息量指含有两个独立均等概率状态的事件不确定性被全部消除所需的信息。21logmiiiInfo Dpp -分裂属性选择Principles and Applications of Business IntelligenceChap 4 ：分类 37分

26、类属性的选择信息熵entropy(D) 数据集D及类别集合C=c1, c2, , ck count(ci)：类别ci在D中出现的次数， p(ci)：ci在D中出现的相对频率p(ci)=count(ci)/|D|- |D|代表D中的数据行数21entropy( )( )log( )kiiiDp cp c -Principles and Applications of Business IntelligenceChap 4 ：分类 38 若两个类别均匀分布(0.5,0.5) Entropy =?1-+-1+1-+1122S( )( )( )( )p ccp cc -E( )Entropy(S)

27、log(p)log(p)111信息熵若所有行属于同一类别: Entropy =?Principles and Applications of Business IntelligenceChap 4 ：分类 39信息管理学院Info(D)=1Info(D)=0Info(D)= 0.652122log1111=loglog=12222miiiInfo Dpp -2122log1155=loglog=0.656666miiiInfo Dpp -2122log=0 log 0 1 log 1=0miiiInfo Dpp - - Principles and Applications of Busi

28、ness IntelligenceChap 4 ：分类 40信息管理学院Info(D)=1Gini=0.5Error=0.5Info(D)=0Gini=0Error=0Info(D)= 0.65Gini=0.278Error=0.167 21Gini1miitp -_1 maxiiClassificationerrorp -Principles and Applications of Business IntelligenceChap 4 ：分类 41分类属性的选择按属性A分裂的信息熵entropy(D, A) 数据集D按照属性A的分裂条件分裂出的m个子数据集分别为D1, D2, Dm，

29、则entropy(D, A)综合这m个子数据集的信息熵就可以作为衡量一个属性A优劣的度 gain(D,A)：一个数据集D按属性A分裂前后信息熵的差值信息增益（information gain）gain(D,A)=entropy(D)-entropy(D,A) 1|entropy( , )()|kiiiDD Aentropy DDPrinciples and Applications of Business IntelligenceChap 4 ：分类 42 初始样本集合22122log=0.4log 0.40.6log 0.6=0.97iiiInfo Dpp -20gpaInfoD10.72

30、gpaInfoD 12550.361010gpagpagpaInfoDInfoDInfoD0.61gpaGain gpaInfo DInfoD-Principles and Applications of Business IntelligenceChap 4 ：分类 4322122log=0.4log 0.40.6log 0.6=0.97iiiInfo Dpp -11genInfoDgen20.92InfoDhighlowfemalemale20gpaInfoD10.72gpaInfoD 12460.951010gengengenInfoDInfoDInfoD 12550.361010gp

31、agpagpaInfoDInfoDInfoDgender?gpa ? 0.02genGain genInfo DInfoD- 0.61gpaGain gpaInfo DInfoD-接上例Principles and Applications of Business IntelligenceChap 4 ：分类 44信息管理学院highlowgpatypehand_in paperslowfull-timenohighpart-timeyestypehand_in papersfull-timenopart-timeyes？Principles and Applications of Busi

32、ness IntelligenceChap 4 ：分类 45信息管理学院第二节决策树分类incomestudentcredit_ratingbuys_computerhighnofairnohighnoexcellentnohighnofairyesmediumnofairyeslowyesfairyeslowyesexcellentnolowyesexcellentyesmediumnofairnolowyesfairyesmediumyesfairyesmediumyesexcellentyesmediumnoexcellentyeshighyesfairyesmediumnoexce

33、llentnoPrinciples and Applications of Business IntelligenceChap 4 ：分类 46信息管理学院buys_computernonoyesyesyesnoyesnoyesyesyesyesyesno229955loglog0.94014141414 -221logiiiInfo Dpp -Principles and Applications of Business IntelligenceChap 4 ：分类 47信息管理学院income studentCredit_ratingBuys_computerhighnofairnoh

34、ighnoexcellentnomediumnofairnolowyesfairyesmediumyesexcellentyesincome studentCredit_ratingBuys_computerhighnofairyeslowyesexcellentyesmediumnoexcellentyeshighyesfairyesincome studentCredit_ratingBuys_computermediumnofairyeslowyesfairyeslowyesexcellentnomediumyesfairyesmediumnoexcellentnoPrinciples

35、and Applications of Business IntelligenceChap 4 ：分类 48信息管理学院income studentCredit_ratingBuys_computerhighnofairnohighnoexcellentnomediumnofairnolowyesfairyesmediumyesexcellentyesincome studentCredit_ratingBuys_computerhighnofairyeslowyesexcellentyesmediumnoexcellentyeshighyesfairyesincome studentCre

36、dit_ratingBuys_computermediumnofairyeslowyesfairyeslowyesexcellentnomediumyesfairyesmediumnoexcellentno233231logageiiiInfoDpp -223322loglog5555 -211211logageiiiInfoDpp -222233loglog5555 -222221log0ageiiiInfoDpp - 1235451414140.694ageageageageInfoDInfoDInfoDInfoD 0.940 0.694 0.264ageGain ageInfo DInf

37、oD-Principles and Applications of Business IntelligenceChap 4 ：分类 49信息管理学院同样可以计算其它属性信息增益，得到age信息增益最大并选作分裂属性。用age来标记节点，并对每个属性值引出分枝?40yesage？Principles and Applications of Business IntelligenceChap 4 ：分类 5050基尼指数（基尼指数（Gini Index）如果集合T分成两部分 N1 and N2 。那么这个分割的Gini就是提供最小Ginisplit 就被选择作为分割的标准.)()()(22

38、11TginiNNTginiNNTginisplitGini指标在指标在CART中使用，并考虑每个属性的二元划分中使用，并考虑每个属性的二元划分 21Gini1miitp -Principles and Applications of Business IntelligenceChap 4 ：分类 51其他分裂方法 CART算法：限定每次对数据集的分裂都是二分的若属性有个不同的取值a、b和c，则组合有3种情况 a和b，c、 a，b和c及 a，c和bPrinciples and Applications of Business IntelligenceChap 4 ：分类 52信息增益的调

39、整-增益比率以属性“年龄”为例，分成的3个数据集的大小分别为4、5、5 属性“年龄”的增益比率则为0.24/1.58=0.1521( , )( , )gain_( , A)=|( , )log ()|miiigain D Again D Aratio DDDsplitInfo D ADD-222445555( , )logloglog1.58141414141414splitInfo D A -增益率（增益率（gain ratio）在）在C4.5中使用中使用Principles and Applications of Business IntelligenceChap 4 ：分类 544.

40、2.2 属性的类型及分裂条件为了减少信息增益，需要确定根据属性A对数据集的分裂方法属性的分类- 定量（quantitative）和定性（qualitative）定量属性又称为数值（numerical）属性，每个取值为数值，既可以比较大小，又可以进行数值运算，如加、减、乘、除等。如：“年收入” 定性属性又称为类别（categorical）属性，其取值不具有数的特点。定性属性又可以分为标称（nominal）属性和序数（ordinal）属性- 属性从另一个角度又可以分为离散（discrete）属性和连续（continuous）属性Principles and Applications of Bu

41、siness IntelligenceChap 4 ：分类 55定性属性的分裂条件一个数据集D若根据一个定性属性A进行分裂，假设A在D中的取值由集合VA表示，VA=a1，a1，am ，则分裂条件为A=ai 例如- 若按属性“婚姻”进行数据集分裂，则分裂条件为婚姻=单身、婚姻=已婚和婚姻=离异- entropy(D, 年龄)=0.69，entropy(D,性别)=0.89312123222222|entropy( ,)()()()|411334222262244=(loglog)(loglog)(loglog)1444441444441466660.91DDDDentropy Dentrop

42、y Dentropy DDDD-婚姻Principles and Applications of Business IntelligenceChap 4 ：分类 56按属性“婚姻”进行数据集分裂客户编号年龄性别年收入（万）婚姻豪华车250女80单身否1250男50单身是客户编号年龄性别年收入（万）婚姻豪华车130女86已婚否430女75已婚否530-50女82已婚是630-50男91已婚是客户编号年龄性别年收入（万）婚姻豪华车350女96离异否1350女80离异否1450男92离异是Principles and Applications of Business IntelligenceCh

43、ap 4 ：分类 57定量属性的分裂条件（1）属性及分类属性抽出并按年收入升序排序对于定量属性A，设A在数据集D中有m个不同的取值，a1a1 ai ，其中1i 0) 0，(i (i1 1，n)n)，则对任何事件则对任何事件B B S S，有有 ),.,1( ,)|()()|()()|(1njABPAPABPAPBAPniiijjj称为贝叶斯公式贝叶斯公式。Principles and Applications of Business IntelligenceChap 4 ：分类 68信息管理学院朴素贝叶斯分类：假定一个属性值对给定类的影响独立于其他属性的值，这一假定称作类条件独立。

44、做此假定是为了简化计算，并在此意义下被称为“朴素的” 贝叶斯信念网络：是图形模型，可以表示属性子集间的依赖贝叶斯分类主要包括：Principles and Applications of Business IntelligenceChap 4 ：分类 69朴素贝叶斯分类给定一个样本变量X的一个观察到的样本x，由n个属性A1, A2, , An描述，其属性取值分别为x1, x2, xn，即x=(x1, x2, xn)，要判断其所属的类别，即类别属性Y的取值， C=c1, c2, , ck 贝叶斯定理：12argmax(| )argmax(|,.)iicCicCincP cxP cx xx

45、( |) ( )(| )( )iiiP x c p cP cxp xargmax(| )argmax( |) ( )iicCicCiicP cxP x c P cPrinciples and Applications of Business IntelligenceChap 4 ：分类 70朴素贝叶斯分类假设给定类别变量取值一定的情况下，各个属性取值之间互相独立，则 argmax(| )argmax( |) ( )iicCicCiicP cxP x c P c).).)iiiiiiiiniP xxxcP xxxcP xxcP xxxcP xxxcP xxcP xxxcP xxxcP xxc

46、P xc12n12n2n12n23ni3n12n23n3n(, . |) (|, .,)(, .| (|, . ,)(|, . ,(, . | (|, . ,)(|, . ,(, . |(|).)iniiinijijP x xxcP xc P xc P xcP xcP xc12n1231(,. |)(|) (|(|(|(|)1argmax(| )argmax(|) ( )iincCicCjiijcP cxP xc P cPrinciples and Applications of Business IntelligenceChap 4 ：分类 71概率计算 P(ci)可以用训练数据集中类别c

47、i出现的次数占训练数据集总行数的比例来近似对于定性属性，P(xj|ci)可以通过计算类别为ci的样本中属性Aj取值为xj的样本所占比例来近似对于定量属性，有两种方法。- 一种方法是先将该属性取值离散化- 假设变量服从某种概率分布，通过训练数据集估计分布的参数1argmax(| )argmax(|) ( )iincCicCjiijcP cxP xc P cPrinciples and Applications of Business IntelligenceChap 4 ：分类 72概率计算对于定量属性，有两种方法。- 假设变量服从正态分布N（，2）。计算P(xj|ci)时，在类别为ci

48、的样本中为属性Aj（xj是属性Aj的取值）的取值计算均值ij和标准差ij，然后利用下面的公式进行近似估计1argmax(| )argmax(|) ( )iincCicCjiijcP cxP xc P c22()21(|)2jijijxujiijP xce-x=（年龄30，男，年收入30万，单身），要预测其是否购买豪华车客户编号年龄性别年收入（万）婚姻豪华车130女86已婚否230男65单身否330男90离异否450女96离异否1150女80单身否1250男50单身是1350女80离异否1450男92离异是P(是是)=5/14, P(否否)=9/14P(年龄年龄=30-50|是是)=3/5P(年

50、22(30 70)2 25(|)(30 |)(=|)(=|)(=30 |)()43319 =99914225 =0.00014PxPPPPPe-否年龄否性别男否婚姻单身否年收入否否Principles and Applications of Business IntelligenceChap 4 ：分类 74平滑处理在计算P(是|x)时，由于年龄 P(X|Cj) P(Cj)，1jm，j i，换言之，X被指派到其P(X|Ci) P(Ci)最大的类CiPrinciples and Applications of Business IntelligenceChap 4 ：分类 79Da

展开阅读全文

商务智能课件：第4章 分类.ppt

商务智能课件：第4章分类.ppt