Data mining courseware: chap4-basic-classification.ppt

Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation
Lecture Notes for Chapter 4, Introduction to Data Mining, by Tan, Steinbach, Kumar

Classification: Definition
- Given a collection of records (the training set), where each record contains a set of attributes, one of which is the class.
- Find a model for the class attribute as a function of the values of the other attributes.
- Goal: previously unseen records should be assigned a class as accurately as possible.
  - A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets: the training set is used to build the model and the test set is used to validate it.

Illustrating Classification Task
[Figure: a learning algorithm induces a model from the training set; the model is then applied to the test set]
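
The split-then-validate workflow above maps directly onto standard tooling. A minimal sketch using scikit-learn, assuming a synthetic dataset and a 70/30 train/test split (both illustrative choices, not from the slides):

```python
# Build a model on a training set, then validate it on a held-out test set.
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))             # 200 records, 4 attributes
y = (X[:, 0] + X[:, 1] > 0).astype(int)   # class attribute derived from two attributes

# Divide the given data set into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)   # build the model on the training set
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```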

Examples of Classification Task
- Predicting tumor cells as benign or malignant
- Classifying credit card transactions as legitimate or fraudulent
- Classifying secondary structures of protein as alpha-helix, beta-sheet, or random coil
- Categorizing news stories as finance, weather, entertainment, sports, etc.

Classification Techniques
- Decision Tree based Methods
- Rule-based Methods
- Memory based reasoning
- Neural Networks
- Naïve Bayes and Bayesian Belief Networks
- Support Vector Machines

Example of a Decision Tree

Training data (categorical: Refund, Marital Status; continuous: Taxable Income; class: Cheat):

  Tid  Refund  Marital Status  Taxable Income  Cheat
  1    Yes     Single          125K            No
  2    No      Married         100K            No
  3    No      Single          70K             No
  4    Yes     Married         120K            No
  5    No      Divorced        95K             Yes
  6    No      Married         60K             No
  7    Yes     Divorced        220K            No
  8    No      Single          85K             Yes
  9    No      Married         75K             No
  10   No      Single          90K             Yes

Model: a decision tree whose splitting attributes are tested from the root down:
- Refund: Yes -> leaf NO; No -> test MarSt
- MarSt: Married -> leaf NO; Single, Divorced -> test TaxInc
- TaxInc: < 80K -> leaf NO; >= 80K -> leaf YES

Another Example of Decision Tree
- The same training data also fits a tree that tests MarSt first (Married -> NO), then Refund (Yes -> NO), then TaxInc (< 80K -> NO, >= 80K -> YES).
- There could be more than one tree that fits the same data!

Decision Tree Classification Task
[Figure: a tree induction algorithm learns a decision tree from the training set; the deduced model is then applied to the test set]

Apply Model to Test Data
- Test record: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
- Start from the root of the tree: Refund is No, so follow the No branch to the MarSt test.
- MarSt is Married, so follow the Married branch, which reaches a leaf labeled NO.
- Assign Cheat to "No".
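
The traversal just described can be written out as nested conditionals. A small sketch of the slides' Refund -> MarSt -> TaxInc tree (classify is a hypothetical helper; the 80K threshold comes from the tree above):

```python
def classify(record):
    """Apply the slides' decision tree: Refund -> MarSt -> TaxInc."""
    if record["Refund"] == "Yes":
        return "No"                       # leaf: NO
    if record["MaritalStatus"] == "Married":
        return "No"                       # leaf: NO
    # Single or Divorced: test Taxable Income against the 80K threshold
    return "Yes" if record["TaxableIncome"] >= 80 else "No"

test_record = {"Refund": "No", "MaritalStatus": "Married", "TaxableIncome": 80}
print(classify(test_record))  # -> "No": Cheat is assigned to "No"
```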

Decision Tree Classification Task
[Figure: the same induction/deduction pipeline, with the learned decision tree as the model]

Decision Tree Induction
- Many algorithms:
  - Hunt's Algorithm (one of the earliest)
  - CART
  - ID3, C4.5
  - SLIQ, SPRINT

General Structure of Hunt's Algorithm
- Let Dt be the set of training records that reach a node t.
- General procedure:
  - If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt.
  - If Dt is an empty set, then t is a leaf node labeled by the default class, yd.
  - If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, then recursively apply the procedure to each subset.
- A minimal code sketch of this procedure is given below.
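
A minimal recursive sketch of Hunt's procedure, assuming records are (attribute-dict, label) pairs and that the attribute test is simply the next unused attribute; both are illustrative simplifications, since the slides leave the representation and the choice of test open:

```python
from collections import Counter

def hunt(records, default_class, split_candidates):
    """records: list of (attrs: dict, label) pairs; returns a nested tree of tuples."""
    if not records:                        # empty Dt: leaf labeled with the default class yd
        return ("leaf", default_class)
    labels = {label for _, label in records}
    if len(labels) == 1:                   # all records in the same class yt: leaf labeled yt
        return ("leaf", labels.pop())
    majority = Counter(label for _, label in records).most_common(1)[0][0]
    if not split_candidates:               # no tests left: fall back to the majority class
        return ("leaf", majority)
    attr = split_candidates[0]             # stand-in for "choose an attribute test"
    children = {}
    for v in {attrs[attr] for attrs, _ in records}:
        # Split into smaller subsets and recursively apply the procedure to each.
        subset = [(a, c) for a, c in records if a[attr] == v]
        children[v] = hunt(subset, majority, split_candidates[1:])
    return ("split", attr, children)
```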

Hunt's Algorithm (walkthrough on the Cheat data)
- Start with a single node predicting the majority class: Don't Cheat.
- Split on Refund: Yes -> Don't Cheat; No -> still a mix of classes.
- Split the No branch on Marital Status: Married -> Don't Cheat; Single, Divorced -> still mixed.
- Split on Taxable Income: < 80K -> Don't Cheat; >= 80K -> Cheat.

Tree Induction
- Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
- Issues:
  - Determine how to split the records
    - How to specify the attribute test condition?
    - How to determine the best split?
  - Determine when to stop splitting

How to Specify Test Condition?
- Depends on attribute types: Nominal, Ordinal, Continuous.
- Depends on number of ways to split: 2-way split, Multi-way split.

Splitting Based on Nominal Attributes
- Multi-way split: use as many partitions as there are distinct values.
  - CarType: {Family}, {Sports}, {Luxury}
- Binary split: divides the values into two subsets; need to find the optimal partitioning.
  - CarType: {Sports, Luxury} vs {Family}, or {Family, Luxury} vs {Sports}

Splitting Based on Ordinal Attributes
- Multi-way split: use as many partitions as there are distinct values.
  - Size: {Small}, {Medium}, {Large}
- Binary split: divides the values into two subsets; need to find the optimal partitioning.
  - Size: {Medium, Large} vs {Small}, or {Small, Medium} vs {Large}
- What about the split {Small, Large} vs {Medium}? It groups non-adjacent values and so does not preserve the order of the attribute.

Splitting Based on Continuous Attributes
- Different ways of handling:
  - Discretization to form an ordinal categorical attribute
    - Static: discretize once at the beginning
    - Dynamic: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering
  - Binary decision: (A < v) or (A >= v)
    - Consider all possible splits and find the best cut
    - Can be more compute intensive
[Figure: a binary split on Taxable Income versus a multi-way split into income ranges]

Tree Induction (recap)
- Greedy strategy: split the records based on an attribute test that optimizes a certain criterion; the open issues are how to split and when to stop.

How to determine the Best Split
- Before splitting: 10 records of class 0 and 10 records of class 1.
- Which test condition is the best?

How to determine the Best Split (cont.)
- Greedy approach: nodes with a homogeneous class distribution are preferred.
- Need a measure of node impurity:
  - Non-homogeneous: high degree of impurity
  - Homogeneous: low degree of impurity

Measures of Node Impurity
- Gini Index
- Entropy
- Misclassification error

How to Find the Best Split
- Compute the impurity M0 of the parent node before splitting.
- Splitting on attribute A produces nodes N1 and N2 with impurities M1 and M2, which combine (weighted by node size) into M12; splitting on attribute B produces N3 and N4, combining into M34.
- Gain = M0 - M12 versus M0 - M34: choose the attribute test with the larger gain.

Measure of Impurity: GINI
- Gini Index for a given node t:

    GINI(t) = 1 - \sum_j [p(j|t)]^2

  (NOTE: p(j|t) is the relative frequency of class j at node t.)
- Maximum (1 - 1/nc, for nc classes) when records are equally distributed among all classes, implying the least interesting information.
- Minimum (0.0) when all records belong to one class, implying the most interesting information.

  C1 = 0, C2 = 6: Gini = 0.000
  C1 = 1, C2 = 5: Gini = 0.278
  C1 = 2, C2 = 4: Gini = 0.444
  C1 = 3, C2 = 3: Gini = 0.500

Examples for computing GINI
- P(C1) = 0/6 = 0, P(C2) = 6/6 = 1:
  Gini = 1 - P(C1)^2 - P(C2)^2 = 1 - 0 - 1 = 0
- P(C1) = 1/6, P(C2) = 5/6:
  Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278
- P(C1) = 2/6, P(C2) = 4/6:
  Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444
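
The table's values follow mechanically from the formula; a short sketch with a hypothetical gini helper:

```python
def gini(counts):
    """GINI(t) = 1 - sum_j p(j|t)^2, from per-class record counts at node t."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

print(round(gini([0, 6]), 3))  # 0.0
print(round(gini([1, 5]), 3))  # 0.278
print(round(gini([2, 4]), 3))  # 0.444
print(round(gini([3, 3]), 3))  # 0.5
```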

Splitting Based on GINI
- Used in CART, SLIQ, SPRINT.
- When a node p is split into k partitions (children), the quality of the split is computed as

    GINI_split = \sum_{i=1}^{k} (n_i / n) GINI(i)

  where n_i = number of records at child i and n = number of records at node p.

Binary Attributes: Computing GINI Index
- Splits into two partitions.
- Effect of weighting partitions: larger and purer partitions are sought.
- Example: the parent has C1 = 6, C2 = 6, Gini = 0.500; splitting on B gives N1 (C1 = 5, C2 = 2) and N2 (C1 = 1, C2 = 4):
    Gini(N1) = 1 - (5/7)^2 - (2/7)^2 = 0.408
    Gini(N2) = 1 - (1/5)^2 - (4/5)^2 = 0.320
    Gini(Children) = 7/12 * 0.408 + 5/12 * 0.320 = 0.371
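
The weighted combination can be checked numerically; a sketch reproducing the binary example above (function names are illustrative):

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(partitions):
    """GINI_split = sum_i (n_i / n) * GINI(i), over per-child class-count lists."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * gini(p) for p in partitions)

# Parent C1=6, C2=6 split into N1 (5, 2) and N2 (1, 4):
print(round(gini([6, 6]), 3))                  # 0.5 (parent)
print(round(gini_split([[5, 2], [1, 4]]), 3))  # 0.371 (children)
```

Note how the n_i/n weighting pulls the combined score toward the impurity of the larger partition, which is why larger, purer partitions win.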

Categorical Attributes: Computing Gini Index
- For each distinct value, gather counts for each class in the dataset.
- Use the count matrix to make decisions.

  Multi-way split:
             Family  Sports  Luxury
    C1       1       2       1
    C2       4       1       1
    Gini = 0.393

  Two-way split (find the best partition of values):
             {Sports, Luxury} | {Family}          {Sports} | {Family, Luxury}
    C1       3                | 1                 2        | 2
    C2       2                | 4                 1        | 5
    Gini = 0.400                                  Gini = 0.419

Continuous Attributes: Computing Gini Index
- Use binary decisions based on one value.
- Several choices for the splitting value: the number of possible splitting values equals the number of distinct values.
- Each splitting value v has a count matrix associated with it: class counts in each of the partitions, A < v and A >= v.
- Simple method to choose the best v: for each v, scan the database to gather the count matrix and compute its Gini index.
  - Computationally inefficient! Repetition of work.
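
Finding the best two-way grouping amounts to scoring every partition of the value set against the count matrix. A sketch over the CarType counts above (helper names are illustrative):

```python
from itertools import combinations

# Class counts per CarType value, read off the count matrix above: {value: [C1, C2]}
counts = {"Family": [1, 4], "Sports": [2, 1], "Luxury": [1, 1]}

def gini(c):
    n = sum(c)
    return 1.0 - sum((x / n) ** 2 for x in c)

def split_gini(groups):
    """Weighted Gini of a split; each group is a list of values whose counts are pooled."""
    pooled = [[sum(counts[v][j] for v in g) for j in range(2)] for g in groups]
    n = sum(sum(p) for p in pooled)
    return sum(sum(p) / n * gini(p) for p in pooled)

values = list(counts)
print(round(split_gini([[v] for v in values]), 3))  # multi-way split -> 0.393
for left in combinations(values, 1):                # every 2-way partition of 3 values has a singleton side
    right = [v for v in values if v not in left]
    print(set(left), "vs", set(right), "->", round(split_gini([list(left), right]), 3))
# {'Family'} vs {'Sports', 'Luxury'} -> 0.4; {'Sports'} vs {'Family', 'Luxury'} -> 0.419; ...
```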

Continuous Attributes: Computing Gini Index (cont.)
- For efficient computation, for each attribute:
  - Sort the attribute on values.
  - Linearly scan these values, each time updating the count matrix and computing the Gini index.
  - Choose the split position that has the least Gini index.

  Sorted values (Taxable Income): 60   70   75   85   90   95   100  120  125  220
  Class (Cheat):                  No   No   No   Yes  Yes  Yes  No   No   No   No
  Split positions:            55    65    72    80    87    92    97    110   122   172   230
  Gini:                       0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420

- The best split is at v = 97, with Gini = 0.300.
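
The sorted scan can be sketched directly from the table, maintaining the count matrix incrementally as the slide describes (values, labels, and candidate positions are copied from the table; helper names are illustrative):

```python
def gini(c):
    n = sum(c)
    return 1.0 - sum((x / n) ** 2 for x in c) if n else 0.0

income = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]           # already sorted
cheat  = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
candidates = [55, 65, 72, 80, 87, 92, 97, 110, 122, 172, 230]   # the slide's split positions

n = len(income)
total = [cheat.count("Yes"), cheat.count("No")]
left, i, best = [0, 0], 0, None
for v in candidates:
    # Advance the scan pointer, updating the count matrix incrementally.
    while i < n and income[i] <= v:
        left[0 if cheat[i] == "Yes" else 1] += 1
        i += 1
    right = [total[0] - left[0], total[1] - left[1]]
    g = (sum(left) * gini(left) + sum(right) * gini(right)) / n
    if best is None or g < best[1]:
        best = (v, round(g, 3))
print(best)  # (97, 0.3): the lowest Gini among the candidate positions
```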

Alternative Splitting Criteria based on INFO
- Entropy at a given node t:

    Entropy(t) = - \sum_j p(j|t) \log_2 p(j|t)

  (NOTE: p(j|t) is the relative frequency of class j at node t.)
- Measures the homogeneity of a node.
  - Maximum (\log_2 nc) when records are equally distributed among all classes, implying least information.
  - Minimum (0.0) when all records belong to one class, implying most information.
- Entropy-based computations are similar to the GINI index computations.

Examples for computing Entropy
- P(C1) = 0/6 = 0, P(C2) = 6/6 = 1:
  Entropy = -0 log 0 - 1 log 1 = -0 - 0 = 0
- P(C1) = 1/6, P(C2) = 5/6:
  Entropy = -(1/6) log2(1/6) - (5/6) log2(5/6) = 0.65
- P(C1) = 2/6, P(C2) = 4/6:
  Entropy = -(2/6) log2(2/6) - (4/6) log2(4/6) = 0.92
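
A short sketch reproducing the entropy examples (the entropy helper is hypothetical; 0 log 0 is taken as 0, as the first example assumes):

```python
from math import log2

def entropy(counts):
    """Entropy(t) = -sum_j p(j|t) * log2 p(j|t), with 0*log(0) treated as 0."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

print(round(entropy([0, 6]), 2))  # 0.0
print(round(entropy([1, 5]), 2))  # 0.65
print(round(entropy([2, 4]), 2))  # 0.92
```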

Splitting Based on INFO: Information Gain
- When a parent node p with n records is split into k partitions, with n_i records in partition i:

    GAIN_split = Entropy(p) - \sum_{i=1}^{k} (n_i / n) Entropy(i)

- Measures the reduction in entropy achieved by the split. Choose the split that achieves the most reduction (maximizes GAIN).
- Used in ID3 and C4.5.
- Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.

Splitting Based on INFO: Gain Ratio
- With the same notation:

    GainRATIO_split = GAIN_split / SplitINFO,  where  SplitINFO = - \sum_{i=1}^{k} (n_i / n) \log_2 (n_i / n)

- Adjusts Information Gain by the entropy of the partitioning (SplitINFO). Higher-entropy partitioning (a large number of small partitions) is penalized!
- Used in C4.5; designed to overcome the disadvantage of Information Gain.
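
Both criteria can be checked on the binary split used earlier (parent 6/6 into 5/2 and 1/4); the helper names and the example split are illustrative:

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

def info_gain(parent, partitions):
    """GAIN_split = Entropy(p) - sum_i (n_i/n) * Entropy(i)."""
    n = sum(parent)
    return entropy(parent) - sum(sum(p) / n * entropy(p) for p in partitions)

def gain_ratio(parent, partitions):
    """GainRATIO_split = GAIN_split / SplitINFO."""
    n = sum(parent)
    split_info = -sum(sum(p) / n * log2(sum(p) / n) for p in partitions)
    return info_gain(parent, partitions) / split_info

# Parent (6, 6) split into (5, 2) and (1, 4):
print(round(info_gain([6, 6], [[5, 2], [1, 4]]), 3))   # ~0.196
print(round(gain_ratio([6, 6], [[5, 2], [1, 4]]), 3))  # ~0.2
```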

Splitting Criteria based on Classification Error
- Classification error at a node t:

    Error(t) = 1 - \max_i P(i|t)

- Measures the misclassification error made by a node.
  - Maximum (1 - 1/nc) when records are equally distributed among all classes, implying the least interesting information.
  - Minimum (0.0) when all records belong to one class, implying the most interesting information.

Examples for Computing Error
- P(C1) = 0/6 = 0, P(C2) = 6/6 = 1:
  Error = 1 - max(0, 1) = 1 - 1 = 0
- P(C1) = 1/6, P(C2) = 5/6:
  Error = 1 - max(1/6, 5/6) = 1 - 5/6 = 1/6
- P(C1) = 2/6, P(C2) = 4/6:
  Error = 1 - max(2/6, 4/6) = 1 - 4/6 = 1/3

Comparison among Splitting Criteria
[Figure: for a 2-class problem, Gini, Entropy, and Misclassification Error plotted against p, the fraction of records in one class; all three are maximal at p = 0.5 and zero at p = 0 and p = 1]
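
The figure's comparison can be reproduced numerically for a 2-class problem, writing each measure as a function of p, the fraction of records in class C1 (function names are illustrative):

```python
from math import log2

def gini(p):    return 1 - p**2 - (1 - p)**2
def entropy(p): return 0.0 if p in (0, 1) else -p * log2(p) - (1 - p) * log2(1 - p)
def error(p):   return 1 - max(p, 1 - p)

for p in [0.0, 0.1, 0.3, 0.5, 0.7, 0.9, 1.0]:
    print(f"p={p:.1f}  gini={gini(p):.3f}  entropy={entropy(p):.3f}  error={error(p):.3f}")
# All three are 0 at p = 0 or 1 and peak at p = 0.5 (entropy at 1.0, gini and error at 0.5).
```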

Misclassification Error vs Gini
- Parent: C1 = 7, C2 = 3, Gini = 0.42. Splitting on A gives N1 (C1 = 3, C2 = 0) and N2 (C1 = 4, C2 = 3):
    Gini(N1) = 1 - (3/3)^2 - (0/3)^2 = 0
    Gini(N2) = 1 - (4/7)^2 - (3/7)^2 = 0.489
    Gini(Children) = 3/10 * 0 + 7/10 * 0.489 = 0.342
- Gini improves! Misclassification error, by contrast, stays the same: the parent's error is 1 - 7/10 = 0.3, and the children's weighted error is 3/10 * 0 + 7/10 * (3/7) = 0.3.

Tree Induction (recap)
- Greedy strategy: split the records based on an attribute test that optimizes a certain criterion; the open issues are how to split and when to stop.

Stopping Criteria for Tree Induction
- Stop expanding a node when all the records belong to the same class.
- Stop expanding a node when all the records have similar attribute values.
- Early termination (to be discussed later).

Decision Tree Based Classification
- Advantages: Inexpensive to construct
