Data Mining Classification: Basic Concepts, Decision Trees, and Model Evaluation
Lecture Notes for Chapter 4, Introduction to Data Mining, by Tan, Steinbach, Kumar

Classification: Definition
- Given a collection of records (the training set): each record contains a set of attributes, and one of the attributes is the class.
- Find a model for the class attribute as a function of the values of the other attributes.
- Goal: previously unseen records should be assigned a class as accurately as possible.
  - A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets: the training set is used to build the model and the test set is used to validate it.

Illustrating Classification Task
(Figure: a learning algorithm induces a model from the training set; the model is then applied to the test set to assign class labels.)

Examples of Classification Tasks
- Predicting tumor cells as benign or malignant
- Classifying credit card transactions as legitimate or fraudulent
- Classifying secondary structures of proteins as alpha-helix, beta-sheet, or random coil
- Categorizing news stories as finance, weather, entertainment, sports, etc.

Classification Techniques
- Decision tree based methods
- Rule-based methods
- Memory-based reasoning
- Neural networks
- Naïve Bayes and Bayesian belief networks
- Support Vector Machines
Example of a Decision Tree

Training data (Refund: categorical, Marital Status: categorical, Taxable Income: continuous, Cheat: class):

  Tid  Refund  Marital Status  Taxable Income  Cheat
  1    Yes     Single          125K            No
  2    No      Married         100K            No
  3    No      Single          70K             No
  4    Yes     Married         120K            No
  5    No      Divorced        95K             Yes
  6    No      Married         60K             No
  7    Yes     Divorced        220K            No
  8    No      Single          85K             Yes
  9    No      Married         75K             No
  10   No      Single          90K             Yes

Model: a decision tree with Refund, MarSt, and TaxInc as splitting attributes:

  Refund?
   - Yes -> NO
   - No  -> MarSt?
      - Married           -> NO
      - Single, Divorced  -> TaxInc?
         - < 80K  -> NO
         - >= 80K -> YES

Another Example of Decision Tree

The same training data also fits a tree that splits on MarSt first:

  MarSt?
   - Married          -> NO
   - Single, Divorced -> Refund?
      - Yes -> NO
      - No  -> TaxInc?
         - < 80K  -> NO
         - >= 80K -> YES

There could be more than one tree that fits the same data!
Decision Tree Classification Task
(Figure: a tree-induction algorithm learns a decision tree from the training set; the tree is then applied to the test set — induction followed by deduction.)

Apply Model to Test Data
Test record:

  Refund  Marital Status  Taxable Income  Cheat
  No      Married         80K             ?

Start from the root of the tree and, at each internal node, follow the branch that matches the record:
1. Root: Refund? The record has Refund = No, so take the "No" branch to the MarSt node.
2. MarSt? The record has Marital Status = Married, so take the "Married" branch.
3. That branch ends in a leaf labeled NO, so assign Cheat = "No".
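To make the traversal concrete, here is a minimal sketch of applying a decision tree to a record. The nested-dict tree encoding and the names (classify, tree, "threshold") are illustrative assumptions; the slides only show the tree graphically.

```python
# Minimal sketch: apply a decision tree to a test record.
# The nested-dict encoding and all names are illustrative assumptions.

def classify(node, record):
    """Walk the tree from the root until a leaf (a plain string) is reached."""
    while isinstance(node, dict):
        value = record[node["attr"]]         # value of the splitting attribute
        if "threshold" in node:              # continuous attribute: binary test
            branch = "lt" if value < node["threshold"] else "ge"
        else:                                # categorical attribute
            branch = value
        node = node["branches"][branch]
    return node                              # leaf label, e.g. "No"

taxinc = {"attr": "TaxInc", "threshold": 80,
          "branches": {"lt": "No", "ge": "Yes"}}

# The example tree from the slides: Refund -> MarSt -> TaxInc.
tree = {
    "attr": "Refund",
    "branches": {
        "Yes": "No",
        "No": {"attr": "MarSt",
               "branches": {"Married": "No",
                            "Single": taxinc,
                            "Divorced": taxinc}},
    },
}

record = {"Refund": "No", "MarSt": "Married", "TaxInc": 80}
print(classify(tree, record))  # -> "No", matching the walkthrough above
```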
Decision Tree Induction
- Many algorithms:
  - Hunt's Algorithm (one of the earliest)
  - CART
  - ID3, C4.5
  - SLIQ, SPRINT

General Structure of Hunt's Algorithm
- Let Dt be the set of training records that reach a node t.
- General procedure:
  - If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt.
  - If Dt is an empty set, then t is a leaf node labeled by the default class, yd.
  - If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, then recursively apply the procedure to each subset.
(A runnable sketch of this recursion follows the Tree Induction outline below.)

Hunt's Algorithm (on the training data above)
1. Start with a single leaf labeled with the majority class: Don't Cheat.
2. Split on Refund: Yes -> Don't Cheat; No -> Don't Cheat (the "No" subset still contains both classes).
3. Refine the "No" branch by splitting on Marital Status: Married -> Don't Cheat; Single, Divorced -> Cheat (this subset still contains both classes).
4. Refine the Single/Divorced branch by splitting on Taxable Income: < 80K -> Don't Cheat; >= 80K -> Cheat.

Tree Induction
- Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
- Issues:
  - Determine how to split the records:
    - How to specify the attribute test condition?
    - How to determine the best split?
  - Determine when to stop splitting.
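Below is a minimal sketch of the Hunt recursion, assuming records are dicts with a "class" key and a pluggable choose_split that returns an attribute test (or None when no test helps); all names are illustrative, not from the slides.

```python
# Sketch of Hunt's algorithm. Records are dicts with a "class" key;
# choose_split is supplied externally. Names are illustrative.
from collections import Counter

def hunt(records, choose_split, default=None):
    if not records:                      # empty set -> leaf with default class
        return default
    classes = Counter(r["class"] for r in records)
    majority = classes.most_common(1)[0][0]
    if len(classes) == 1:                # all one class -> leaf
        return majority
    split = choose_split(records)        # pick an attribute test
    if split is None:                    # no useful test left -> majority leaf
        return majority
    test, branch_keys = split            # test: record -> branch key
    groups = {}
    for r in records:
        groups.setdefault(test(r), []).append(r)
    return {"test": test,
            "branches": {k: hunt(groups.get(k, []), choose_split,
                                 default=majority)
                         for k in branch_keys}}

def choose_split(records):
    # Toy chooser: split on Refund only if it actually separates the records.
    keys = {r["Refund"] for r in records}
    if len(keys) > 1:
        return (lambda r: r["Refund"], sorted(keys))
    return None

data = [{"Refund": "Yes", "class": "No"},
        {"Refund": "No", "class": "Yes"},
        {"Refund": "No", "class": "No"}]
print(hunt(data, choose_split))
```

A production chooser would score candidate tests with one of the impurity measures introduced below and return None when no test improves on the node's impurity.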
How to Specify the Test Condition?
- Depends on attribute type:
  - Nominal
  - Ordinal
  - Continuous
- Depends on the number of ways to split:
  - 2-way split
  - Multi-way split
Splitting Based on Nominal Attributes
- Multi-way split: use as many partitions as there are distinct values.
  Example: CarType -> {Family}, {Sports}, {Luxury}
- Binary split: divide the values into two subsets; need to find the optimal partitioning.
  Example: CarType -> {Sports, Luxury} vs. {Family}, or {Family, Luxury} vs. {Sports}

Splitting Based on Ordinal Attributes
- Multi-way split: use as many partitions as there are distinct values.
  Example: Size -> {Small}, {Medium}, {Large}
- Binary split: divide the values into two subsets; need to find the optimal partitioning.
  Example: Size -> {Medium, Large} vs. {Small}, or {Small, Medium} vs. {Large}
- What about the split {Small, Large} vs. {Medium}? It violates the order property of the attribute.

Splitting Based on Continuous Attributes
- Different ways of handling:
  - Discretization to form an ordinal categorical attribute:
    - Static: discretize once at the beginning.
    - Dynamic: ranges can be found by equal-interval bucketing, equal-frequency bucketing (percentiles), or clustering.
  - Binary decision: (A < v) or (A >= v):
    - Consider all possible splits and find the best cut.
    - Can be more compute intensive.
(Figure: examples of a binary split and a multi-way split on a continuous attribute.)
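As an illustration of how a learner can generate the candidate test conditions just described, here is a small sketch that enumerates binary partitions of a nominal attribute and threshold cuts of a continuous one. The function names and the midpoint-threshold convention are assumptions for illustration.

```python
# Sketch: enumerate candidate binary test conditions.
# Names and conventions (e.g., midpoint thresholds) are illustrative.
from itertools import combinations

def nominal_binary_splits(values):
    """All ways to divide nominal values into two non-empty subsets."""
    values = sorted(values)
    splits = []
    # Take subsets up to half the size to avoid mirror-image duplicates.
    for k in range(1, len(values) // 2 + 1):
        for left in combinations(values, k):
            right = tuple(v for v in values if v not in left)
            if k == len(values) / 2 and left > right:
                continue  # skip the mirrored half-half partition
            splits.append((set(left), set(right)))
    return splits

def continuous_thresholds(xs):
    """Candidate cut points: midpoints between consecutive distinct values."""
    xs = sorted(set(xs))
    return [(a + b) / 2 for a, b in zip(xs, xs[1:])]

print(nominal_binary_splits(["Family", "Sports", "Luxury"]))
# -> ({'Family'}, {'Luxury','Sports'}), ({'Luxury'}, ...), ({'Sports'}, ...)
print(continuous_thresholds([60, 70, 75, 85]))  # -> [65.0, 72.5, 80.0]
```

For an ordinal attribute, only order-respecting partitions of the sorted values would be enumerated; the thresholds correspond to the binary decision (A < v) or (A >= v).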
How to Determine the Best Split
Before splitting: 10 records of class 0 and 10 records of class 1.
(Figure: several candidate attribute tests, each producing child nodes with different class distributions.)
Which test condition is the best?
- Greedy approach: nodes with a homogeneous class distribution are preferred.
- Need a measure of node impurity:
  - Non-homogeneous: high degree of impurity.
  - Homogeneous: low degree of impurity.

Measures of Node Impurity
- Gini index
- Entropy
- Misclassification error

How to Find the Best Split
- Compute the impurity M0 of the node before splitting.
- For each candidate test (say A? and B?, each with Yes/No branches), compute the impurity of the children (M1, M2 for A's children N1, N2; M3, M4 for B's children N3, N4) and combine each pair into a weighted impurity (M12 for A, M34 for B).
- Compare Gain = M0 - M12 vs. M0 - M34 and choose the test with the higher gain.

Measure of Impurity: GINI
- Gini index for a given node t:

  $GINI(t) = 1 - \sum_j [\,p(j \mid t)\,]^2$

  (NOTE: p(j|t) is the relative frequency of class j at node t.)
- Maximum ($1 - 1/n_c$, with $n_c$ classes) when records are equally distributed among all classes, implying the least interesting information.
- Minimum (0.0) when all records belong to one class, implying the most interesting information.

  C1: 0, C2: 6 -> Gini = 0.000
  C1: 1, C2: 5 -> Gini = 0.278
  C1: 2, C2: 4 -> Gini = 0.444
  C1: 3, C2: 3 -> Gini = 0.500

Examples for Computing GINI
- C1: 0, C2: 6. P(C1) = 0/6 = 0, P(C2) = 6/6 = 1.
  Gini = 1 - P(C1)^2 - P(C2)^2 = 1 - 0 - 1 = 0
- C1: 1, C2: 5. P(C1) = 1/6, P(C2) = 5/6.
  Gini = 1 - (1/6)^2 - (5/6)^2 = 0.278
- C1: 2, C2: 4. P(C1) = 2/6, P(C2) = 4/6.
  Gini = 1 - (2/6)^2 - (4/6)^2 = 0.444

Splitting Based on GINI
- Used in CART, SLIQ, SPRINT.
- When a node p is split into k partitions (children), the quality of the split is computed as

  $GINI_{split} = \sum_{i=1}^{k} \frac{n_i}{n}\, GINI(i)$

  where $n_i$ = number of records at child i, and n = number of records at node p.
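A short sketch of both formulas, checked against the worked numbers above; the helper names are illustrative.

```python
# Sketch: Gini impurity of a node and weighted Gini of a split.
# Helper names are illustrative; the math follows the formulas above.

def gini(counts):
    """GINI(t) = 1 - sum_j p(j|t)^2, given per-class record counts."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(children):
    """Weighted Gini: sum over children of (n_i / n) * GINI(child)."""
    n = sum(sum(child) for child in children)
    return sum(sum(child) / n * gini(child) for child in children)

print(round(gini([1, 5]), 3))  # 0.278, as in the examples above
print(round(gini([2, 4]), 3))  # 0.444
print(round(gini([3, 3]), 3))  # 0.5
```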
Binary Attributes: Computing the GINI Index
- Splits into two partitions.
- Effect of weighting partitions: larger and purer partitions are sought.
- Example: split B? (Yes/No) on a parent with C1: 6, C2: 6 (Gini = 0.500):

  Node  C1  C2
  N1    5   2
  N2    1   4

  Gini(N1) = 1 - (5/7)^2 - (2/7)^2 = 0.408
  Gini(N2) = 1 - (1/5)^2 - (4/5)^2 = 0.320
  GINI_split = 7/12 × 0.408 + 5/12 × 0.320 = 0.371

Categorical Attributes: Computing the Gini Index
- For each distinct value, gather the counts for each class in the dataset.
- Use the count matrix to make decisions.

Multi-way split:

  CarType  Family  Sports  Luxury
  C1       1       2       1
  C2       4       1       1
  Gini = 0.393

Two-way split (find the best partition of values):

  CarType  {Sports, Luxury}  {Family}
  C1       3                 1
  C2       2                 4
  Gini = 0.400

  CarType  {Sports}  {Family, Luxury}
  C1       2         2
  C2       1         5
  Gini = 0.419

Continuous Attributes: Computing the Gini Index
- Use binary decisions based on one value v: class counts in each of the partitions, A < v and A >= v.
- Several choices for the splitting value: the number of possible splitting values = the number of distinct values.
- Each splitting value has a count matrix associated with it.
- Simple method to choose the best v: for each v, scan the database to gather the count matrix and compute its Gini index. Computationally inefficient! Repetition of work.
Continuous Attributes: Computing the Gini Index Efficiently
- For efficient computation, for each attribute:
  - Sort the records on the attribute's values.
  - Linearly scan these values, each time updating the count matrix and computing the Gini index.
  - Choose the split position that has the least Gini index.
- Example on Taxable Income (sorted values with their Cheat labels, candidate split positions, and the Gini of each split):

  Sorted values:  60   70   75   85   90   95   100  120  125  220
  Cheat:          No   No   No   Yes  Yes  Yes  No   No   No   No

  Split position: 55    65    72    80    87    92    97    110   122   172   230
  Gini:           0.420 0.400 0.375 0.343 0.417 0.400 0.300 0.343 0.375 0.400 0.420

  The best split is at v = 97, with Gini = 0.300.

Alternative Splitting Criteria Based on INFO
- Entropy at a given node t:

  $Entropy(t) = -\sum_j p(j \mid t)\,\log_2 p(j \mid t)$

  (NOTE: p(j|t) is the relative frequency of class j at node t.)
- Measures the homogeneity of a node:
  - Maximum ($\log_2 n_c$) when records are equally distributed among all classes, implying the least information.
  - Minimum (0.0) when all records belong to one class, implying the most information.
- Entropy-based computations are similar to the GINI index computations.

Examples for Computing Entropy
- C1: 0, C2: 6. P(C1) = 0/6 = 0, P(C2) = 6/6 = 1.
  Entropy = -0 log 0 - 1 log 1 = 0 - 0 = 0
- C1: 1, C2: 5. P(C1) = 1/6, P(C2) = 5/6.
  Entropy = -(1/6) log2 (1/6) - (5/6) log2 (5/6) = 0.65
- C1: 2, C2: 4. P(C1) = 2/6, P(C2) = 4/6.
  Entropy = -(2/6) log2 (2/6) - (4/6) log2 (4/6) = 0.92
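The sorted linear scan can be written compactly. The sketch below assumes two conventions not in the slides, (value, label) input pairs and midpoint thresholds, and reuses the Gini helper from above; on the Taxable Income column it reproduces the best split in the table (cut near 97, Gini = 0.300).

```python
# Sketch: one-pass best Gini split for a continuous attribute.
# The (value, label) pairs and helper names are illustrative assumptions.

def gini(counts):
    n = sum(counts)
    return 0.0 if n == 0 else 1.0 - sum((c / n) ** 2 for c in counts)

def best_split(pairs):
    """pairs: list of (value, label). Returns (threshold, weighted Gini)."""
    pairs = sorted(pairs)
    labels = sorted({lab for _, lab in pairs})
    right = [sum(1 for _, lab in pairs if lab == c) for c in labels]
    left = [0] * len(labels)
    n = len(pairs)
    best = (None, float("inf"))
    for i in range(n - 1):
        left[labels.index(pairs[i][1])] += 1      # move one record leftward
        right[labels.index(pairs[i][1])] -= 1
        if pairs[i][0] == pairs[i + 1][0]:
            continue                              # no cut between equal values
        v = (pairs[i][0] + pairs[i + 1][0]) / 2   # midpoint threshold
        g = (i + 1) / n * gini(left) + (n - i - 1) / n * gini(right)
        if g < best[1]:
            best = (v, g)
    return best

income = [(125, "No"), (100, "No"), (70, "No"), (120, "No"), (95, "Yes"),
          (60, "No"), (220, "No"), (85, "Yes"), (75, "No"), (90, "Yes")]
print(best_split(income))  # -> (97.5, 0.3): best cut near 97, Gini = 0.300
```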
Splitting Based on INFO: Information Gain
- Information gain when a parent node p is split into k partitions ($n_i$ = number of records in partition i):

  $GAIN_{split} = Entropy(p) - \sum_{i=1}^{k} \frac{n_i}{n}\, Entropy(i)$

- Measures the reduction in entropy achieved because of the split. Choose the split that achieves the most reduction (maximizes GAIN).
- Used in ID3 and C4.5.
- Disadvantage: tends to prefer splits that result in a large number of partitions, each being small but pure.

Splitting Based on INFO: Gain Ratio
- Gain ratio when a parent node p is split into k partitions ($n_i$ = number of records in partition i):

  $GainRATIO_{split} = \frac{GAIN_{split}}{SplitINFO}, \qquad SplitINFO = -\sum_{i=1}^{k} \frac{n_i}{n} \log_2 \frac{n_i}{n}$

- Adjusts information gain by the entropy of the partitioning (SplitINFO): higher-entropy partitioning (a large number of small partitions) is penalized!
- Used in C4.5.
- Designed to overcome the disadvantage of information gain.
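The three INFO-based quantities can be sketched directly from the formulas above; the helper names are illustrative. The last line scores the B? split from the earlier Gini example by information gain instead.

```python
# Sketch: entropy, information gain, and gain ratio from class counts.
# Helper names are illustrative; the math follows the formulas above.
from math import log2

def entropy(counts):
    """Entropy(t) = -sum_j p(j|t) log2 p(j|t), skipping zero counts."""
    n = sum(counts)
    return -sum(c / n * log2(c / n) for c in counts if c > 0)

def info_gain(parent, children):
    """GAIN_split = Entropy(parent) - sum_i (n_i/n) Entropy(child_i)."""
    n = sum(parent)
    return entropy(parent) - sum(sum(ch) / n * entropy(ch) for ch in children)

def gain_ratio(parent, children):
    """GainRATIO_split = GAIN_split / SplitINFO."""
    n = sum(parent)
    split_info = -sum(sum(ch) / n * log2(sum(ch) / n) for ch in children)
    return info_gain(parent, children) / split_info

print(round(entropy([1, 5]), 2))                      # 0.65, as above
print(round(entropy([2, 4]), 2))                      # 0.92
print(round(info_gain([6, 6], [[5, 2], [1, 4]]), 3))  # gain of the B? split
```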
Splitting Criteria Based on Classification Error
- Classification error at a node t:

  $Error(t) = 1 - \max_i P(i \mid t)$

- Measures the misclassification error made by a node:
  - Maximum ($1 - 1/n_c$) when records are equally distributed among all classes, implying the least interesting information.
  - Minimum (0.0) when all records belong to one class, implying the most interesting information.

Examples for Computing Error
- C1: 0, C2: 6. P(C1) = 0/6 = 0, P(C2) = 6/6 = 1.
  Error = 1 - max(0, 1) = 1 - 1 = 0
- C1: 1, C2: 5. P(C1) = 1/6, P(C2) = 5/6.
  Error = 1 - max(1/6, 5/6) = 1 - 5/6 = 1/6
- C1: 2, C2: 4. P(C1) = 2/6, P(C2) = 4/6.
  Error = 1 - max(2/6, 4/6) = 1 - 4/6 = 1/3

Comparison Among Splitting Criteria
(Figure: for a 2-class problem, all three measures peak when the classes are evenly mixed (p = 0.5) and vanish when the node is pure; entropy is the largest of the three and misclassification error the smallest in between.)

Misclassification Error vs. Gini
- Example: split A? (Yes/No) on a parent with C1: 7, C2: 3 (Gini = 0.42):

  Node  C1  C2
  N1    3   0
  N2    4   3

  Gini(N1) = 1 - (3/3)^2 - (0/3)^2 = 0.000
  Gini(N2) = 1 - (4/7)^2 - (3/7)^2 = 24/49 ≈ 0.490
  GINI_split = 3/10 × 0.000 + 7/10 × 24/49 = 24/70 ≈ 0.343

- Gini improves! Misclassification error does not: the parent's error is 1 - 7/10 = 0.3, and the children's weighted error is 3/10 × 0 + 7/10 × 3/7 = 0.3, so the error measure sees no benefit in this split while Gini does.
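A quick numeric check of this comparison, with illustrative helper names: the weighted misclassification error of the split equals the parent's error (0.3), while the weighted Gini drops from 0.42 to about 0.343.

```python
# Sketch: compare misclassification error and Gini on the A? split above.

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def error(counts):
    """Error(t) = 1 - max_i P(i|t)."""
    return 1.0 - max(counts) / sum(counts)

def weighted(measure, children):
    n = sum(sum(ch) for ch in children)
    return sum(sum(ch) / n * measure(ch) for ch in children)

parent, children = [7, 3], [[3, 0], [4, 3]]
print(round(gini(parent), 2), round(weighted(gini, children), 3))   # 0.42 0.343
print(round(error(parent), 2), round(weighted(error, children), 2)) # 0.3 0.3
```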
Stopping Criteria for Tree Induction
- Stop expanding a node when all the records belong to the same class.
- Stop expanding a node when all the records have similar attribute values.
- Early termination (to be discussed later).

Decision Tree Based Classification
- Advantages:
  - Inexpensive to construct