Data Mining: Concepts and Techniques
Slides for Textbook Chapter 3
Jiawei Han and Micheline Kamber
Intelligent Database Systems Research Lab
School of Computing Science, Simon Fraser University, Canada
http://www.cs.sfu.ca

Chapter 3: Data Preprocessing
- Why preprocess the data?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
- Summary

Why Data Preprocessing?
- Data in the real world is dirty
  - incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
  - noisy: containing errors or outliers
  - inconsistent: containing discrepancies in codes or names
- No quality data, no quality mining results!
  - Quality decisions must be based on quality data
  - A data warehouse needs consistent integration of quality data

Forms of data preprocessing
(figure: overview of the preprocessing forms listed above)

Data Cleaning
- Data cleaning tasks:
  - Fill in missing values
  - Identify outliers and smooth out noisy data
  - Correct inconsistent data
Missing Data
- Data is not always available
  - e.g., many tuples have no recorded value for several attributes, such as customer income in sales data
- Missing data may be due to
  - equipment malfunction
  - inconsistency with other recorded data, leading to deletion
  - data not entered due to misunderstanding
  - certain data not considered important at the time of entry
  - history or changes of the data not being registered
- Missing data may need to be inferred

How to Handle Missing Data?
- Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
- Fill in the missing value manually: tedious + infeasible?
- Use a global constant to fill in the missing value: e.g., "unknown", a new class?!
- Use the attribute mean to fill in the missing value
- Use the attribute mean for all samples belonging to the same class to fill in the missing value: smarter
- Use the most probable value to fill in the missing value: inference-based, such as a Bayesian formula or a decision tree
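A minimal pandas sketch of the constant, attribute-mean, and per-class-mean strategies; the "income" and "class" columns and their values are invented for illustration, not taken from the slides.

```python
# Hedged sketch: filling missing values with a global constant, the attribute
# mean, or the per-class attribute mean. Column names and data are assumptions.
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "class":  ["A", "A", "B", "B", "B"],
    "income": [3000.0, np.nan, 5200.0, np.nan, 4800.0],
})

# Global constant (a label such as "unknown" would be used for a categorical attribute).
filled_const = df["income"].fillna(0.0)

# Attribute mean over all tuples.
filled_mean = df["income"].fillna(df["income"].mean())

# Attribute mean restricted to tuples of the same class (the "smarter" option).
filled_class_mean = df["income"].fillna(
    df.groupby("class")["income"].transform("mean"))
```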
Noisy Data
- Noise: random error or variance in a measured variable
- Incorrect attribute values may be due to
  - faulty data collection instruments
  - data entry problems
  - data transmission problems
  - technology limitations
  - inconsistency in naming conventions
- Other data problems which require data cleaning
  - duplicate records
  - incomplete data
  - inconsistent data

How to Handle Noisy Data?
- Binning method:
  - first sort the data and partition it into (equi-depth) bins
  - then smooth by bin means, by bin medians, by bin boundaries, etc.
- Clustering
  - detect and remove outliers
- Combined computer and human inspection
  - detect suspicious values and have them checked by a human
- Regression
  - smooth by fitting the data to regression functions
Simple Discretization Methods: Binning
- Equal-width (distance) partitioning:
  - divides the range into N intervals of equal size: uniform grid
  - if A and B are the lowest and highest values of the attribute, the width of the intervals is W = (B - A) / N
  - the most straightforward approach
  - but outliers may dominate the presentation
  - skewed data is not handled well
- Equal-depth (frequency) partitioning:
  - divides the range into N intervals, each containing approximately the same number of samples
  - good data scaling
  - managing categorical attributes can be tricky

Binning Methods for Data Smoothing
- Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
- Partition into (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
- Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
- Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34
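A short Python sketch reproducing the equi-depth binning example above; bin means are rounded to whole dollars only to match the slide's values.

```python
# Hedged sketch of equi-depth binning with smoothing by bin means and by bin
# boundaries, on the slide's price data (already sorted).
prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]

n_bins = 3
depth = len(prices) // n_bins
bins = [prices[i * depth:(i + 1) * depth] for i in range(n_bins)]

# Smoothing by bin means: every value is replaced by its bin's (rounded) mean.
by_means = [[round(sum(b) / len(b))] * len(b) for b in bins]

# Smoothing by bin boundaries: each value is replaced by the closer boundary.
by_bounds = [[b[0] if v - b[0] <= b[-1] - v else b[-1] for v in b] for b in bins]

print(by_means)   # [[9, 9, 9, 9], [23, 23, 23, 23], [29, 29, 29, 29]]
print(by_bounds)  # [[4, 4, 4, 15], [21, 21, 25, 25], [26, 26, 26, 34]]
```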
Cluster Analysis
(figure: data values grouped into clusters)

Regression
(figure: a regression line y = x + 1 fitted to data points (X1, Y1))

Data Integration
- Data integration: combines data from multiple sources into a coherent store
- Schema integration
  - integrate metadata from different sources
  - entity identification problem: identify real-world entities from multiple data sources, e.g., A.cust-id and B.cust-#
- Detecting and resolving data value conflicts
  - for the same real-world entity, attribute values from different sources differ
  - possible reasons: different representations, different scales, e.g., metric vs. British units

Handling Redundant Data in Data Integration
- Redundant data occur often when multiple databases are integrated
  - the same attribute may have different names in different databases
  - one attribute may be a "derived" attribute in another table, e.g., annual revenue
- Redundant data may be detectable by correlation analysis
- Careful integration of the data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality
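One way to make the correlation-analysis idea concrete is a quick Pearson-correlation check between two numeric attributes; the attribute names, values, and the 0.9 threshold below are illustrative assumptions, not from the slides.

```python
# Hedged sketch: flag a likely redundant attribute pair via the Pearson
# correlation coefficient. Attribute names and values are invented.
import numpy as np

annual_revenue  = np.array([120.0, 95.0, 210.0, 180.0, 150.0])
monthly_revenue = annual_revenue / 12.0 + np.random.normal(0, 1.0, 5)

r = np.corrcoef(annual_revenue, monthly_revenue)[0, 1]
if abs(r) > 0.9:                      # threshold is an arbitrary choice
    print(f"correlation {r:.2f}: one attribute is likely redundant")
```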
Data Transformation
- Smoothing: remove noise from the data
- Aggregation: summarization, data cube construction
- Generalization: concept hierarchy climbing
- Normalization: scale values to fall within a small, specified range
  - min-max normalization
  - z-score normalization
  - normalization by decimal scaling
- Attribute/feature construction
  - new attributes constructed from the given ones

Data Transformation: Normalization
- min-max normalization:
  v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A
- z-score normalization:
  v' = (v - mean_A) / stand_dev_A
- normalization by decimal scaling:
  v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
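A small NumPy sketch of the three normalization formulas; the attribute values and the target range [0, 1] for min-max normalization are made up for illustration.

```python
# Hedged sketch of min-max, z-score, and decimal-scaling normalization.
import numpy as np

v = np.array([200.0, 300.0, 400.0, 600.0, 1000.0])

# min-max normalization to [new_min, new_max] = [0, 1]
new_min, new_max = 0.0, 1.0
minmax = (v - v.min()) / (v.max() - v.min()) * (new_max - new_min) + new_min

# z-score normalization
zscore = (v - v.mean()) / v.std()

# decimal scaling: divide by 10^j, j the smallest integer with max(|v'|) < 1
j = int(np.floor(np.log10(np.abs(v).max()))) + 1
decimal = v / 10 ** j
```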
(figure: attribute subset selection example; reduced attribute set: {A1, A4, A6})

Heuristic Feature Selection Methods
- There are 2^d possible sub-features of d features
- Several heuristic feature selection methods:
  - Best single features under the feature-independence assumption: choose by significance tests
  - Best step-wise feature selection (sketched below):
    - the best single feature is picked first
    - then the next best feature conditioned on the first, and so on
  - Step-wise feature elimination:
    - repeatedly eliminate the worst feature
  - Best combined feature selection and elimination
  - Optimal branch and bound:
    - use feature elimination and backtracking
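A minimal sketch of best step-wise (greedy forward) selection, assuming a caller-supplied score function such as a significance test or held-out accuracy; the toy score and attribute names A1..A6 are illustrative only.

```python
# Hedged sketch: greedy forward feature selection. The score(features) callable
# is an assumed black box; the toy score below just prefers A1, A4, A6.
def forward_selection(all_features, score, k):
    selected = []
    while len(selected) < k:
        remaining = [f for f in all_features if f not in selected]
        # Add the feature that most improves the score of the current subset.
        best = max(remaining, key=lambda f: score(selected + [f]))
        selected.append(best)
    return selected

toy_score = lambda feats: sum(f in {"A1", "A4", "A6"} for f in feats)
print(forward_selection(["A1", "A2", "A3", "A4", "A5", "A6"], toy_score, 3))
# ['A1', 'A4', 'A6']
```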
Data Compression
- String compression
  - there are extensive theories and well-tuned algorithms
  - typically lossless
  - but only limited manipulation is possible without expansion
- Audio/video compression
  - typically lossy compression, with progressive refinement
  - sometimes small fragments of the signal can be reconstructed without reconstructing the whole
- Time sequences are not audio
  - typically short, and they vary slowly with time

Data Compression
(figure: lossless compression recovers the original data exactly; lossy compression yields an approximation)
Wavelet Transforms
- Discrete wavelet transform (DWT): linear signal processing
- Compressed approximation: store only a small fraction of the strongest wavelet coefficients
- Similar to the discrete Fourier transform (DFT), but better lossy compression, localized in space
- Method:
  - the length L must be an integer power of 2 (pad with 0s when necessary)
  - each transform has 2 functions: smoothing and difference
  - they apply to pairs of data, resulting in two sets of data of length L/2
  - the two functions are applied recursively until the desired length is reached
(figure: Haar-2 and Daubechies-4 wavelet functions)
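A rough sketch of the recursive smoothing/difference step on pairs, in the spirit of a (non-normalized) Haar transform; it is illustrative only and omits padding and coefficient normalization.

```python
# Hedged sketch: apply a smoothing (pairwise average) and a difference function
# to pairs, then recurse on the smoothed half; input length is a power of 2.
import numpy as np

def haar_dwt(x):
    x = np.asarray(x, dtype=float)
    coeffs = []
    while len(x) > 1:
        avg  = (x[0::2] + x[1::2]) / 2.0    # smoothing
        diff = (x[0::2] - x[1::2]) / 2.0    # difference (detail coefficients)
        coeffs.append(diff)
        x = avg                             # recurse on the smoothed half
    coeffs.append(x)                        # overall average
    return coeffs

# Keeping only the few largest-magnitude detail coefficients gives the
# compressed approximation mentioned above.
print(haar_dwt([2, 2, 0, 2, 3, 5, 4, 4]))
```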
Principal Component Analysis
- Given N data vectors from k dimensions, find c <= k orthogonal vectors that can best be used to represent the data
- The original data set is reduced to one consisting of N data vectors on c principal components (reduced dimensions)
- Each data vector is a linear combination of the c principal component vectors
- Works for numeric data only
- Used when the number of dimensions is large
(figure: principal components Y1 and Y2 of data plotted in the X1-X2 plane)
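A compact NumPy sketch of PCA via the singular value decomposition; the random data and the choice c = 2 are placeholders.

```python
# Hedged sketch: find c orthonormal principal directions and re-express each
# numeric data vector as c coordinates.
import numpy as np

def pca_reduce(X, c):
    Xc = X - X.mean(axis=0)                 # center the data
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:c]                     # c orthonormal principal directions
    return Xc @ components.T                # N vectors on c reduced dimensions

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))               # N = 100 vectors, k = 5 dimensions
print(pca_reduce(X, 2).shape)               # (100, 2)
```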
Numerosity Reduction
- Parametric methods
  - assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
  - log-linear models: obtain the value at a point in m-D space as a product over appropriate marginal subspaces
- Non-parametric methods
  - do not assume models
  - major families: histograms, clustering, sampling

Regression and Log-Linear Models
- Linear regression: data are modeled to fit a straight line
  - often uses the least-squares method to fit the line
- Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
- Log-linear model: approximates discrete multidimensional probability distributions
Regression Analysis and Log-Linear Models
- Linear regression: Y = alpha + beta * X
  - the two parameters, alpha and beta, specify the line and are estimated from the data at hand
  - by applying the least-squares criterion to the known values of Y1, Y2, ..., X1, X2, ...
- Multiple regression: Y = b0 + b1 * X1 + b2 * X2
  - many nonlinear functions can be transformed into the above
- Log-linear models:
  - the multi-way table of joint probabilities is approximated by a product of lower-order tables
  - probability: p(a, b, c, d) = alpha_ab * beta_ac * chi_ad * delta_bcd
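A small NumPy example of estimating alpha and beta with the least-squares criterion; the (X, Y) values are made up.

```python
# Hedged sketch: closed-form least-squares estimates for Y = alpha + beta * X.
import numpy as np

X = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Y = np.array([2.1, 2.9, 4.2, 4.8, 6.1])

beta = ((X - X.mean()) * (Y - Y.mean())).sum() / ((X - X.mean()) ** 2).sum()
alpha = Y.mean() - beta * X.mean()
print(f"Y = {alpha:.2f} + {beta:.2f} X")     # close to Y = 1 + 1*X for this data
```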
Histograms
- A popular data reduction technique
- Divide data into buckets and store the average (or sum) for each bucket
- Can be constructed optimally in one dimension using dynamic programming
- Related to quantization problems
(figure: example histogram with counts over value buckets from 10,000 to 100,000)
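A quick NumPy sketch of building an equi-width histogram that keeps only bucket edges plus counts (or sums); the price values are synthetic.

```python
# Hedged sketch: reduce raw values to (bucket edges, per-bucket counts/sums).
import numpy as np

prices = np.random.default_rng(1).uniform(10_000, 100_000, size=1_000)
counts, edges = np.histogram(prices, bins=9, range=(10_000, 100_000))
sums, _ = np.histogram(prices, bins=9, range=(10_000, 100_000), weights=prices)
# The reduced representation is just (edges, counts) or (edges, sums).
```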
Clustering
- Partition the data set into clusters, and store only the cluster representations
- Can be very effective if the data is clustered, but not if the data is "smeared"
- Can use hierarchical clustering and be stored in multi-dimensional index tree structures
- There are many choices of clustering definitions and clustering algorithms, further detailed in Chapter 8

Sampling
- Allows a mining algorithm to run with complexity that is potentially sub-linear in the size of the data
- Choose a representative subset of the data
  - simple random sampling may have very poor performance in the presence of skew
- Develop adaptive sampling methods
- Stratified sampling:
  - approximate the percentage of each class (or subpopulation of interest) in the overall database
  - used in conjunction with skewed data
- Sampling may not reduce database I/Os (page at a time)

Sampling
(figure: SRSWOR (simple random sample without replacement) and SRSWR drawn from the raw data)

Sampling
(figure: raw data vs. a cluster/stratified sample)
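A short pandas/NumPy sketch of the three sampling schemes shown above; the "class" column used for stratification and the sample sizes are illustrative assumptions.

```python
# Hedged sketch: SRSWOR, SRSWR, and stratified sampling on a synthetic table.
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
data = pd.DataFrame({"value": rng.normal(size=1_000),
                     "class": rng.choice(["A", "B", "C"], size=1_000)})

srswor = data.sample(n=100, replace=False, random_state=42)   # SRSWOR
srswr  = data.sample(n=100, replace=True,  random_state=42)   # SRSWR

# Stratified sample: 10% of each class, so class proportions are preserved.
stratified = data.groupby("class").sample(frac=0.1, random_state=42)
```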
Hierarchical Reduction
- Use a multi-resolution structure with different degrees of reduction
- Hierarchical clustering is often performed but tends to define partitions of data sets rather than "clusters"
- Parametric methods are usually not amenable to hierarchical representation
- Hierarchical aggregation
  - an index tree hierarchically divides a data set into partitions by value ranges of some attributes
  - each partition can be considered a bucket
  - thus an index tree with aggregates stored at each node is a hierarchical histogram
Discretization
- Three types of attributes:
  - nominal: values from an unordered set
  - ordinal: values from an ordered set
  - continuous: real numbers
- Discretization: divide the range of a continuous attribute into intervals
  - some classification algorithms only accept categorical attributes
  - reduce data size by discretization
  - prepare for further analysis
Discretization and Concept Hierarchy
- Discretization
  - reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals; interval labels can then be used to replace actual data values
- Concept hierarchies
  - reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior)

Discretization and Concept Hierarchy Generation for Numeric Data
- Binning (see earlier sections)
- Histogram analysis (see earlier sections)
- Clustering analysis (see earlier sections)
- Entropy-based discretization
- Segmentation by natural partitioning
Entropy-Based Discretization
- Given a set of samples S, if S is partitioned into two intervals S1 and S2 using boundary T, the entropy after partitioning is
  E(S, T) = |S1|/|S| * Ent(S1) + |S2|/|S| * Ent(S2)
- The boundary that minimizes the entropy function over all possible boundaries is selected as a binary discretization
- The process is applied recursively to the partitions obtained until some stopping criterion is met, e.g., until the information gain Ent(S) - E(T, S) falls below a threshold delta
- Experiments show that it may reduce data size and improve classification accuracy
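A small sketch of one binary split: compute E(S, T) for each candidate boundary and keep the one with minimum entropy; the sample values and class labels are invented.

```python
# Hedged sketch: pick the boundary T minimizing the class-information entropy
# E(S, T) defined above, for one binary split.
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def best_boundary(values, labels):
    order = np.argsort(values)
    values, labels = np.asarray(values)[order], np.asarray(labels)[order]
    best_t, best_e = None, np.inf
    for t in np.unique(values)[1:]:          # candidate boundaries
        left, right = labels[values < t], labels[values >= t]
        e = (len(left) * entropy(left) + len(right) * entropy(right)) / len(labels)
        if e < best_e:
            best_t, best_e = t, e
    return best_t, best_e

values = [4, 8, 9, 15, 21, 24, 25, 26, 28, 34]
labels = ["low"] * 4 + ["high"] * 6
print(best_boundary(values, labels))         # boundary 21 gives entropy 0.0 here
```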
Segmentation by Natural Partitioning
- The 3-4-5 rule can be used to segment numeric data into relatively uniform, "natural" intervals:
  - if an interval covers 3, 6, 7, or 9 distinct values at the most significant digit, partition the range into 3 equi-width intervals
  - if it covers 2, 4, or 8 distinct values at the most significant digit, partition the range into 4 intervals
  - if it covers 1, 5, or 10 distinct values at the most significant digit, partition the range into 5 intervals
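A rough sketch of the first-level split of the 3-4-5 rule: round low/high at the most significant digit, count the distinct msd values covered, and choose 3, 4, or 5 equi-width intervals. Recursive subdivision and the boundary adjustment for min/max are omitted.

```python
# Hedged sketch of the top-level 3-4-5 partition; interpretation of "distinct
# values at the most significant digit" as (hi - lo) / msd is an assumption.
import math

def three_four_five(low, high):
    msd = 10 ** math.floor(math.log10(max(abs(low), abs(high))))
    lo = math.floor(low / msd) * msd          # round low down at the msd
    hi = math.ceil(high / msd) * msd          # round high up at the msd
    distinct = round((hi - lo) / msd)         # distinct msd values covered
    if distinct in (3, 6, 7, 9):
        n = 3
    elif distinct in (2, 4, 8):
        n = 4
    else:                                     # 1, 5, or 10
        n = 5
    width = (hi - lo) / n
    return [(lo + i * width, lo + (i + 1) * width) for i in range(n)]

# Slide example: Low = -$159, High = $1,838 (5th / 95th percentiles of profit).
print(three_four_five(-159, 1838))
# [(-1000.0, 0.0), (0.0, 1000.0), (1000.0, 2000.0)]
```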
Example of 3-4-5 Rule
(figure) Profit data with Min = -$351, Low (5th percentile) = -$159, High (95th percentile) = $1,838, Max = $4,700.
- Step 1: take Low and High as the 5th and 95th percentiles of the data
- Step 2: most significant digit msd = 1,000, so Low rounds to -$1,000 and High rounds to $2,000
- Step 3: the interval (-$1,000 ... $2,000) covers 3 distinct msd values, so it is split into (-$1,000 ... 0), (0 ... $1,000), ($1,000 ... $2,000)
- Later steps adjust the boundaries to cover Min and Max and subdivide each interval, e.g., (-$400 ... 0) into $100 sub-intervals, (0 ... $1,000) and ($1,000 ... $2,000) into $200 sub-intervals, and ($2,000 ... $5,000) into $1,000 sub-intervals

Concept Hierarchy Generation for Categorical Data
- Specification of a partial ordering of attributes explicitly at the schema level by users or experts
- Specification of a portion of a hierarchy by explicit data grouping
- Specification of a set of attributes, but not of their partial ordering
- Specification of only a partial set of attributes
Specification of a Set of Attributes
- A concept hierarchy can be automatically generated based on the number of distinct values per attribute in the given attribute set
- The attribute with the most distinct values is placed at the lowest level of the hierarchy
  - country: 15 distinct values
  - province_or_state: 65 distinct values
  - city: 3,567 distinct values
  - street: 674,339 distinct values
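A small pandas sketch of this heuristic: count distinct values per attribute and sort ascending, so the attribute with the fewest distinct values ends up at the top of the hierarchy; the toy address rows are invented.

```python
# Hedged sketch: order attributes by their number of distinct values; the
# column names mirror the slide, the row values are placeholders.
import pandas as pd

addresses = pd.DataFrame({
    "street":            ["12 Main St", "88 Oak Ave", "3 Pine Rd", "9 Elm St"],
    "city":              ["Vancouver", "Burnaby", "Toronto", "Vancouver"],
    "province_or_state": ["BC", "BC", "ON", "ON"],
    "country":           ["Canada", "Canada", "Canada", "Canada"],
})

levels = addresses.nunique().sort_values()     # fewest distinct values first
print(list(levels.index))                      # country, province_or_state, city, street
```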
Summary
- Data preparation is a big issue for both warehousing and mining
- Data preparation includes
  - data cleaning and data integration
  - data reduction and feature selection
  - discretization
- A lot of methods have been developed, but this is still an active area of research
References
- D. P. Ballou and G. K. Tayi. Enhancing data quality in data warehouse environments. Communications of the ACM, 42:73-78, 1999.
- Jagadish et al. Special Issue on Data Reduction Techniques. Bulletin of the Technical Committee on Data Engineering, 20(4), December 1997.
- D. Pyle. Data Preparation for Data Min