Data Mining: Concepts and Techniques
Slides for Textbook Chapter 3
Jiawei Han and Micheline Kamber
Department of Computer Science, University of Illinois at Urbana-Champaign
cs.uiuc.edu/hanj
8/12/2022

Chapter 3: Data Preprocessing
- Why preprocess the data?
- Data cleaning
- Data integration and transformation
- Data reduction
- Discretization and concept hierarchy generation
- Summary

Why Data Preprocessing?
- Data in the real world is dirty
  - incomplete: lacking attribute values, lacking certain attributes of interest, or containing only aggregate data
    - e.g., occupation = ""
  - noisy: containing errors or outliers
    - e.g., Salary = "-10"
  - inconsistent: containing discrepancies in codes or names
    - e.g., Age = "42", Birthday = "03/07/2019"
    - e.g., was rating "1, 2, 3", now rating "A, B, C"
    - e.g., discrepancy between duplicate records
Why Is Data Dirty?
- Incomplete data comes from
  - "n/a" data values at collection time
  - different considerations between the time the data was collected and the time it is analyzed
  - human/hardware/software problems
- Noisy data comes from the processes of data
  - collection
  - entry
  - transmission
- Inconsistent data comes from
  - different data sources
  - functional dependency violations

Why Is Data Preprocessing Important?
- No quality data, no quality mining results!
- Quality decisions must be based on quality data
  - e.g., duplicate or missing data may cause incorrect or even misleading statistics
- A data warehouse needs consistent integration of quality data
- "Data extraction, cleaning, and transformation comprises the majority of the work of building a data warehouse." (Bill Inmon)

Multi-Dimensional Measure of Data Quality
- A well-accepted multidimensional view:
  - Accuracy
  - Completeness
  - Consistency
  - Timeliness
  - Believability
  - Value added
  - Interpretability
  - Accessibility
- Broad categories:
  - intrinsic, contextual, representational, and accessibility

Major Tasks in Data Preprocessing
- Data cleaning
  - Fill in missing values, smooth noisy data, identify or remove outliers, and resolve inconsistencies
- Data integration
  - Integration of multiple databases, data cubes, or files
- Data transformation
  - Normalization and aggregation
- Data reduction
  - Obtains a reduced representation in volume that produces the same or similar analytical results
- Data discretization
  - Part of data reduction, but of particular importance, especially for numerical data

[Figure: Forms of data preprocessing]

Data Cleaning
- Importance
  - "Data cleaning is one of the three biggest problems in data warehousing" (Ralph Kimball)
  - "Data cleaning is the number one problem in data warehousing" (DCI survey)
- Data cleaning tasks
  - Fill in missing values
  - Identify outliers and smooth out noisy data
  - Correct inconsistent data
  - Resolve redundancy caused by data integration

Missing Data
- Data is not always available
  - e.g., many tuples have no recorded value for several attributes, such as customer income in sales data
- Missing data may be due to
  - equipment malfunction
  - inconsistency with other recorded data, leading to deletion
  - data not entered due to misunderstanding
  - certain data not being considered important at the time of entry
  - failure to register history or changes of the data
- Missing data may need to be inferred

How to Handle Missing Data?
- Ignore the tuple: usually done when the class label is missing (assuming the task is classification); not effective when the percentage of missing values per attribute varies considerably
- Fill in the missing value manually: tedious, and possibly infeasible
- Fill it in automatically with
  - a global constant, e.g., "unknown" (but this may create a new class!)
  - the attribute mean
  - the attribute mean for all samples belonging to the same class: smarter
  - the most probable value: inference-based, e.g., a Bayesian formula or a decision tree

Noisy Data
- Noise: random error or variance in a measured variable
- Incorrect attribute values may be due to
  - faulty data collection instruments
  - data entry problems
  - data transmission problems
  - technology limitations
  - inconsistency in naming conventions
- Other data problems that require data cleaning
  - duplicate records
  - incomplete data
  - inconsistent data

How to Handle Noisy Data?
- Binning method:
  - first sort the data and partition it into (equi-depth) bins
  - then smooth by bin means, by bin medians, by bin boundaries, etc.
- Clustering
  - detect and remove outliers
- Combined computer and human inspection
  - detect suspicious values and have a human check them (e.g., to deal with possible outliers)
- Regression
  - smooth by fitting the data to regression functions
Simple Discretization Methods: Binning
- Equal-width (distance) partitioning:
  - Divides the range into N intervals of equal size: a uniform grid
  - If A and B are the lowest and highest values of the attribute, the interval width is W = (B - A) / N
  - The most straightforward approach, but outliers may dominate the presentation
  - Skewed data is not handled well
- Equal-depth (frequency) partitioning:
  - Divides the range into N intervals, each containing approximately the same number of samples
  - Good data scaling
  - Managing categorical attributes can be tricky

Binning Methods for Data Smoothing
- Sorted data for price (in dollars): 4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34
- Partition into (equi-depth) bins:
  - Bin 1: 4, 8, 9, 15
  - Bin 2: 21, 21, 24, 25
  - Bin 3: 26, 28, 29, 34
- Smoothing by bin means:
  - Bin 1: 9, 9, 9, 9
  - Bin 2: 23, 23, 23, 23
  - Bin 3: 29, 29, 29, 29
- Smoothing by bin boundaries:
  - Bin 1: 4, 4, 4, 15
  - Bin 2: 21, 21, 25, 25
  - Bin 3: 26, 26, 26, 34

[Figure: Cluster Analysis]

[Figure: Regression, fitting the line y = x + 1]
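The bin-smoothing example above can be reproduced in a few lines of Python. This is a minimal sketch (the function names are mine): ties on bin boundaries are broken toward the lower boundary, a choice the slides leave open but which reproduces the numbers shown.

```python
def equi_depth_bins(values, n_bins):
    """Sort values and split them into n_bins bins of (roughly) equal size."""
    data = sorted(values)
    size = len(data) // n_bins
    return [data[i * size:(i + 1) * size] for i in range(n_bins)]

def smooth_by_means(bins):
    """Replace every value in a bin with the (rounded) bin mean."""
    return [[round(sum(b) / len(b))] * len(b) for b in bins]

def smooth_by_boundaries(bins):
    """Replace every value with the closer of the bin's min/max boundary."""
    out = []
    for b in bins:
        lo, hi = b[0], b[-1]
        out.append([lo if v - lo <= hi - v else hi for v in b])
    return out

prices = [4, 8, 9, 15, 21, 21, 24, 25, 26, 28, 29, 34]
bins = equi_depth_bins(prices, 3)
# bins -> [[4, 8, 9, 15], [21, 21, 24, 25], [26, 28, 29, 34]]
```

Calling `smooth_by_means(bins)` yields 9/23/29 per bin and `smooth_by_boundaries(bins)` yields the boundary-smoothed bins, matching the slide.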
Data Integration
- Data integration:
  - combines data from multiple sources into a coherent store
- Schema integration
  - integrate metadata from different sources
  - Entity identification problem: identify the same real-world entities across multiple data sources, e.g., A.cust-id vs. B.cust-#
- Detecting and resolving data value conflicts
  - for the same real-world entity, attribute values from different sources differ
  - possible reasons: different representations, different scales, e.g., metric vs. British units

Handling Redundancy in Data Integration
- Redundant data occur often when multiple databases are integrated
  - The same attribute may have different names in different databases
  - One attribute may be a "derived" attribute in another table, e.g., annual revenue
- Redundant data may be detectable by correlation analysis
- Careful integration of data from multiple sources may help reduce/avoid redundancies and inconsistencies and improve mining speed and quality

Data Transformation
- Smoothing: remove noise from the data
- Aggregation: summarization, data cube construction
- Generalization: concept hierarchy climbing
- Normalization: scale values to fall within a small, specified range
  - min-max normalization
  - z-score normalization
  - normalization by decimal scaling
- Attribute/feature construction
  - New attributes constructed from the given ones

Data Transformation: Normalization
- min-max normalization:
    v' = (v - min_A) / (max_A - min_A) * (new_max_A - new_min_A) + new_min_A
- z-score normalization:
    v' = (v - mean_A) / stand_dev_A
- normalization by decimal scaling:
    v' = v / 10^j, where j is the smallest integer such that max(|v'|) < 1
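The three normalization formulas above translate directly into code. A minimal sketch; the income figures used in the example call (range 12,000 to 98,000, mean 54,000, standard deviation 16,000) are illustrative values, not from the text.

```python
def min_max(v, min_a, max_a, new_min, new_max):
    """Map v from [min_a, max_a] onto [new_min, new_max]."""
    return (v - min_a) / (max_a - min_a) * (new_max - new_min) + new_min

def z_score(v, mean_a, std_a):
    """Express v as a number of standard deviations from the mean."""
    return (v - mean_a) / std_a

def decimal_scaling(v, max_abs):
    """Divide by 10^j, with j the smallest integer making |v'| < 1."""
    j = 0
    while max_abs / (10 ** j) >= 1:
        j += 1
    return v / (10 ** j)

income_01 = min_max(73600, 12000, 98000, 0.0, 1.0)   # about 0.716
income_z = z_score(73600, 54000, 16000)              # 1.225
scaled = decimal_scaling(917, 986)                   # 0.917 (j = 3)
```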
- Reduced attribute set: {A1, A4, A6}

Heuristic Feature Selection Methods
- There are 2^d possible sub-features of d features
- Several heuristic feature selection methods:
  - Best single features under the feature-independence assumption: choose by significance tests
  - Best step-wise feature selection:
    - The best single feature is picked first
    - Then the next best feature conditioned on the first, ...
  - Step-wise feature elimination:
    - Repeatedly eliminate the worst feature
  - Best combined feature selection and elimination
  - Optimal branch and bound:
    - Use feature elimination and backtracking

Data Compression
- String compression
  - There are extensive theories and well-tuned algorithms
  - Typically lossless
  - But only limited manipulation is possible without expansion
- Audio/video compression
  - Typically lossy compression, with progressive refinement
  - Sometimes small fragments of the signal can be reconstructed without reconstructing the whole
- Time sequences are not audio
  - Typically short, and they vary slowly with time

[Figure: Data Compression: original data compressed losslessly vs. approximated (lossy)]

Wavelet Transformation
- Discrete wavelet transform (DWT): linear signal processing, multiresolution analysis
- Compressed approximation: store only a small fraction of the strongest wavelet coefficients
- Similar to the discrete Fourier transform (DFT), but gives better lossy compression and is localized in space
- Method:
  - The length L must be an integer power of 2 (pad with 0s when necessary)
  - Each transform has 2 functions: smoothing and difference
  - They are applied to pairs of data, yielding two sets of data of length L/2
  - The two functions are applied recursively until the desired length is reached

[Figure: Haar-2 and Daubechies-4 wavelet basis functions]

DWT for Image Compression
[Figure: an image decomposed by repeated low-pass/high-pass filtering]

Principal Component Analysis
- Given N data vectors in k dimensions, find c <= k orthogonal vectors that can best be used to represent the data
- The original data set is reduced to one consisting of N data vectors on c principal components (reduced dimensions)
- Each data vector is a linear combination of the c principal component vectors
- Works for numeric data only
- Used when the number of dimensions is large

[Figure: Principal Component Analysis, axes X1, X2 rotated to components Y1, Y2]
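The recursive smoothing/difference procedure described for the wavelet transform above can be sketched in a few lines. This is a toy Haar-style transform using pairwise averages and half-differences; real DWT implementations (e.g., Daubechies-4) use different filter coefficients and normalization, so treat this only as an illustration of the pair-and-recurse idea.

```python
def haar_step(data):
    """One level: pairwise averages (smooth) and half-differences (detail)."""
    avgs = [(data[i] + data[i + 1]) / 2 for i in range(0, len(data), 2)]
    diffs = [(data[i] - data[i + 1]) / 2 for i in range(0, len(data), 2)]
    return avgs, diffs

def haar_transform(data):
    """Apply the step recursively; len(data) must be a power of 2.

    Returns [overall average, coarsest detail, ..., finest details]."""
    details = []
    while len(data) > 1:
        data, d = haar_step(data)
        details = d + details
    return data + details

coeffs = haar_transform([4, 6, 10, 12])
# coeffs -> [8.0, -3.0, -1.0, -1.0]
```

Compression then amounts to keeping only the largest-magnitude coefficients and zeroing the rest, per the "strongest coefficients" bullet above.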
Numerosity Reduction
- Parametric methods
  - Assume the data fits some model, estimate the model parameters, store only the parameters, and discard the data (except possible outliers)
  - Log-linear models: obtain a value at a point in m-D space as a product over appropriate marginal subspaces
- Non-parametric methods
  - Do not assume models
  - Major families: histograms, clustering, sampling

Regression and Log-Linear Models
- Linear regression: data are modeled to fit a straight line
  - Often uses the least-squares method to fit the line
- Multiple regression: allows a response variable Y to be modeled as a linear function of a multidimensional feature vector
- Log-linear model: approximates discrete multidimensional probability distributions

Regression Analysis and Log-Linear Models
- Linear regression: Y = α + β X
  - The two parameters, α and β, specify the line and are estimated from the data at hand
  - by applying the least-squares criterion to the known values of Y1, Y2, ..., X1, X2, ...
- Multiple regression: Y = b0 + b1 X1 + b2 X2 + ...
  - Many nonlinear functions can be transformed into the above
- Log-linear models:
  - The multi-way table of joint probabilities is approximated by a product of lower-order tables
  - Probability: p(a, b, c, d) = α_ab β_ac χ_ad δ_bcd

Histograms
- A popular data reduction technique
- Divide data into buckets and store the average (or sum) for each bucket
- Can be constructed optimally in one dimension using dynamic programming
- Related to quantization problems

[Figure: equal-width histogram with buckets from 10,000 to 100,000]

Clustering
- Partition the data set into clusters; then one can store only the cluster representations
- Can be very effective if the data is clustered, but not if the data is "smeared"
- Clusterings can be hierarchical and can be stored in multi-dimensional index tree structures
- There are many choices of clustering definitions and clustering algorithms, further detailed in Chapter 8
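The least-squares fit of Y = α + β X from the regression slides above has a simple closed form: β is the covariance of X and Y over the variance of X, and α = mean(Y) - β mean(X). A pure-Python sketch; the four sample points are made up and fall exactly on y = 1 + 2x.

```python
def fit_line(xs, ys):
    """Estimate alpha and beta of Y = alpha + beta * X by least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    beta = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
            / sum((x - mean_x) ** 2 for x in xs))
    alpha = mean_y - beta * mean_x
    return alpha, beta

alpha, beta = fit_line([0, 1, 2, 3], [1, 3, 5, 7])
# alpha -> 1.0, beta -> 2.0
```

Numerosity reduction then means storing only (α, β) instead of the raw points, reconstructing values on demand.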
Sampling
- Allows a mining algorithm to run with complexity that is potentially sub-linear in the size of the data
- Choose a representative subset of the data
  - Simple random sampling may perform very poorly in the presence of skew
- Develop adaptive sampling methods
  - Stratified sampling:
    - Approximate the percentage of each class (or subpopulation of interest) in the overall database
    - Used in conjunction with skewed data
- Sampling may not reduce database I/Os (pages are read one at a time)

[Figure: SRSWOR (simple random sample without replacement) and SRSWR drawn from raw data]

[Figure: raw data vs. a cluster/stratified sample]
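The three sampling schemes pictured above (SRSWOR, SRSWR, and stratified sampling) can be sketched with Python's standard `random` module. The helper names and the fixed-seed convention are mine, not from the text; stratified sampling here takes the same fraction from every stratum, which preserves class percentages as the slide describes.

```python
import random

def srswor(data, n, seed=0):
    """Simple random sample of size n WITHOUT replacement."""
    return random.Random(seed).sample(data, n)

def srswr(data, n, seed=0):
    """Simple random sample of size n WITH replacement."""
    rng = random.Random(seed)
    return [rng.choice(data) for _ in range(n)]

def stratified(groups, frac, seed=0):
    """Sample the same fraction from each stratum (dict: label -> records)."""
    rng = random.Random(seed)
    return {label: rng.sample(rows, max(1, int(len(rows) * frac)))
            for label, rows in groups.items()}

sample = srswor(list(range(100)), 10)   # 10 distinct values from 0..99
```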
Hierarchical Reduction
- Use a multi-resolution structure with different degrees of reduction
- Hierarchical clustering is often performed, but it tends to define partitions of data sets rather than "clusters"
- Parametric methods are usually not amenable to hierarchical representation
- Hierarchical aggregation
  - An index tree hierarchically divides a data set into partitions by the value range of some attributes
  - Each partition can be considered a bucket
  - Thus an index tree with aggregates stored at each node is a hierarchical histogram

Discretization
- Three types of attributes:
  - Nominal: values from an unordered set
  - Ordinal: values from an ordered set
  - Continuous: real numbers
- Discretization:
  - divide the range of a continuous attribute into intervals
  - Some classification algorithms only accept categorical attributes
  - Reduce data size by discretization
  - Prepare for further analysis

Discretization and Concept Hierarchy
- Discretization
  - reduce the number of values for a given continuous attribute by dividing the range of the attribute into intervals; interval labels can then be used to replace actual data values
- Concept hierarchies
  - reduce the data by collecting and replacing low-level concepts (such as numeric values for the attribute age) with higher-level concepts (such as young, middle-aged, or senior)
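Both ideas above, interval labels for a continuous attribute and a concept hierarchy that climbs from numbers to categories, fit in a short sketch. The age cut-offs (30 and 60) are illustrative values of my choosing, not from the text.

```python
def equal_width_label(value, lo, hi, n_bins):
    """Return the interval index of value in an n_bins uniform grid over [lo, hi]."""
    width = (hi - lo) / n_bins
    idx = int((value - lo) / width)
    return min(idx, n_bins - 1)       # clamp the top edge into the last bin

def age_concept(age):
    """Replace a numeric age by a higher-level concept (illustrative cut-offs)."""
    if age < 30:
        return "young"
    if age < 60:
        return "middle-aged"
    return "senior"

ages = [13, 25, 42, 58, 61, 70]
labels = [age_concept(a) for a in ages]
# labels -> ['young', 'young', 'middle-aged', 'middle-aged', 'senior', 'senior']
```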