Data Mining: Exploring Data
Lecture Notes for Chapter 3
Introduction to Data Mining
by Tan, Steinbach, Kumar

Tan, Steinbach, Kumar    Introduction to Data Mining    8/05/2005

What is data exploration?

A preliminary exploration of the data to better understand its characteristics.

- Key motivations of data exploration include
  - Helping to select the right tool for preprocessing or analysis
  - Making use of humans' abilities to recognize patterns
    - People can recognize patterns not captured by data analysis tools
- Related to the area of Exploratory Data Analysis (EDA)
  - Created by statistician John Tukey
  - Seminal book is Exploratory Data Analysis by Tukey
  - A nice online introduction can be found in Chapter 1 of the NIST Engineering Statistics Handbook
    http://www.itl.nist.gov/div898/handbook/index.htm

Techniques Used In Data Exploration

- In EDA, as originally defined by Tukey
  - The focus was on visualization
  - Clustering and anomaly detection were viewed as exploratory techniques
  - In data mining, clustering and anomaly detection are major areas of interest, and not thought of as just exploratory
- In our discussion of data exploration, we focus on
  - Summary statistics
  - Visualization
  - Online Analytical Processing (OLAP)

Iris Sample Data Set

- Many of the exploratory data techniques are illustrated with the Iris Plant data set.
  - Can be obtained from the UCI Machine Learning Repository
    http://www.ics.uci.edu/mlearn/MLRepository.html
  - From the statistician Ronald A. Fisher
  - Three flower types (classes):
    - Setosa
    - Virginica
    - Versicolour
  - Four (non-class) attributes
    - Sepal width and length
    - Petal width and length

Image credit: Virginica. Robert H. Mohlenbrock. USDA NRCS. 1995. Northeast wetland flora: Field office guide to plant species. Northeast National Technical Center, Chester, PA. Courtesy of USDA NRCS Wetland Science Institute.

Summary Statistics

- Summary statistics are numbers that summarize properties of the data
  - Summarized properties include frequency, location and spread
    - Examples: location: mean; spread: standard deviation
  - Most summary statistics can be calculated in a single pass through the data

Frequency and Mode

- The frequency of an attribute value is the percentage of time the value occurs in the data set
  - For example, given the attribute 'gender' and a representative population of people, the gender 'female' occurs about 50% of the time.
- The mode of an attribute is the most frequent attribute value
- The notions of frequency and mode are typically used with categorical data

Percentiles

- For continuous data, the notion of a percentile is more useful.
  Given an ordinal or continuous attribute x and a number p between 0 and 100, the pth percentile x_p is a value of x such that p% of the observed values of x are less than x_p.
- For instance, the 50th percentile is the value x_50 such that 50% of all values of x are less than x_50.

Measures of Location: Mean and Median

- The mean is the most common measure of the location of a set of points.
- However, the mean is very sensitive to outliers.
- Thus, the median or a trimmed mean is also commonly used.

Measures of Spread: Range and Variance

- Range is the difference between the max and min
- The variance or standard deviation is the most common measure of the spread of a set of points.
- However, this is also sensitive to outliers, so that other measures are often used.

Visualization

Visualization is the conversion of data into a visual or tabular format so that the characteristics of the data and the relationships among data items or attributes can be analyzed or reported.

- Visualization of data is one of the most powerful and appealing techniques for data exploration.
  - Humans have a well-developed ability to analyze large amounts of information that is presented visually
  - Can detect general patterns and trends
  - Can detect outliers and unusual patterns

Example: Sea Surface Temperature

- The following shows the Sea Surface Temperature (SST) for July 1982
  - Tens of thousands of data points are summarized in a single figure

Representation

- Is the mapping of information to a visual format
- Data objects, their attributes, and the relationships among data objects are translated into graphical elements such as points, lines, shapes, and colors.
- Example:
  - Objects are often represented as points
  - Their attribute values can be represented as the position of the points or the characteristics of the points, e.g., color, size, and shape
  - If position is used, then the relationships of points, i.e., whether they form groups or a point is an outlier, are easily perceived.

Arrangement

- Is the placement of visual elements within a display
- Can make a large difference in how easy it is to understand the data
- Example:

Selection

- Is the elimination or the de-emphasis of certain objects and attributes
- Selection may involve choosing a subset of attributes
  - Dimensionality reduction is often used to reduce the number of dimensions to two or three
  - Alternatively, pairs of attributes can be considered
- Selection may also involve choosing a subset of objects
  - A region of the screen can only show so many points
  - Can sample, but want to preserve points in sparse areas

Visualization Techniques: Histograms

- Histogram
  - Usually shows the distribution of values of a single variable
  - Divide the values into bins and show a bar plot of the number of objects in each bin.
  - The height of each bar indicates the number of objects
  - Shape of histogram depends on the number of bins
- Example: Petal Width (10 and 20 bins, respectively)

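
The binning step described for histograms can be sketched in a few lines of Python. This is a minimal illustration that assumes equal-width bins; the helper name `histogram_counts` and the sample petal widths are made up for the example and are not the Iris measurements.

```python
def histogram_counts(values, num_bins):
    """Count how many values fall into each of num_bins equal-width bins."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values are identical, so everything lands in the first bin.
        return [len(values)] + [0] * (num_bins - 1)
    width = (hi - lo) / num_bins
    counts = [0] * num_bins
    for v in values:
        # Clamp the index so the maximum value falls into the last bin.
        index = min(int((v - lo) / width), num_bins - 1)
        counts[index] += 1
    return counts


# Hypothetical petal widths in cm (illustrative values only).
widths = [0.2, 0.2, 0.4, 0.3, 1.3, 1.5, 1.4, 1.0, 2.1, 2.5, 1.8, 2.3]

for bins in (3, 6):
    counts = histogram_counts(widths, bins)
    print(bins, "bins:", counts)
    for count in counts:
        print("  " + "#" * count)  # bar length = number of objects in the bin
```

Running it with two different bin counts shows how the apparent shape of the distribution changes with the number of bins.
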
Two-Dimensional Histograms

- Show the joint distribution of the values of two attributes
- Example: petal width and petal length
  - What does this tell us?

Visualization Techniques: Box Plots

- Box Plots
  - Invented by J. Tukey
  - Another way of displaying the distribution of data
  - Following figure shows the basic part of a box plot

[Figure labels, top to bottom: outlier, 90th percentile, 75th percentile, 50th percentile, 25th percentile, 10th percentile]

Example of Box Plots

- Box plots can be used to compare attributes

Visualization Techniques: Scatter Plots

- Scatter plots
  - Attribute values determine the position
  - Two-dimensional scatter plots are most common, but three-dimensional scatter plots are also possible
  - Often additional attributes can be displayed by using the size, shape, and color of the markers that represent the objects
  - It is useful to have arrays of scatter plots, which can compactly summarize the relationships of several pairs of attributes
    - See example on the next slide

Scatter Plot Array of Iris Attributes

Visualization Techniques: Contour Plots

- Contour plots
  - Useful when a continuous attribute is measured on a spatial grid
  - They partition the plane into regions of similar values
  - The contour lines that form the boundaries of these regions connect points with equal values
  - The most common example is contour maps of elevation
  - Can also display temperature, rainfall, air pressure, etc.
    - An example for Sea Surface Temperature (SST) is provided on the next slide

Contour Plot Example: SST Dec, 1998

[Figure: contour plot of sea surface temperature; color scale in Celsius]

Visualization Techniques: Matrix Plots

- Matrix plots
  - Can plot the data matrix
  - This can be useful when objects are sorted according to class
  - Typically, the attributes are normalized to prevent one attribute from dominating the plot
  - Plots of similarity or distance matrices can also be useful for visualizing the relationships between objects
  - Examples of matrix plots are presented on the next two slides

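
The normalization step mentioned for matrix plots can be sketched as follows. This is a minimal illustration that assumes z-score standardization (subtract the column mean, divide by the column standard deviation), which is one common choice and not necessarily the one used for the slides' figures. The helper names `standardize_columns` and `correlation` and the small data matrix are invented for the example, and the sketch assumes no column is constant.

```python
from statistics import mean, pstdev


def standardize_columns(matrix):
    """Rescale each column (attribute) to mean 0 and standard deviation 1."""
    columns = list(zip(*matrix))
    means = [mean(col) for col in columns]
    stdevs = [pstdev(col) for col in columns]  # assumed non-zero
    return [
        [(value - m) / s for value, m, s in zip(row, means, stdevs)]
        for row in matrix
    ]


def correlation(xs, ys):
    """Pearson correlation of two equally long sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical data matrix: rows are objects, columns are attributes
# measured on very different scales.
data = [
    [5.1, 350.0],
    [4.9, 300.0],
    [6.3, 600.0],
    [5.8, 510.0],
]

scaled = standardize_columns(data)
for row in scaled:
    print([round(v, 2) for v in row])

# Correlation matrix of the attributes (columns of the data matrix).
cols = list(zip(*data))
corr = [[correlation(a, b) for b in cols] for a in cols]
for row in corr:
    print([round(v, 2) for v in row])
```

After standardization both columns are on the same scale, so neither dominates the color range when the matrix is plotted.
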
Visualization of the Iris Data Matrix

[Figure: Iris data matrix; color scale labeled "standard deviation"]

Visualization of the Iris Correlation Matrix

Visualization Techniques: Parallel Coordinates

- Parallel Coordinates
  - Used to plot the attribute values of high-dimensional data
  - Instead of using perpendicular axes, use a set of parallel axes
  - The attribute values of each object are plotted as a point on each corresponding coordinate axis and the points are connected by a line
  - Thus, each object is represented as a line
  - Often, the lines representing a distinct class of objects group together, at least for some attributes
  - Ordering of attributes is important in seeing such groupings

Parallel Coordinates Plots for Iris Data

Other Visualization Techniques

- Star Plots
  - Similar approach to parallel coordinates, but axes radiate from a central point
  - The line connecting the values of an object is a polygon
- Chernoff Faces
  - Approach created by Herman Chernoff
  - This approach associates each attribute with a characteristic of a face
  - The values of each attribute determine the appearance of the corresponding facial characteristic
  - Each object becomes a separate face
  - Relies on humans' ability to distinguish faces

Star Plots for Iris Data

[Figure panels: Setosa, Versicolour, Virginica]

Chernoff Faces for Iris Data

[Figure panels: Setosa, Versicolour, Virginica]

OLAP

- On-Line Analytical Processing (OLAP) was proposed by E. F. Codd, the father of the relational database.
- Relational databases put data into tables, while OLAP uses a multidimensional array representation.
  - Such representations of data previously existed in statistics and other fields
- There are a number of data analysis and data exploration operations that are easier with such a data representation.

Creating a Multidimensional Array

- Two key steps in converting tabular data into a multidimensional array.
  - First, identify which attributes are to be the dimensions and which attribute is to be the target attribute whose values appear as entries in the multidimensional array.
    - The attributes used as dimensions must have discrete values
    - The target value is typically a count or continuous value, e.g., the cost of an item
    - Can have no target variable at all except the count of objects that have the same set of attribute values
  - Second, find the value of each entry in the multidimensional array by summing the values (of the target attribute) or count of all objects that have the attribute values corresponding to that entry.

Example: Iris data

- We show how the attributes, petal length, petal width, and species type can be converted to a multidimensional array
  - First, we discretized the petal width and length to have categorical values: low, medium, and high
  - We get the following table; note the count attribute

Example: Iris data (continued)

- Each unique tuple of petal width, petal length, and species type identifies one element of the array.
- This element is assigned the corresponding count value.
- The figure illustrates the result.
- All non-specified tuples are 0.

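
The two steps, discretizing the attributes and counting the objects per tuple, can be sketched as below. This is an illustrative sketch only: the cut points passed to `discretize`, the helper and variable names, and the handful of flower records are assumptions made up for the example, not the thresholds or the data behind the slides' table.

```python
from collections import Counter

LEVELS = ["low", "medium", "high"]
SPECIES = ["Setosa", "Versicolour", "Virginica"]


def discretize(value, low_cut, high_cut):
    """Map a continuous value to 'low', 'medium', or 'high' (cut points are assumed)."""
    if value < low_cut:
        return "low"
    if value < high_cut:
        return "medium"
    return "high"


# Hypothetical records: (petal length, petal width, species).
flowers = [
    (1.4, 0.2, "Setosa"),
    (1.5, 0.3, "Setosa"),
    (4.5, 1.5, "Versicolour"),
    (4.1, 1.3, "Versicolour"),
    (5.1, 1.8, "Versicolour"),
    (6.0, 2.5, "Virginica"),
    (5.6, 2.1, "Virginica"),
]

# Step 1: the dimensions are the discretized attributes plus species;
# the target is the count of objects sharing the same tuple.
table = Counter(
    (discretize(width, 0.75, 1.75), discretize(length, 2.5, 5.0), species)
    for length, width, species in flowers
)

# Step 2: fill a width x length x species array; tuples absent from the table stay 0.
array = [
    [
        [table.get((width_level, length_level, species), 0) for species in SPECIES]
        for length_level in LEVELS
    ]
    for width_level in LEVELS
]

for (width_level, length_level, species), count in sorted(table.items()):
    print(width_level, length_level, species, count)
```

The `table` plays the role of the table with a count attribute, and `array` is the corresponding three-dimensional array indexed by petal width level, petal length level, and species.
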
Example: Iris data (continued)

- Slices of the multidimensional array are shown by the following cross-tabulations
- What do these tables tell us?

OLAP Operations: Data Cube

- The key operation of OLAP is the formation of a data cube
- A data cube is a multidimensional representation of data, together with all possible aggregates.
- By all possible aggregates, we mean the aggregates that result by selecting a proper subset of the dimensions and summing over all remaining dimensions.
- For example, if we choose the species type dimension of the Iris data and sum over all other dimensions, the result will be a one-dimensional array with three entries, each of which gives the number of flowers of each type.

Data Cube Example

- Consider a data set that records the sales of products at a number of company stores at various dates.
- This data can be represented as a 3 dimensional array
- There are 3 two-dimensional aggregates (3 choose 2), 3 one-dimensional aggregates, and 1 zero-dimensional aggregate (the overall total)

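
To make the count of aggregates concrete, the sketch below enumerates every aggregate of a small three-dimensional sales array. It is only an illustration: the dimension names (product, store, date), the sales figures, and the helper `aggregate` are assumptions invented for the example.

```python
from itertools import combinations, product

DIMS = ("product", "store", "date")

# Hypothetical sales[product][store][date]: 2 products, 2 stores, 3 dates.
sales = [
    [[10, 12, 9], [4, 6, 5]],
    [[7, 8, 11], [3, 2, 6]],
]
shape = (2, 2, 3)


def aggregate(keep):
    """Sum over every dimension not listed in keep; return {kept indices: total}."""
    totals = {}
    for cell in product(*(range(n) for n in shape)):
        key = tuple(cell[axis] for axis in keep)
        p, s, d = cell
        totals[key] = totals.get(key, 0) + sales[p][s][d]
    return totals


# Every proper subset of the three dimensions yields one aggregate:
# 3 two-dimensional, 3 one-dimensional, and 1 zero-dimensional.
for size in (2, 1, 0):
    for keep in combinations(range(len(DIMS)), size):
        kept = [DIMS[axis] for axis in keep]
        print(size, "dimensions kept:", kept, aggregate(keep))
```

The zero-dimensional case keeps no dimension, so its single entry (keyed by the empty tuple) is the overall total.
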
Data Cube Example (continued)

- The following table shows one of the two-dimensional aggregates, along with two of the one-dimensional aggregates, and the overall total

OLAP Operations: Slicing and Dicing

- Slicing is selecting a group of cells from the entire multidimensional array by specifying a specific value for one or more dimensions.
- Dicing involves selecting a subset of cells by specifying a range of attribute values.
  - This is equivalent to defining a subarray from the complete array.
- In practice, both operations can also be accompanied by aggregation over some dimensions.

OLAP Operations: Roll-up and Drill-down

- Attribute values often have a hierarchical structure.
  - Each date is associated with a year, month, and week.
  - A location is associated with a continent, country, state (province, etc.), and city.
  - Products can be divided into various categories, such as clothing, electronics, and furniture.
- Note that these categories often nest and form a tree or lattice
  - A year contains months, which contain days
  - A country contains a state, which contains a city

OLAP Operations: Roll-up and Drill-down

- This hierarchical structure gives rise to the roll-up and drill-down operations.
  - For sales data, we can aggregate (roll up) the sales across all the dates in a month.
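
Slicing, dicing, and rolling up can each be sketched on a small sales array. The following is a minimal sketch under stated assumptions: the dictionary layout keyed by (product, store, date), the product and store names, the dates, and the sales figures are all hypothetical.

```python
from collections import defaultdict
from datetime import date

# Hypothetical cells of a sales array keyed by (product, store, date).
sales = {
    ("shirt", "store A", date(2005, 1, 3)): 5,
    ("shirt", "store A", date(2005, 1, 17)): 7,
    ("shirt", "store B", date(2005, 2, 1)): 4,
    ("lamp", "store A", date(2005, 1, 3)): 2,
    ("lamp", "store B", date(2005, 2, 14)): 6,
}

# Slice: fix one dimension to a single value (product == "shirt").
shirt_slice = {key: value for key, value in sales.items() if key[0] == "shirt"}

# Dice: restrict a dimension to a range of values (dates in January 2005).
january_dice = {
    key: value
    for key, value in sales.items()
    if date(2005, 1, 1) <= key[2] <= date(2005, 1, 31)
}

# Roll up: replace each date by its (year, month) and sum the sales.
monthly = defaultdict(int)
for (item, store, day), amount in sales.items():
    monthly[(item, store, (day.year, day.month))] += amount

print(shirt_slice)
print(january_dice)
print(dict(monthly))
```

Drilling down is the reverse view: starting from `monthly`, the finer-grained `sales` cells give the daily figures behind each monthly total.
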