Research & Development about Data Mining
Friday, August 12, 2022

What is Data Mining?
Introduction to Data Mining
Nanjing University of Aeronautics and Astronautics
College of Information Science and Technology
Pi Dechang (皮德常), Professor and Ph.D. supervisor

Why Mine Data? Commercial Viewpoint
- Lots of data is being collected and warehoused
  - Web data, e-commerce
  - Purchases at department/grocery stores
  - Bank/credit card transactions
- Computers have become cheaper and more powerful
- Competitive pressure is strong
  - Provide better, customized services for an edge (e.g. in Customer Relationship Management)

Why Mine Data? Scientific Viewpoint
- Data is collected and stored at enormous speeds (GB/hour)
  - Remote sensors on a satellite
  - Telescopes scanning the skies
  - Microarrays generating gene expression data
  - Scientific simulations generating terabytes of data
- Traditional techniques are infeasible for raw data
- Data mining may help scientists in classifying and segmenting data

Mining Large Data Sets - Motivation
- Data rich but information poor!
  - We are drowning in data, but starving for knowledge!
- Wow, so much data! How can we put it to use? Mine it!
- "Necessity is the mother of invention"
- Data mining: automated analysis of massive data sets

Mining Large Data Sets - Motivation
- A famous story: the item most often bought together with diapers is beer!
[Figure: diapers and beer]

The success of Google
- Search engine: analyzing data on the internet to find what meets your demand.
- Larry Page (born March 26, 1973) & Sergey Brin (born August 21, 1973)
- Fortunes of US$16.6 billion and US$14.1 billion; they share a Boeing 767

What is Data Mining?
- Data mining is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns from huge volumes of data. (U. Fayyad et al.'s definition of KDD at KDD96)

What is (not) Data Mining?
- What is Data Mining?
  - Certain names are more prevalent in certain US locations (O'Brien, O'Rourke, O'Reilly in the Boston area)
- What is not Data Mining?
  - Looking up a phone number in a phone directory

Origins of Data Mining
- Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems
- Traditional techniques may be unsuitable due to
  - Enormity of data
  - High dimensionality of data
  - Heterogeneous, distributed nature of data
[Figure: diagram relating Data Mining to Machine Learning/Pattern Recognition, Statistics/AI, and Database systems]

Architecture: Typical Data Mining System
[Figure: system architecture with the components below]
- Graphical User Interface
- Pattern Evaluation
- Data Mining Engine
- Knowledge Base
- Database or Data Warehouse Server
- Data cleaning, integration, and selection
- Data sources: DB, DW, WWW, Other Info Repositories

Data Mining Tasks
- Prediction
  - Use some variables to predict unknown or future values of other variables.
- Description
  - Find human-interpretable patterns that describe the data.
From Fayyad et al., Advances in Knowledge Discovery and Data Mining, 1996

Data Mining Tasks...
- Classification
- Clustering
- Association Rule Discovery
- Sequential Pattern Discovery
- Regression
- Deviation Detection

Classification Example

Training Set (column types: categorical, categorical, continuous, class)

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Test Set

Refund  Marital Status  Taxable Income  Cheat
No      Single          75K             ?
Yes     Married         50K             ?
No      Married         150K            ?
Yes     Divorced        90K             ?
No      Single          40K             ?
No      Married         80K             ?

[Figure: the Training Set is used to learn a classifier (the Model), which is then applied to the Test Set]

Classification: Application
- Direct Marketing
  - Goal: Reduce the cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
  - Approach:
    - Use the data for a similar product introduced before.
    - We know which customers decided to buy and which decided otherwise. This {buy, don't buy} decision forms the class attribute.
    - Collect some related information about the customers: type of business, where they stay, how much they earn, etc.
    - Use this information as input attributes to learn a classifier model.

Clustering Definition
- Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that
  - Data points in one cluster are more similar to one another.
  - Data points in separate clusters are less similar to one another.

Clustering
- Euclidean distance based clustering in 3-D space.
[Figure: clusters of points in 3-D space; intra-cluster distances are minimized, inter-cluster distances are maximized]

Clustering: Application
- Document Clustering:
  - Goal: To find groups of documents that are similar to each other based on the important terms appearing in them.
  - Approach: Identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.
  - Gain: Information retrieval can utilize the clusters to relate a new document or search term to clustered documents.

Illustrating Document Clustering
- Clustering points: 3204 articles of the Los Angeles Times.
- Similarity measure: how many words are common in these documents (after some word filtering).

Category       Total Articles  Correctly Placed
Financial      555             364
Foreign        341             260
National       273             36
Metro          943             746
Sports         738             573
Entertainment  354             278

Association Rule Discovery
- Given a set of records, each of which contains some number of items from a given collection:
  - Produce dependency rules which will predict the occurrence of an item based on occurrences of other items.

TID  Items
1    Bread, Coke, Milk
2    Beer, Bread
3    Beer, Coke, Diaper, Milk
4    Beer, Bread, Diaper, Milk
5    Coke, Diaper, Milk

Rules discovered:
  {Diaper, Milk} -> {Beer}

Association Rule Discovery: Application 1
- Supermarket shelf management.
  - Goal: To identify items that are bought together by sufficiently many customers.
  - Approach: Process the point-of-sale data collected with barcode scanners to find dependencies among items.
  - A classic rule:
    - If a customer buys diapers and milk, then he is very likely to buy beer.
    - So, don't be surprised if you find six-packs stacked next to diapers!

Regression
- Predict a value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
- Greatly studied in the statistics and neural network fields.
- Examples:
  - Predicting sales amounts of a new product based on advertising expenditure.
  - Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
  - Time series prediction of stock market indices.

Deviation/Anomaly Detection
- Detect significant deviations from normal behavior
- Applications:
  - Credit card fraud detection
  - Network intrusion detection

Challenges of Data Mining
- Scalability
- Dimensionality
- Complex and heterogeneous data
- Data quality
- Data ownership and distribution
- Privacy preservation
- Streaming data

My hope
- Data mining research has been under way for nearly 15 years. To promote wide application of the technology:
  1. Industry has begun to pay attention to data mining technology
     - What should research departments do?
  2. Research on the technology itself:
     - Ease of use
     - Usability
  3. Integration with application domains:
     - Finance
     - Bioinformatics
     - Information retrieval
     - Aircraft fault diagnosis and prediction, reliability

My research in recent years
1. Mining Acceleration-like Association Rule
2. Interior-oriented Intrusion Detection System Based on Multi-agents
3. Fuzzy Clustering Algorithm
4. A Fast Trajectory Clustering Algorithm with Sampling
5. An improved C-means clustering algorithm: employs the theory of gravity to distribute the instances.
6. A Neighborhood-Based Trajectory Clustering Algorithm: our key insight is that neighborhood-based local density is quite different from the absolute global density used in TRACLUS.
   [Figure: (a) TRACLUS's result for Deer95; (b) NBTC's result for Deer95]
7. Unifying Density-Based Clustering and Outlier Detection: to discover density-based clusters and assign to each density-based outlier a degree of being an outlier.
8. Applications of DM
   A. Information systems: database and information system security, software security
   B. Aerospace: reliability analysis, fault detection and prediction
   C. Electronic components: fault diagnosis

Thank you! Questions?
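
Appendix sketch: checking the diaper/beer rule

Returning to the five-transaction table on the Association Rule Discovery slide, the rule {Diaper, Milk} -> {Beer} can be checked by simple counting. The slides do not say how rules are scored, so the support and confidence measures used here are an assumption (they are the usual choice), and the function names are hypothetical. This is a minimal sketch, not the method used in the lecture.

```python
# Market-basket transactions copied from the Association Rule Discovery slide.
transactions = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def support(itemset, baskets):
    """Fraction of baskets that contain every item in `itemset`."""
    itemset = set(itemset)
    return sum(itemset <= basket for basket in baskets) / len(baskets)


def confidence(antecedent, consequent, baskets):
    """Support of (antecedent + consequent) divided by support of antecedent.

    Assumes the antecedent occurs in at least one basket.
    """
    antecedent, consequent = set(antecedent), set(consequent)
    return support(antecedent | consequent, baskets) / support(antecedent, baskets)


# Score the rule {Diaper, Milk} -> {Beer} from the slide.
rule_support = support({"Diaper", "Milk", "Beer"}, transactions)
rule_confidence = confidence({"Diaper", "Milk"}, {"Beer"}, transactions)
print(f"support={rule_support:.2f} confidence={rule_confidence:.2f}")
```

On this table the rule holds in 2 of the 5 baskets (support 0.40) and in 2 of the 3 baskets that contain both Diaper and Milk (confidence about 0.67).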