Introduction to Data Mining (courseware): 数据挖掘概论.课件.ppt

Uploaded by (seller): 三亚风情 | Document ID: 3251717 | Uploaded: 2022-08-13 | Format: PPT | Pages: 30 | Size: 4.15 MB

Introduction to Data Mining
What is Data Mining?
Pi Dechang (皮德常), Professor and Doctoral Supervisor
College of Information Science and Technology, Nanjing University of Aeronautics and Astronautics
Research & Development about Data Mining, Friday, August 12, 2022

Why Mine Data? Commercial Viewpoint
- Lots of data is being collected and warehoused:
  - Web data, e-commerce
  - Purchases at department/grocery stores
  - Bank/credit card transactions
- Computers have become cheaper and more powerful.
- Competitive pressure is strong:
  - Provide better, customized services for an edge (e.g. in Customer Relationship Management).

Why Mine Data? Scientific Viewpoint
- Data is collected and stored at enormous speeds (GB/hour):
  - remote sensors on a satellite
  - telescopes scanning the skies
  - microarrays generating gene expression data
  - scientific simulations generating terabytes of data
- Traditional techniques are infeasible for raw data.
- Data mining may help scientists in classifying and segmenting data.

Mining Large Data Sets: Motivation
- Data rich but information poor! We are drowning in data, but starving for knowledge!
- Wow! So much data! How can we put it to use? Mine it!
- "Necessity is the mother of invention."
- Data mining: automated analysis of massive data sets.

Mining Large Data Sets: Motivation
- A famous story: the item most often bought together with diapers is beer!
- (Figure: diapers, beer)

The success of Google
- Search engine: analyzing data on the internet to find what meets your demand.
- Larry Page (born March 26, 1973) and Sergey Brin (born August 21, 1973): fortunes of US$16.6 billion and US$14.1 billion; they share a Boeing 767.

What is Data Mining?
- Data mining is the non-trivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns from huge volumes of data.
  (U. Fayyad et al.'s definition of KDD at KDD96)

What is (not) Data Mining?
- What is Data Mining?
  - Certain names are more prevalent in certain US locations (O'Brien, O'Rourke, O'Reilly in the Boston area).
- What is not Data Mining?
  - Looking up a phone number in a phone directory.

Origins of Data Mining
- Draws ideas from machine learning/AI, pattern recognition, statistics, and database systems.
- Traditional techniques may be unsuitable due to:
  - Enormity of data
  - High dimensionality of data
  - Heterogeneous, distributed nature of data
- (Figure: Data Mining at the intersection of Machine Learning/Pattern Recognition, Statistics/AI, and Database systems)

Architecture: Typical Data Mining System
(Figure components, in the order extracted from the slide:)
- Data cleaning, integration, and selection
- Database or Data Warehouse Server
- Data Mining Engine
- Pattern Evaluation
- Graphical User Interface
- Knowledge Base
- Data sources: DB, DW, WWW, other info repositories

Data Mining Tasks
- Prediction: use some variables to predict unknown or future values of other variables.
- Description: find human-interpretable patterns that describe the data.
(From Fayyad et al., Advances in Knowledge Discovery and Data Mining, 1996)

Data Mining Tasks...
- Classification
- Clustering
- Association Rule Discovery
- Sequential Pattern Discovery
- Regression
- Deviation Detection

Classification Example
Training Set (Refund: categorical; Marital Status: categorical; Taxable Income: continuous; Cheat: class)

Tid   Refund   Marital Status   Taxable Income   Cheat
1     Yes      Single           125K             No
2     No       Married          100K             No
3     No       Single           70K              No
4     Yes      Married          120K             No
5     No       Divorced         95K              Yes
6     No       Married          60K              No
7     Yes      Divorced         220K             No
8     No       Single           85K              Yes
9     No       Married          75K              No
10    No       Single           90K              Yes

Test Set

Refund   Marital Status   Taxable Income   Cheat
No       Single           75K              ?
Yes      Married          50K              ?
No       Married          150K             ?
Yes      Divorced         90K              ?
No       Single           40K              ?
No       Married          80K              ?

Flow: Training Set -> Learn Classifier -> Model; the Model is then applied to the Test Set.

Classification: Application
- Direct Marketing
  - Goal: reduce the cost of mailing by targeting a set of consumers likely to buy a new cell-phone product.
  - Approach:
    - Use the data for a similar product introduced before.
    - We know which customers decided to buy and which decided otherwise. This {buy, don't buy} decision forms the class attribute.
    - Collect some related information about the customers: type of business, where they stay, how much they earn, etc.
    - Use this information as input attributes to learn a classifier model.

Clustering Definition
- Given a set of data points, each having a set of attributes, and a similarity measure among them, find clusters such that:
  - Data points in one cluster are more similar to one another.
  - Data points in separate clusters are less similar to one another.

Clustering
- Euclidean distance based clustering in 3-D space.
  - Intra-cluster distances are minimized.
  - Inter-cluster distances are maximized.

Clustering: Application
- Document Clustering
  - Goal: to find groups of documents that are similar to each other based on the important terms appearing in them.
  - Approach: identify frequently occurring terms in each document. Form a similarity measure based on the frequencies of different terms. Use it to cluster.
  - Gain: information retrieval can utilize the clusters to relate a new document or search term to clustered documents.

Illustrating Document Clustering
- Clustering points: 3204 articles of the Los Angeles Times.
- Similarity measure: how many words are common in these documents (after some word filtering).

Category        Total Articles   Correctly Placed
Financial       555              364
Foreign         341              260
National        273              36
Metro           943              746
Sports          738              573
Entertainment   354              278

Association Rule Discovery

- Given a set of records, each of which contains some number of items from a given collection, produce dependency rules that predict the occurrence of an item based on occurrences of other items.

TID   Items
1     Bread, Coke, Milk
2     Beer, Bread
3     Beer, Coke, Diaper, Milk
4     Beer, Bread, Diaper, Milk
5     Coke, Diaper, Milk

Rules discovered:
  {Diaper, Milk} -> {Beer}

Association Rule Discovery: Application 1
- Supermarket shelf management
  - Goal: to identify items that are bought together by sufficiently many customers.
  - Approach: process the point-of-sale data collected with barcode scanners to find dependencies among items.
  - A classic rule:
    - If a customer buys diapers and milk, then he is very likely to buy beer.
- So, don't be surprised if you find six-packs stacked next to diapers!

Regression
- Predict the value of a given continuous-valued variable based on the values of other variables, assuming a linear or nonlinear model of dependency.
- Extensively studied in the statistics and neural network fields.
- Examples:
  - Predicting sales amounts of a new product based on advertising expenditure.
  - Predicting wind velocities as a function of temperature, humidity, air pressure, etc.
  - Time series prediction of stock market indices.

Deviation/Anomaly Detection
- Detect significant deviations from normal behavior.
- Applications:
  - Credit card fraud detection
  - Network intrusion detection

Challenges of Data Mining
- Scalability
- Dimensionality
- Complex and heterogeneous data
- Data quality
- Data ownership and distribution
- Privacy preservation
- Streaming data

My hope
- Data mining research has been under way for nearly 15 years. To promote wide application of the technology:
1. Industry has begun to pay attention to data mining technology.
   - What should research institutions do?
2. Research on the technology itself:
   - Ease of use
   - Usability
3. Integration with application domains:
   - Finance
   - Bioinformatics
   - Information retrieval
   - Aircraft fault diagnosis and prediction, reliability

My research in recent years
1. Mining Acceleration-like Association Rule
2. Interior-oriented Intrusion Detection System Based on Multi-agents
3. Fuzzy Clustering Algorithm
4. A Fast Trajectory Clustering Algorithm with Sampling
5. An improved C-means clustering algorithm: employs the theory of gravity to distribute the instances.
6. A Neighborhood-Based Trajectory Clustering Algorithm: our key insight is that neighborhood-based local density is quite different from the absolute global density used in TRACLUS.
   (Figure: (a) TRACLUS's result for Deer95; (b) NBTC's result for Deer95)
7. Unifying Density-Based Clustering and Outlier Detection: to discover density-based clusters and assign to each density-based outlier a degree of being an outlier.
8. Applications of DM:
   A. Information systems: database and information system security, software security
   B. Aerospace: reliability analysis, fault detection and prediction
   C. Electronic components: fault diagnosis

Thank you! Questions?
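
As a worked illustration of the Association Rule Discovery slides above, the following minimal Python sketch checks the rule {Diaper, Milk} -> {Beer} against the five transactions listed there. The slides do not say how a rule's strength is measured. Support and confidence, the two measures commonly used for association rules, are assumed here, and the function names `support` and `confidence` are hypothetical helpers introduced only for this example.

```python
from fractions import Fraction

# The five market-basket transactions from the Association Rule Discovery slide.
TRANSACTIONS = [
    {"Bread", "Coke", "Milk"},
    {"Beer", "Bread"},
    {"Beer", "Coke", "Diaper", "Milk"},
    {"Beer", "Bread", "Diaper", "Milk"},
    {"Coke", "Diaper", "Milk"},
]


def support(itemset, transactions):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return Fraction(hits, len(transactions))


def confidence(antecedent, consequent, transactions):
    """support(antecedent plus consequent) divided by support(antecedent)."""
    antecedent, consequent = set(antecedent), set(consequent)
    base = support(antecedent, transactions)
    if base == 0:
        raise ValueError("antecedent never occurs in the transactions")
    return support(antecedent | consequent, transactions) / base


# Assumed measures (not named on the slides): support and confidence of the rule.
rule_support = support({"Diaper", "Milk", "Beer"}, TRANSACTIONS)
rule_confidence = confidence({"Diaper", "Milk"}, {"Beer"}, TRANSACTIONS)
print(f"support = {rule_support}, confidence = {rule_confidence}")
# prints: support = 2/5, confidence = 2/3
```

On this toy data, diapers and milk appear together in three of the five baskets, and two of those three also contain beer. A real miner such as Apriori would enumerate candidate itemsets and keep only the rules above chosen support and confidence thresholds; this sketch only evaluates a single given rule.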
