1、 1 2 例子1 如何找出美国人最喜欢的派? 3 例子1 根据销售记录,30寸的派中,苹果派卖的最好 对于11寸的派,苹果派只能排4到5名,为什么呢? 4 例子1 30寸的派:整个家庭需要都能够接受的口味,苹果派是大家都能接受的,但是不一定是最喜欢的,妥协的结果 11寸的派,自己一个人吃,选择自己最喜欢的口味 更多的数据 获得小数据无法获得的信息 5 例子2You receive an email from a medical researcher concerning a project that you are eager to work on.Hi,Ive attached the da
2、ta file that I mentioned in my previous email. Each line contains the information for a single patient and consists of five fields. We want to predict the last field using the other fields. I dont have time to provide any more information about the data since Im going out of town for a couple of day
3、s, but hopefully that wont slow you down too much. Thanks and see you in a couple of days. 6 例子2Despite some misgivings, you proceed to analyze the data. The first few rows of the file are as follows:Nothing looks strange. You put your doubts asideand start the analysis.Two days later you you arrive
4、 for the meeting, and while waiting for others to arrive, you strike up a conversation with a statistician who is working on the project. 7 例子2Statistician: So, you got the data for all the patients?Data Miner: Yes. I havent had much time for analysis, but I do have a few interesting results.Statist
5、ician: Amazing. There were so many data issues with this set of patients that I couldnt do much.Data Miner: Oh? I didnt hear about any possible problems.Statistician: Well, first there is field 5, the variable we want to predict. Its common knowledge among people who analyze this type of data that r
6、esults are better if you work with the log of the values, but I didnt discover this until later. Was it mentioned to you?Data Miner: No.Statistician: But surely you heard about what happened to field 4? Its supposed to be measured on a scale from 1 to 10, with 0 indicating a missing value, but becau
7、se of a data entry error, all 10s were changed into 0s. Data Miner: Interesting. Were there any other problems?Statistician: Yes, fields 2 and 3 are basically the same, but I assume that you probably noticed that.Data Miner: Yes, but these fields were only weak predictors of field 5. 8 例子2Statistici
8、an: Anyway, given all those problems, Im surprised you were able to accomplish anything.Data Miner: True, but my results are really quite good. Field 1 is a very strong predictor of field 5. Im surprised that this wasnt noticed before.Statistician: What? Field 1 is just an identification number.Data
9、 Miner: Nonetheless, my results speak for themselves.Statistician: Oh, no! I just remembered. We assigned ID numbers after we sorted the records based on field 5. There is a strong connection, but its meaningless. Sorry.Lesson: Get to know your data! 9 挖掘 山西挖矿 前提是有矿,包括煤矿的储藏量,储藏深度,煤的成色 之后是挖矿,要把这些埋在地下
10、的矿挖出来,需要挖矿工,挖矿机,运输机 之后是加工,洗煤,炼丹,等等 最后才是转化为银子 数据挖掘 前提是有数据,包括数据储藏量,储藏深度,数据的成色 之后是数据挖掘,要把这些埋藏的数据挖掘出来 之后是把数据可视化输出,指导分析、商业实践 直到这一步,才创造了价值大数据:一座正在形成的巨型矿山!大数据:一座正在形成的巨型矿山! 10 越来越多的数据被收集 Web数据 电子商务 购买消费记录 银行支付转账信息 电脑变得强大和廉价 商业竞争压力变大 提供更好的,更个性化的服务给用户 发现用户的兴趣背景:商业视角 11 背景:科学视角 数据收集及存储的速度变大 (GB/hour) 卫星上面的传感器
11、天文望远镜 基因数据 科学仿真:生产大量数据 传统技术无法应对这么大的数据量 数据挖掘 数据的处理和分类 形成研究假设 12 挖掘大规模数据集 数据中经常存在数据中经常存在“隐藏隐藏”的信息的信息 人工分析需要花费大量的数据,也很难发现其背后的信息人工分析需要花费大量的数据,也很难发现其背后的信息 现实中,大量的数据都没有被分析过现实中,大量的数据都没有被分析过0500,0001,000,0001,500,0002,000,0002,500,0003,000,0003,500,0004,000,00019951996199719981999The Data GapTotal new disk
12、(TB) since 1995Number of analysts From: R. Grossman, C. Kamath, V. Kumar, “Data Mining for Scientific and Engineering Applications” 13 数据挖掘定义 技术角度的定义技术角度的定义数据挖掘(Data Mining)是从大量的、不完全的、有噪声的、模糊的、随机的实际应用数据中,提取隐含在其中的、人们事先不知道的、但又是潜在有用的信息和知识的过程近义词:数据融合、数据分析和决策支持等数据源必须是真实的、海量的、含噪声的发现的是用户感兴趣的知识发现的知识要可接受、可理解
13、、可运用并不要求发现放之四海皆准的知识,仅支持特定的发现问题 14 数据挖掘定义 商业角度的定义商业角度的定义 数据挖掘是一种新的商业信息处理技术,其主要特点是对商业数据库中的大量业务数据进行抽取、转换、分析和其他模型化处理,从中提取辅助商业决策的关键性信息 一类深层次的数据分析方法 可以描述为:按企业既定业务目标,对大量的企业数据进行探索和分析,揭示隐藏的、未知的或验证己知的规律性,并进一步将其模型化的有效方法 15 什么是数据挖掘?l 什么是数据挖掘? 在美国的特定地点,特定的名字比较流行 (OBrien, ORurke, OReilly in Boston area) 把搜索引擎关于“A
14、mazon”的返回结果按照相似性分为不同的小组l 什么不是数据挖掘? 从电话库中查询电话号码 利于搜索引擎查找“Amazon”的信息 16 基本概念 信息:事物运动的状态和状态变化的方式 数据:指一个有关事实F的集合 如学生档案数据库中有关学生基本情况的各条记录 用来描述事物有关方面的信息 一般而言,这些数据都是准确无误的 数据可能存储在数据库、数据仓库和其他信息资料库中 17 基本概念 知识 人们实践经验的结晶且为新的实践所证实的;是关于事物运动的状态和状态变化的规律;是对信息加工提炼所获得的抽象化产物。 知识的形式可能是模式、关联、变化、异常以及其他有意义的结构。 18 基本概念 模式 对
15、于集合F中的数据,我们可以用语言L来描述其中数据的特性,得出一个表达式E,E所描述的数据是集合F的一个子集FE。只有当表达式E比列举所有FE中元素的描述方法更为简单时,我们才可称之为模式。如:“如果成绩在81-90之间,则成绩优良”可称为一个模式,而“如果成绩为81、82、83、84、85、86、87、88、89或90,则成绩优良”则不能不能称之为一个模式。 19 数据挖掘里程碑 1763 年,Thomas Bayes 的论文在他死后发表 Bayes 理论将当前概率与先验概率联系起来 Bayes 理论能够帮助理解基于概率估计的复杂现况 成为数据挖掘和概率论的基础 1805 年, Adrien-
16、Marie Legendre 和 Carl Friedrich Gauss 使用回归确定了天体(彗星和行星)绕行太阳的轨道 回归分析的目标是估计变量之间的关系 在这个例子中采用的方法是最小二乘法 回归成为数据挖掘的重要工具之一 20 数据挖掘里程碑 1936 年,计算机时代即将到来,海量数据的收集和处理成为可能 1936年发表的论文On Computable Numbers中,Alan Turing 介绍了通用图灵机的构想 通用机具有像今天的计算机一般的计算能力 现代计算机就是在图灵这一开创性概念上建立起来的 1943 年,Warren McCullon 和 Walter Pitts 首先构建
17、出神经网络的概念模型 A logical calculus of the ideas immanent in nervous activity 的论文阐述了网络中神经元的概念 每一个神经元可以做三件事情:接受输入,处理输入和生成输出。 21 数据挖掘里程碑 1975 年,John Henry Holland 所著的自然与人工系统中的适应问世 成为遗传算法领域具有开创意义的著作 讲解了遗传算法领域中的基本知识,阐述理论基础,探索其应用 1989 年,术语“数据库中的知识发现”(KDD)被Gregory Piatetsky-Shapiro 提出 合作建立起第一个同样名为KDD的研讨会 22 数据挖
18、掘里程碑 1992 年,Berhard E. Boser, Isabelle M. Guyon 和 Vladimir N. Vanik对原始的支持向量机提出了一种改进办法 新的支持向量机充分考虑到非线性分类器的构建 1993 年,Gregory Piatetsky-Shapiro 创立“ Knowledge Discovery Nuggets (KDnuggets) ”通讯 本意是联系参加KDD研讨会的研究者 KD的读者群现在似乎广泛得多 23 数据挖掘里程碑 2003 年,Micheal Lewis 写的 点球成金 出版 奥克兰运动家队(美国职业棒球大联盟球队)使用一种统计的,数据驱动的方式
19、针对球员的素质进行筛选,这些球员被低估或者身价更低 成功组建了一支打进2002和2003年季后赛的队伍,而他们的薪金总额只有对手的1/3 24 数据挖掘里程碑 在 2015 年二月,DJ Patil成为白宫第一位首位数据科学家 数据挖掘:商业、科学、工程和医药、信用卡交易、股票市场、国家安全、基因组测序、临床试验 25 相关领域 人工智能 机器学习 模式识别 统计学 数据库 . 26 人工智能 20世纪50年代到70年代,“推理期” A. Newell and H. Simon, “逻辑理论家” “通用问题求解” 20世纪70年代中期,“知识期” 专家系统 由人把知识总结出来再教给计算机相当困
20、难 20世纪80年代,“机器学习” 1980,在CMU举行第一届机器学习研讨会 1983,Machine Learning: An Artificial Intelligence Approach 1986,Machine Learning创刊 27 机器学习分类 从例子中学习 监督学习(分类、回归) 非监督学习(聚类) 在问题求解和规划中学习 通过观察和发现学习 从指令中学习R.S. Michalski, J.G. Carbonell, T.M. Mitchell, eds, “Machine Learning:An Artificial Intelligence Approach,” Pa
21、lo Alto, CA: Tioga Publishing Co., 1983. 28 从例子中学习 20世纪90年代中期之前,归纳逻辑程序设计(Inductive Logic Programming) 知识表达能力,复杂数据极其关系 利用领域知识指导学习 从数据中学习领域知识 学习过程面临的假设空间太大 29 数据挖掘 vs 知识发现(KDD) 数据挖掘是KDD中利用算法处理数据的步骤 逐渐演变成KDD的同义词 30 数据挖掘 vs 机器学习 机器学习:利用经验来改善计算机系统自身的性能 数据挖掘(知识发现):从海量数据中找出有用的知识 利用机器学习界提供的技术来分析海量数据 利用数据库界提
22、供的技术来管理海量数据 31 数据挖掘 vs 统计学 数据挖掘很多工作由统计方法完成 目标相似,许多算法源于数理统计 部分统计学家认为数据挖掘是统计学的分支 大部分数据挖掘研究人员不这么认为 32 数据挖掘 vs 传统数据分析方法 数据源 数据是海量的 数据有噪声 数据可能非结构化,异构多源 传统数据分析方法:假设驱动 给出一个假设,然后通过数据验证 数据挖掘:发现驱动 模式从数据中自动提取出来 发现不能靠直觉发现的信息或知识 挖掘出的信息越出乎意料,可能越有价值 33 实现流程实现流程原始数据数据模式知识应用准备挖掘解释运用 各步骤之间互相影响、反复调整,形成一种螺旋式上升过程。 34 Da
23、ta Mining Tasks Prediction Methods Use some variables to predict unknown or future values of other variables. Description Methods Find human-interpretable patterns that describe the data.From Fayyad, et.al. Advances in Knowledge Discovery and Data Mining, 1996 35 Data Mining Tasks. Classification Pr
24、edictive Clustering Descriptive Association Rule Discovery Descriptive Sequential Pattern Discovery Descriptive Regression Predictive Anomaly Detection Predictive 36 http:/ 37 Challenges of Data Mining Scalability Dimensionality Complex and Heterogeneous Data Data Quality Data Ownership and Distribu
25、tion Privacy Preservation Streaming Data 38 相关学术会议 SIGIR, KDD, ICDM, SDM, CIKM, PAKDD WWW, WSDM AAAI, IJCAI VLDB, SIGMOD, ICDE BigData ICML, NIPS 39 相关学术期刊 IEEE Transactions on Knowledge and Data Engineering(TKDE) ACM Transactions on Knowledge Discovery from Data (TKDD) ACM Transactions on Intellige
26、nt Systems and Technology (TIST) ACM Transactions on Information Systems(TOIS) IEEE Transactions on Systems, Man, and Cybernetics, Part B IEEE Transactions on Neural Network (TNN) Knowledge and Information Systems (KAIS) Pattern Recognition (PR) 40 相关比赛 阿里天池比赛: http:/ IJCAI: http:/ijcai15.org/index.
27、php/repeat-buyers-prediction-competition Kaggle: https:/ DataCastle: http:/ ImageNet: http:/image-net.org/challenges/LSVRC/2015/index KDD Cup: https:/ Angry Birds AI Competition: http:/aibirds.org/ 41 Some of my research work 42 Recommender System Motivation: Information Overload Solution: Search En
28、gines Recommender Systems 43 Missing Value Prediction With CF Approach 1: Neighborhood-based approach Approach 2: Model-based approach Approach 3: Time-aware approach Approach 4: Network coordinate based approach Approach 5: Ranking-based approach 44 44Similarity ComputationApproach 1: Neighborhood-
29、Based User-item matrix: MN, each entry is the failure probability of a Web service Pearson Correlation Coefficient (PCC)?0.5 45 Similar User SelectionApproach 1: Neighborhood-Based For a user u, a set of similar users S(u) can be found by: Simk is the kth largest PCC value with the current user u. S
30、im(u, a) 0 is to exclude the dissimilar users. Sim(u, a) can be calculated by PCC. 46 User-based Prediction (UPCC)Approach 1: Neighborhood-Based Given a missing value pu,i, if the user u has similar users (S(u) null), the missing value can be predicted by: and are average failure probabilities of di
31、fferent Web services observed by user u and user a. wa can be calcualted by: 47 WSRec: Hybrid Prediction ApproachApproach 1: Neighborhood-Based Similar users + Similar Web services 48 ExperimentsApproach 1: Neighborhood-Based Metrics of Prediction Accuracy : the expected value : the predicted value
32、: the number of predicted values 49 49Performance ComparisonApproach 1: Neighborhood-Based 50 50Drawbacks of Neighborhood-based ApproachApproach 1: Neighborhood-Based Computational complexity Matrix sparsity problemNot easy to find similar users (or similar items) 51 Approach 2: Model-based Approach
33、Approach 2: Model-Based A small number of factors influencing the QoS performance A users Web service QoS values correspond to a linear combination of the factors Each row of UT is a set of feature factors, and each column of V is a set of linear predictors Matrix Factorization (MF)s1 s2 s3 s4 s5 s6
34、 The error between the actualValue and the predictionRegularization termsIEEE TSC13aUT V 52 NIMF: NeighborhoodIntegrated Matrix FactorizationApproach 2: Model-BasedUsers own ratingRating due to similar users 53 Performance ComparisonApproach 2: Model-Based 54 Approach 3: Time-Aware Approach Time-awa
35、re personalized QoS evaluation on Web services is essential for Automatic selection Dynamical composition ISSRE11Approach 3: Time-Aware Approach 55 Model-Based: Tensor FactorizationApproach 3: Time-Aware ApproachUser 56 Latent Features LearningApproach 3: Time-Aware Approach Objective function Local
36、 optimal solution is found by incremental gradient descentThe error between estimated tensor and the original tensorRegularization terms which constrain the norms of U, S and T, to avoid overfitting problemRegularization term which prevents the predicted QoS values from varying a lot against the ave
37、rage QoS value 57 Missing Value PredictionApproach 3: Time-Aware Approach Given feature spaces U, S and T, a missing value is predicted by evaluating how the features of corresponding user, service and time apply to each other Example:UserTimeServicelatent-service matrixlatent-user matrixu1 u2 u3 u4
38、 u5latent-time matrix s1 s2 s3 s4 s5 t1 t2 t3 t4 t5341Y 58 Performance ComparisonsApproach 3: Time-Aware Approach Matrix Factorization extended methodsMF1 This method considers the user-service-time tensor as a set of user-service matrix slices in terms of time. Then it applies MF on each matrix sli
39、ce.MF2 This method compresses the user-service-time tensor into a user-service matrix. Then it applies MF on this user-service matrix. Tensor Factorization methodsTF Tensor factorization-based prediction method.WSPred Tensor factorization-based prediction method with average QoS value constraints. 5
40、9 Experimental ResultsApproach 3: Time-Aware Approach A smaller MAE or RMSE value means a better performance 925% 515% 312% 1622% 313% 112%Performance improvement of WSPred 60 Approach 4: Network Coordinate Based Key idea: leverage both historical QoS information and active measurements to enhance Q
41、oS prediction Advantage: solve the cold-start problem in MF modelsA Prototype of Network Coordinate SystemApproach 4: Network Coordinate-based Prediction ICWS12 61 Algorithm OverviewApproach 4: Network Coordinate BasedLandmark Coordinate ComputationWeb Service Coordinate ComputationService User Coor
42、dinate ComputationResponse Time PredictionWeb Service SelectionOffline Coordinates UpdatingOnline Web Service Selection 62 Distance matrix between n landmarkswhereSquared sum of prediction errorRegularization termEuclidean distanceminLandmarksApproach 4: Network Coordinate BasedLandmark Coordinate C
43、omputation 63 Distance matrix between n landmarks and w Web service hostsminSquared sum of errorRegularization termWeb service hostApproach 4: Network Coordinate BasedWeb Service Coordinate Computation 64 minService userWeb service hostsHistorical dataReference information of landmarksAvailable hist
44、orical data constraints Regularization termApproach 4: Network Coordinate BasedService User Coordinate Computation 65 The set of Webservices with unknown response time dataThe coordinate of service user uThe coordinate of Web service siApproach 4: Network Coordinate BasedQoS Prediction ResultPerform
45、ance Comparison 66 Approach 5: Ranking-Based PredictionSelect the optimal Web service from the candidatesNeighborhood-based approaches:Predict QoS values rank the candidates Ranking-based approach: Rank the candidates directly without predicting QoS valuesExpected values: (2, 3, 5) on (ws1, ws2, ws3
46、) Prediction 1: (3, 2, 4); MAE = (|2-3| + |3-2| + |5-4|)/3 = 1 Prediction 2: (1, 2, 3); MAE = (|2-1| + |3-2| + |5-3|)/3 = 1.3Ranking-based: Expected Ranking: ws1 ws2 ws3 Prediction 1: ws2 ws1 ws3 Prediction 2: ws1 ws2 ws3 67 Approach 5: Ranking-Based Prediction User preference on two Web services wh
47、ich have been invoked previously: User preference on pairs of Web services that have not both been invoked by the current user: 68 Approach 5: Ranking-Based Prediction Kendall Rank Correlation Coefficient (KRCC) N is the number of Web services. C is the number of concordant pairs between two ranking
48、s. D is the number of discordant pairs.Target Web services: (ws1, ws2, ws3) User 1 observed response-time: (2, 3, 5) User 2 observed response-time: (1, 2, 3)User 1: ws1 ws2, ws1 ws3, ws2 ws3User 2: ws1 ws2, ws1 ws3, ws2 ws3N = 3; C = 3; D = 0; Sim(user1, user2) = (3-0) / (3(3-1)/2) = 1 Target Web se
49、rvices: (ws1, ws2, ws3) User 1 observed response-time: (2, 3, 5) User 2 observed response-time: (3, 2, 1)User 1: ws1 ws2, ws1 ws3, ws2 ws2, ws1 ws3, ws2 ws3N = 3; C = 0; D = 3; Sim(user1, user2) = (0-3) / (3(3-1)/2) = -1 69 Given a preference function, choose a ranking that agrees with the preferenc
50、es as much as possible. Object function: Target: produce a ranking that maximizes the objective function. Trivial Solution: search through possible rankings n! possible rankings NP-Complete problemApproach 5: Ranking-Based Prediction 70 Step 1: For each service candidate, calculate the sum of prefer