1、1Chapter 6.分类分类:Advanced Methodsn贝叶斯信念网络n后向传播分类 Classification by Backpropagationn支持向量机 Support Vector MachinesnClassification by Using Frequent PatternsnLazy Learners(or Learning from Your Neighbors)n其他分类方法nAdditional Topics Regarding ClassificationnSummary2贝叶斯信念网络贝叶斯信念网络nBayesian belief networks(又
2、称为 Bayesian networks,probabilistic networks):允许变量子集间定义类条件独立n(有向无环)因果关系的图模型n表示变量间的依赖关系n给出了一个联合概率分布q Nodes:随机变量q Links:依赖关系q X,Y 是Z的双亲,Y is the parent of PqZ 和 P间没有依赖关系q 没有环3贝叶斯信念网络贝叶斯信念网络:An ExampleFamilyHistory(FH)LungCancer(LC)PositiveXRaySmoker(S)EmphysemaDyspneaLCLC(FH,S)(FH,S)(FH,S)(FH,S)0.80.20
3、.50.50.70.30.10.9CPT:Conditional Probability Table for variable LungCancer:niYParentsixiPxxPn1)(|(),.,(1显示父母的每个可能组合的条件概率从从CPT推倒推倒 X的特定值得概率的特定值得概率4训练贝叶斯网路训练贝叶斯网路:几种方案几种方案nScenario 1:给定网络结构和所有变量观察:只计算CPTnScenario 2:网络结构已知,某些变量隐藏:梯度下降法(贪心爬山),i.e.,沿着准则函数的最速下降方向搜索解n权重初始化为随机值n每次迭代中,似乎是对目前的最佳解决方案前进,没有回溯n每次
4、迭代中权重被更新,并且收敛到局部最优解nScenario 3:网络结构未知,所有变量可知:搜索模型空间构造网络拓扑nScenario 4:未知结构,隐藏变量:目前没有好的算法nD.Heckerman.A Tutorial on Learning with Bayesian Networks.In Learning in Graphical Models,M.Jordan,ed.MIT Press,2019.5Chapter 6.分类分类:Advanced MethodsnBayesian Belief NetworksnClassification by BackpropagationnSup
5、port Vector MachinesnClassification by Using Frequent PatternsnLazy Learners(or Learning from Your Neighbors)nOther Classification MethodsnAdditional Topics Regarding ClassificationnSummary6用反向传播分类用反向传播分类n反向传播:一种神经网络学习算法 n最早是由心理学家和神经学家开创的,开发和测试神经元计算模拟n神经网络:一组连接的输入/输出单元,其中每个连接都与一个权重关联n通过调整权重来学习,能够输入元
6、组的正确类别标号n又被称为连接者学习connectionist learning7神经网络作为分类器神经网络作为分类器n弱点弱点n学习时间很长学习时间很长 n需要很多参数(常靠经验确定)需要很多参数(常靠经验确定),如网络的结构如网络的结构 n可解释性差可解释性差:很难解释权重和网络中很难解释权重和网络中“隐藏单元隐藏单元”的含义的含义n优势优势n对噪音数据的高承受能力对噪音数据的高承受能力n分类未经训练的模式的能力分类未经训练的模式的能力n非常适合处理连续值的输入非常适合处理连续值的输入/输出输出n成功地应用于现实数据成功地应用于现实数据,e.g.,手写字符识别手写字符识别n算法是固有并行的
7、算法是固有并行的n已经发展了一些从训练好的神经网路提取规则的技术已经发展了一些从训练好的神经网路提取规则的技术8多层前馈神经网络多层前馈神经网络 输出层输出层输入层输入层隐藏层隐藏层Output vectorInput vector:Xwijijkiikjkjxyyww)()()()1(9多层前馈神经网络多层前馈神经网络n网络的输入输入对应于每个训练元组的测量属性 n输入同时传给称作输入层输入层的单元n加权后同时传递给隐藏层n隐藏层的数目是任意的,通常只有一个n最后一个隐藏层的输出权重后作为输入传递给称为输出层,此处给出网络的预测n前馈feed-forward:权重都不反馈到输入单元或前一层的
8、输出单元n从统计学观点,网络进行一个非线性回归;给定足够的隐藏单元和训练数据,可以逼近任何函数10定义网络拓扑定义网络拓扑n确定网络拓扑确定网络拓扑:给定输入层的单元数,隐藏层数(if 1),每个隐藏层的单元数,输出层的单元数n规格化训练元组的输入值 0.01.0n对于离散值,可重新编码,每个可能的值一个输入单元并初始化0n输出输出,如果涉及超过两个类别则一个输出单元对应一个类别n一旦一个训练好的网络其准确率达不到要求时,用不同的网络拓扑和初始值重新训练网络11反向传播反向传播Backpropagationn迭代地处理训练数据&比较网络的预测值和实际的目标值n对每个训练元组,修改权重最小化目标
9、的预测值和实际值之间的mean squared errorn这种修改后向进行:从输出层开始,通过每个隐藏层直到第一个隐藏层n步骤n初始化权重为一个小的随机数,以及偏倚 biases n向前传播输入(应用激励函数)n向后传播误差(更新权重和偏倚)n停止条件(当误差非常小,etc.)12神经元神经元:一个隐藏一个隐藏/输出层单元输出层单元 n一个n-维输入向量x被映射被映射到变量y,通过非线性函数映射n单元的输入是前一层的输出.被乘上权重后求和且加上此单元的偏倚.然后应用一个非线性激励函数.mk f输入向量输入向量xoutput y激励激励weightvector ww0w1wnx0 x1xn)s
10、ign(yExampleFor n0ikiixwmbias权重和权重和后向传播算法后向传播算法14效率和可解释性效率和可解释性n向后传播的效率向后传播的效率:每次迭代 O(|D|*w),|D|为元组数,w 个权重,最坏的情况下迭代的次数可能是元组数的指数n为了更容易理解:通过网络修剪网络修剪提取规则n简化网络结构,去除对训练的网络有最弱影响的权重连接n对连接,单元,or 活跃值聚类n输入和活跃值集合用来推导描述输入和隐藏层间关系的规则nSensitivity analysis:评估一个给定的输入变量对网络输出的影响。从中获得的知识可以表示为规则。nIF X 减少5%THEN Y增加15Chap
11、ter 6.分类分类:Advanced Methodsn贝叶斯信念网络n后向传播分类 Classification by Backpropagationn支持向量机 Support Vector MachinesnClassification by Using Frequent PatternsnLazy Learners(or Learning from Your Neighbors)n其他分类方法nAdditional Topics Regarding ClassificationnSummary16分类分类:一个数学映射一个数学映射nClassification:预测分类的类标签nE.g
12、.,个人主页分类nxi=(x1,x2,x3,),yi=+1 or 1nx1:#of word“homepage”nx2:#of word“welcome”n x X=n,y Y=+1,1,n推导一个函数 f:X Y n线性分类n二元分类问题n红线上面的点属于 class xn下面的点属于 class onExamples:SVM,Perceptron,Probabilistic Classifiersxxxxxxxxxxooooooooooooo17SVMSupport Vector Machinesn一个相对新的分类方法,适用于linear and nonlinear datan使用一个非线
13、性映射把原始训练数据变换到高维空间中n在新的维上,搜索线性优化分离超平面hyperplane(i.e.,“决策边界”)n用一个合适的足够高维的映射,两类数据总是可以被超平面分开nSVM 使用support vectors(“基本”选练元组)和边缘margins(由支持向量定义)发现超平面18SVM历史和应用历史和应用nVapnik and colleagues(1992)基础工作来自于Vapnik&Chervonenkis statistical learning theory in 1960snFeatures:训练慢但是准确度高,由于能够建模非线性决策边界(margin maximizat
14、ion)nUsed for:分类和数值预测n应用:n手写数字识别,object recognition,speaker identification,基准时间序列预测检验 19支持向量机的一般哲学支持向量机的一般哲学Support VectorsSmall Margin边界Large Margin2022年8月12日星期五Data Mining:Concepts and Techniques20SVMMargins and Support Vectors21SVM当数据线性可分时当数据线性可分时mD 为(X1,y1),(X|D|,y|D|),其中 Xi 带标签yi的训练元组有无数条直线(hyp
15、erplanes)可以分离两个类,但我们需要发现最好的一个(对未知数据有最小化的分类误差)SVM searches for the hyperplane with the largest margin,i.e.,maximum marginal hyperplane(MMH)22SVM线性可分线性可分n一个分离超平面可以写成W X+b=0W=w1,w2,wn 权重向量和标量b(bias)n对于2-D,可以写成w0+w1 x1+w2 x2=0n超平面定义了边缘的边界:H1:w0+w1 x1+w2 x2 1 for yi=+1,andH2:w0+w1 x1+w2 x2 1 for yi=1n任何一
16、个位于超平面H1 or H2(i.e.,the sides defining the margin)的样本为 support vectorsn最大边缘是最大边缘是 2/wmaxn是一个 constrained(convex)quadratic optimization problem:二次目标函数和线性约束 Quadratic Programming(QP)Lagrangian multipliers23Why Is SVM Effective on High Dimensional Data?n训练后的分类器的complexity由支持向量数而不是数据维度刻画n支持向量support vec
17、tors是基本的/临界的训练元组离决策边界最近(MMH)n如果其他的样本删掉后重新训练仍然会发现相同的分离超平面n支持向量的数目可用于计算(svm分类器)期望误差率的上界(upper),其独立于数据维度n一个只有少量支持向量的svm有很好的推广性能,即使数据的维度很高时24SVM线性不可分线性不可分n把原始输入数据变换到一个更高维的空间nSearch for a linear separating hyperplane in the new spaceA1A225SVM:不同的核函数不同的核函数n计算变换后数据的点积,数学上等价于应用一个核函数K(Xi,Xj)于原始数据,i.e.,K(Xi,X
18、j)=(Xi)(Xj)nTypical Kernel FunctionsnSVM 也可用于多类数据(2)和回归分析(需要附加参数)26SVM vs.Neural NetworknSVMnDeterministic algorithmnNice generalization propertiesnHard to learn 使用 quadratic programming techniques批量学习nUsing kernels can learn very complex functionsnNeural NetworknNondeterministic algorithmnGeneraliz
19、es well but doesnt have strong mathematical foundationnCan easily be learned in incremental fashionnTo learn complex functionsuse multilayer perceptron(nontrivial)27SVM Related LinksnSVM Website:kernel-machines.org/nRepresentative implementationsnLIBSVM:an efficient implementation of SVM,multi-class
20、 classifications,nu-SVM,one-class SVM,including also various interfaces with java,python,etc.nSVM-light:simpler but performance is not better than LIBSVM,support only binary classification and only in C nSVM-torch:another recent implementation also written in C28Chapter 6.惰性学习惰性学习nBayesian Belief Ne
21、tworksnClassification by BackpropagationnSupport Vector MachinesnClassification by Using Frequent PatternsnLazy Learners(or Learning from Your Neighbors)nOther Classification MethodsnAdditional Topics Regarding ClassificationnSummary29Lazy vs.Eager LearningnLazy vs.eager learningnLazy learning(e.g.,
22、基于实例的学习):仅存储数据(或稍加处理)直到碰到检验元组才开始处理nEager learning(前面介绍的方法):给定训练数据,在遇到待处理的新数据前构造分类模型nLazy:训练用时很少,预测时用时多n准确性n惰性学习方法可以有效地利用更丰富的假设空间,使用多个局部线性函数来对目标函数形成一个隐式的全局逼近nEager:必须限于一个假设,它覆盖了整个实例空间30Lazy Learner:基于实例的方法基于实例的方法nInstance-based learning:nStore training examples and delay the processing(“lazy evaluati
23、on”)until a new instance must be classifiedn典型的方法nk-nearest neighbor approachn实例表示为欧氏空间中的点.nLocally weighted regressionnConstructs local approximationn基于案例的推理Case-based reasoningn使用符号表示和知识为基础的推理31k-最近邻算法最近邻算法n所有的样本对应于n-D 空间的点n通过Euclidean distance,dist(X1,X2)定义最近邻居n目标函数可以是discrete-or real-值n对于离散值,k-N
24、N 返回与目标元组最近的k个训练样本的多数类nVonoroi diagram:the decision surface induced by 1-NN for a typical set of training examples ._+_xq+_+_+.32k-NN Algorithm的讨论的讨论nk-NN:元组的未知实值的预测时n返回与未知元组k个最近邻居的平均值(对应属性)nDistance-weighted nearest neighbor algorithmn根据与目标元组的距离权重组合k个近邻的贡献nGive greater weight to closer neighborsnRo
25、bust to noisy data by averaging k-nearest neighborsnCurse of dimensionality:邻居间的距离会被无关联的属性影响 n坐标轴伸缩或去除次要的属性2),(1ixqxdw33基于案例的推理基于案例的推理(CBR)nCBR:使用一个问题解的数据库来求解新问题n存储符号描述(tuples or cases)不是Euclidean space的点n应用:顾客-服务台(产品有关的诊断),合法裁决nMethodologyn实例表示为复杂的符号描述(e.g.,function graphs)n搜索相似的案例,组合多个返回的例子nTight
26、coupling between case retrieval,knowledge-based reasoning,and problem solvingnChallengesnFind a good similarity metric nIndexing based on syntactic similarity measure,and when failure,backtracking,and adapting to additional cases34Chapter 6.分类分类:其他方法其他方法nBayesian Belief NetworksnClassification by Ba
27、ckpropagationnSupport Vector MachinesnClassification by Using Frequent PatternsnLazy Learners(or Learning from Your Neighbors)nOther Classification MethodsnAdditional Topics Regarding ClassificationnSummary35遗传算法遗传算法(GA)nGenetic Algorithm:模仿生物进化n使用随机产生的规则组成一个最初的populationn每个规则有一系列位表示nE.g.,if A1 and
28、A2 then C2 can be encoded as 100 n如果一个属性有 k 2 个值,使用k位 n基于适者生存原理,最适合的规则及其后代组成新的种群n规则的拟合度用它在训练样本的准确率来评估n通过交叉和突变来产生后代n此过程持续下去,直到种群P进化到其中的每个规则满足给定的拟合度阈值n算法慢,但易于并行36Fuzzy Set ApproachesnFuzzy logic 使用0.0,1.0真值来表示类的成员的隶属度 n属性值被转化成模糊值.Ex.:n对于每个离散类别收入low,medium,high,x 被分配一个模糊的隶属值,e.g.$49K 属于“medium income”0
29、.15,属于“high income”的隶属值是0.96n模糊隶属值的和不一定等于1.n每个可用的规则为类的隶属贡献一票n通常,对每个预测分类的真值求和,并组合这些值37Chapter 6.分类分类:Advanced MethodsnBayesian Belief NetworksnClassification by BackpropagationnSupport Vector MachinesnClassification by Using Frequent PatternsnLazy Learners(or Learning from Your Neighbors)nOther Class
30、ification MethodsnAdditional Topics Regarding ClassificationnSummary多类分类多类分类n分类时设计多个类别(i.e.,2 Classes)nMethod 1.One-vs.-all(OVA):每次学习一个分类器 n给定m个类,训练m个分类其,每个类别一个n分类器 j:把类别 j的元组定义为 positive&其他的为 negativen为分类样本 X,所有分类器投票来集成nMethod 2.All-vs.-all(AVA):为每一对类别学习一个分类器nGiven m classes,construct m(m-1)/2 bina
31、ry classifiersn使用两个类别的元组训练一个分类器n为分类元组X,每个分类器投票.X is assigned to the class with maximal votenComparisonnAll-vs.-all tends to be superior to one-vs.-allnProblem:Binary classifier is sensitive to errors,and errors affect vote count38多类分类的多类分类的Error-Correcting Codesn最初目的是在数据传输的通讯任务中通过探索数据冗余来修正误差。例:nA 7-
32、bit codeword associated with classes 1-439ClassError-Corr.CodewordC111 1 1111C200 0 0111C300 1 1001C401 0 1010n给定未知元组给定未知元组X,7个分类器的结果为:0001010nHamming distance:#两个码字间不同位数的和nH(X,C1)=5,检查 1111111&0001010间不同位数和nH(X,C2)=3,H(X,C3)=3,H(X,C4)=1,thus C4 as the label for X nError-correcting codes can correct
33、 up to(h-1)/h 1-bit error,where h is the minimum Hamming distance between any two codewords nIf we use 1-bit per class,it is equiv.to one-vs.-all approach,the code are insufficient to self-correctnWhen selecting error-correcting codes,there should be good row-wise and col.-wise separation between th
34、e codewords 半监督分类半监督分类nSemi-supervised:使用有标签和无标签数据构造分类器nSelf-training:nBuild a classifier using the labeled datanUse it to label the unlabeled data,and those with the most confident label prediction are added to the set of labeled datan重复以上过程nAdv:容易理解;disadv:可能增大误差nCo-training:Use two or more classi
35、fiers to teach each othern每个学习者使用元组的相互独立的特征集合来训练一个好的分类器F1n然后 f1 and f2 用来预测未知元组 X 的类别标签nTeach each other:The tuple having the most confident prediction from f1 is added to the set of labeled data for f2,&vice versa nOther methods,e.g.,joint probability distribution of features and labels40主动学习主动学习Ac
36、tive Learningn获取类标签是昂贵nActive learner:query human(oracle)for labelsnPool-based approach:Uses a pool of unlabeled datanL:D中有标签的样本子集,U:D的一个未标记数据集n使用一个查询函数小心地从U选择1或多个元组,并咨询标签an oracle(a human annotator)nThe newly labeled samples are added to L,and learn a modelnGoal:Achieve high accuracy using as few l
37、abeled data as possiblenEvaluated using learning curves:Accuracy as a function of the number of instances queried(#of tuples to be queried should be small)nResearch issue:How to choose the data tuples to be queried?nUncertainty sampling:choose the least certain onesnReduce version space,the subset o
38、f hypotheses consistent w.the training datanReduce expected entropy over U:Find the greatest reduction in the total number of incorrect predictions41迁移学习:概念框架迁移学习:概念框架nTransfer learning:Extract knowledge from one or more source tasks and apply the knowledge to a target tasknTraditional learning:每一个任
39、务建立分类器nTransfer learning:Build new classifier by applying existing knowledge learned from source tasks42Traditional Learning FrameworkTransfer Learning Framework迁移学习迁移学习:Methods and Applicationsn应用:数据过时或分布的变化时,e.g.,Web document classification,e-mail spam filteringnInstance-based transfer learning:Re
40、weight some of the data from source tasks and use it to learn the target tasknTrAdaBoost(Transfer AdaBoost)n假定源和目标数据用相同的属性和类别描述,but rather diff.distributionsnRequire only labeling a small amount of target datan训练中使用源数据:When a source tuple is misclassified,reduce the weight of such tupels so that the
41、y will have less effect on the subsequent classifiernResearch issuesnNegative transfer:When it performs worse than no transfer at allnHeterogeneous transfer learning:Transfer knowledge from different feature space or multiple source domainsnLarge-scale transfer learning4344Chapter 6.分类分类:频繁模式频繁模式nBa
42、yesian Belief NetworksnClassification by BackpropagationnSupport Vector MachinesnClassification by Using Frequent PatternsnLazy Learners(or Learning from Your Neighbors)nOther Classification MethodsnAdditional Topics Regarding ClassificationnSummary45关联分类关联分类n关联分类关联分类:主要步骤主要步骤n挖掘关于频繁模式(属性-值对的联结)和类标签
43、间的强关联n产生以下形似的关联规则 P1 p2 pl “Aclass=C”(conf,sup)n组织规则,形成基于规则的分类器n为什么有效为什么有效?n可以发现(在多个属性间)高置信度的关联,可以克服决策树规约引入的约束,决策树一次考虑一个属性n研究发现,关联分类通常比某些传统的分类方法更精确,例如C4.546典型的关联分类方法典型的关联分类方法nCBA(Classification Based on Associations:Liu,Hsu&Ma,KDD98)n挖掘可能关联规则:Cond-set(属性-值 的集合)class labeln建立分类器:基于置信度和支持度的下降序组织规则nCMA
44、R(Classification based on Multiple Association Rules:Li,Han,Pei,ICDM01)n分类:多个规则的统计分析nCPAR(Classification based on Predictive Association Rules:Yin&Han,SDM03)n产生预测性规则(FOIL-like analysis)允许覆盖的元组以降低权重形式保留下来构造新规则n(根据期望准确率)使用最好的k 个规则预测n更有效(产生规则少),精确性类似CMAR47频繁模式频繁模式 vs.单个特征单个特征(a)Austral(c)Sonar(b)CleveF
45、ig.1.Information Gain vs.Pattern Length某些频繁模式的判别能力高于单个特征.48经验结果经验结果 010020030040050060070000.10.20.30.40.50.60.70.80.91InfoGainIG_UpperBndSupport Information Gain(a)Austral(c)Sonar(b)BreastFig.2.Information Gain vs.Pattern Frequency49特征选择特征选择Feature Selectionn给定频繁模式集合,存在non-discriminative和redundant
46、的模式,他们会引起过度拟合n我们希望选出discriminative patterns,并且去除冗余n借用Maximal Marginal Relevance(MMR)的概念nA document has high marginal relevance if it is both relevant to the query and contains minimal marginal similarity to previously selected documents50实验结果实验结果5051Scalability Tests52基于频繁模式的分类基于频繁模式的分类nH.Cheng,X.Ya
47、n,J.Han,and C.-W.Hsu,“Discriminative Frequent Pattern Analysis for Effective Classification”,ICDE07nAccuracy issue问题nIncrease the discriminative powernIncrease the expressive power of the feature spacenScalability issue问题nIt is computationally infeasible to generate all feature combinations and filt
48、er them with an information gain thresholdnEfficient method(DDPMine:FPtree pruning):H.Cheng,X.Yan,J.Han,and P.S.Yu,Direct Discriminative Pattern Mining for Effective Classification,ICDE08 53DDPMine Efficiency:RuntimePatClassHarmonyDDPMinePatClass:ICDE07 Pattern Classification Alg.54SummarynEffective
49、 and advanced classification methods nBayesian belief network(probabilistic networks)nBackpropagation(Neural networks)nSupport Vector Machine(SVM)nPattern-based classificationnOther classification methods:lazy learners(KNN,case-based reasoning),genetic algorithms,rough set and fuzzy set approachesnA
50、dditional Topics on ClassificationnMulticlass classificationnSemi-supervised classificationnActive learningnTransfer learning55References(1)nC.M.Bishop,Neural Networks for Pattern Recognition.Oxford University Press,2019nC.J.C.Burges.A Tutorial on Support Vector Machines for Pattern Recognition.Data