CHAPTER 12: Intelligent Information Processing with Deep Learning (university lecture slides, .ppt)


Chapter 11: Deep Learning and Applications
Qian Feng, School of Information and Communication Engineering

Outline
- Introduction to the perceptron
- Fundamentals of deep learning
- Fully connected networks
- Deep learning vs. shallow learning
- Convolutional neural networks
- Generative adversarial networks
- The backpropagation (BP) training algorithm for deep networks

Introduction
Jeff Dean: among many predictions, by 2025-2030 some 1-2 billion people will lose their jobs to AI.

Biological neurons
- Humans have roughly 10 billion neurons (nerve cells), plus even more supporting cells.
- Each neuron has three parts: the cell body, the dendrites, and the axon, which connects to about 10,000 other neurons. Neurons exchange signals through about 1,000 trillion synaptic connections, processing on the order of a trillion bits per second.
- Human memory capacity is estimated on the order of 1-1,000 TB.
- Incoming signals from other neurons arrive with synaptic weights that adapt over time; outgoing signals are passed on to other neurons. A human axon can be up to 1 meter long.

What is the natural human nervous system good at?
- Vision
- Hearing (extremely well adapted)
- Speech recognition and speaking
- Driving
- Playing games
- Natural language understanding
- What it is "not good at": multiplying by 2, remembering a phone number, and the like.

Deep learning: why not the other learning methods?
- Linear regression? Does the linearity assumption hold?
- Bayesian learning? What is the prior?
- SVM? What are the features?
- Decision trees? What are the nodes and attribute values (nodes/variables)?
- PAC learning? What is the function being learned?
- KNN? Which features do we cluster on?
These methods are not suited to very complex models!

Perceptron structure
Inputs x1, x2, x3 feed hand-designed features f1, f2, f3 ("by hand!"), which a learned layer then combines. As long as you pick the right features, this can learn almost anything.

The binary threshold neuron actually provides a powerful machine-learning paradigm:
- choose the right features;
- linearly separate those features.
This is essentially the requirement Rosenblatt originally formulated for the perceptron; Minsky and Papert in effect accomplished a different goal.

Binary threshold neuron (McCulloch-Pitts, 1943)
There are two equivalent ways of describing the binary threshold neuron:
1. Compare the weighted sum to a threshold: output 1 if sum_i x_i*w_i >= theta, else 0.
2. Fold the threshold into a bias and compare to 0: output 1 if z = b + sum_i x_i*w_i >= 0, else 0.

Avoiding a separate learning rule for the bias
- Add the bias cleverly to the input: append a constant input of 1 whose weight is b.
- The bias can now be learned as if it were just another weight.
- The explicit threshold can then be removed.

Convergence of the perceptron learning algorithm
- If the output unit is correct, keep its weights unchanged.
- If the output unit incorrectly outputs zero, add the input vector to the weight vector.
- If the output unit incorrectly outputs one, subtract the input vector from the weight vector.
If such a "solution" exists, this procedure is guaranteed to find a set of weights that is correct on all training cases.
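The update rule translates directly into code. A minimal NumPy sketch (the OR-gate data at the end is an illustrative, linearly separable example; the epoch cap is an assumption):

import numpy as np

def train_perceptron(X, y, max_epochs=100):
    # Perceptron convergence procedure: correct -> leave w unchanged;
    # wrongly outputs 0 -> add x to w; wrongly outputs 1 -> subtract x.
    X = np.hstack([np.ones((len(X), 1)), X])   # bias as an extra input
    w = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        errors = 0
        for x, t in zip(X, y):
            out = 1 if w @ x >= 0 else 0
            if out == t:
                continue                # correct: keep weights unchanged
            w += x if t == 1 else -x    # the two mistake cases
            errors += 1
        if errors == 0:                 # correct on all training cases
            return w
    return w

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]])   # OR gate: separable
y = np.array([0, 1, 1, 1])
print(train_perceptron(X, y))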

Weight space
- The dimension k is the number of weights w = (w1, ..., wk).
- A point in the space represents a weight vector (w1, ..., wk) as its coordinates.
- Each training case is represented as a hyperplane through the origin (assuming we have moved the threshold into a bias weight). The weights must lie on one side of this hyperplane to get the answer right.
Remember the dot-product facts: a.b = |a||b|cos(theta_ab) = a1*b1 + a2*b2 + ... + an*bn. Thus a.b > 0 if -pi/2 < theta_ab < pi/2, and a.b < 0 if theta_ab > pi/2 or theta_ab < -pi/2.

The cone of feasible solutions
To get all training cases right, we need a point on the "right side" of every plane (each plane representing one training case; a positive example and a negative example constrain opposite sides). If a solution region exists, that region is a cone, and it is convex.

Limitations of the perceptron
- If we are allowed to pick the features by hand, we can do anything; but that is not learning.
- If we do not pick the features, Minsky and Papert showed that a perceptron cannot do much. We will look at these proofs.

A perceptron cannot learn XOR
We show that a binary threshold output unit cannot do XOR. Positive examples: (1,1) -> 1 and (0,0) -> 1; negative examples: (1,0) -> 0 and (0,1) -> 0.
The 4 input-output pairs give 4 inequalities, where T is the threshold:
  w1 + w2 >= T and 0 >= T, so adding them gives w1 + w2 >= 2T;
  w1 < T and w2 < T, so adding them gives w1 + w2 < 2T.
These contradict each other, so no weights and threshold satisfy all four cases.

Convolutional neural networks

CNN in Keras
Compared with the fully connected network, only the network structure and the input format are modified (a vector becomes a 3-D tensor).
The architecture alternates Convolution -> Max Pooling -> Convolution -> Max Pooling.
The input shape is input_shape = (28, 28, 1): 28 x 28 pixels, with a last dimension of 1 for black/white or 3 for RGB. The first convolution applies 25 filters of size 3 x 3 (e.g. filters whose entries are +1/-1, each responding to a particular local pattern).
Layer shapes: input 1 x 28 x 28 -> convolution -> 25 x 26 x 26 -> max pooling -> 25 x 13 x 13 -> convolution -> 50 x 11 x 11 -> max pooling -> 50 x 5 x 5.
How many parameters does each filter have? In the first convolution, 9 (a 3 x 3 filter on one input channel); in the second, 225 = 25 x 9 (a 3 x 3 filter across 25 input channels).
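A minimal Keras sketch of this architecture; the layer sizes match the shapes above (Keras reports them channels-last), and the 10-way softmax output is an assumption (e.g. MNIST digits), not from the slides:

from tensorflow import keras
from tensorflow.keras import layers

model = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),         # 1: black/white, 3: RGB
    layers.Conv2D(25, (3, 3)),              # -> 26 x 26 x 25; 9 weights per filter (plus bias)
    layers.MaxPooling2D((2, 2)),            # -> 13 x 13 x 25
    layers.Conv2D(50, (3, 3)),              # -> 11 x 11 x 50; 225 = 25 x 9 weights per filter (plus bias)
    layers.MaxPooling2D((2, 2)),            # ->  5 x  5 x 50
    layers.Flatten(),                       # -> 1250
    layers.Dense(10, activation="softmax"), # assumed 10-class output
])
model.summary()   # confirms the shapes and parameter counts above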

After the second pooling layer, the 50 x 5 x 5 output is flattened into a vector of length 1250 and fed to a fully connected feedforward network that produces the output.

AlphaGo
The neural network maps the 19 x 19 board positions to the next move. The input is a 19 x 19 matrix with black = 1, white = -1, and empty = 0. A fully connected feedforward network could be used, but a CNN performs much better.
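A tiny NumPy sketch of this input encoding (the stone positions are made up for illustration):

import numpy as np

board = np.zeros((19, 19), dtype=np.float32)   # empty points stay 0
board[3, 3] = 1.0        # a black stone -> 1
board[15, 15] = -1.0     # a white stone -> -1
x = board[np.newaxis, :, :, np.newaxis]        # (1, 19, 19, 1) batch for a CNN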

AlphaGo's policy network
Note: AlphaGo does not use max pooling; see the quotation from their Nature article.

CNNs in speech recognition
The input is a spectrogram (time on one axis, frequency on the other) treated as an image; the filters move in the frequency direction.

CNNs in text recognition
Source of image: http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.703.6858&rep=rep1&type=pdf

Outline (revisited); next topic: generative adversarial networks.

Generative adversarial networks
- GANs were proposed in 2014 by Ian Goodfellow et al.
- They have been used to generate images, video, poetry, and some simple conversation.
- Note that image processing is easy (all animals can do it), while NLP is hard (only humans can do it).
- This co-evolutionary approach may have a profound impact. Yoshua Bengio: this could become the key to making computers more intelligent.
- Ian Goodfellow: https:/ ; Radford (voices are also generated here): https:/ ; tips for training GANs: https:/

Autoencoders
An encoder network maps the input to a code and a decoder network reconstructs it, keeping the output as close as possible to the input. If we instead randomly generate a vector as the code and feed it to the decoder, do we get an image?

An autoencoder with three fully connected layers
Going from large to small, the network learns to compress. Training: model.fit(X, X). Cost function: sum_{k=1..N} ||x_hat_k - x_k||^2.

Visualizing the code
With a 2-D code, sweeping the code coordinates over a range such as [-1.5, 1.5] and running only the decoder shows how the generated images vary across the code space.
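A minimal Keras sketch of such an autoencoder (the 784-dimensional input, the layer sizes, and the 2-D code are illustrative assumptions in the spirit of the slides):

from tensorflow import keras
from tensorflow.keras import layers

autoencoder = keras.Sequential([
    keras.Input(shape=(784,)),
    layers.Dense(128, activation="relu"),     # encoder: large -> small
    layers.Dense(2, name="code"),             # 2-D code
    layers.Dense(784, activation="sigmoid"),  # decoder: reconstruct x
])
# Squared reconstruction error sum_k ||x_hat_k - x_k||^2, via MSE:
autoencoder.compile(optimizer="adam", loss="mse")
# Training target is the input itself:
# autoencoder.fit(X, X, epochs=10, batch_size=128)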

From autoencoders to the VAE
In a variational autoencoder the encoder outputs means m1, m2, m3 and log-variances sigma1, sigma2, sigma3; noise e1, e2, e3 is drawn from a normal distribution, and the code is
  c_i = exp(sigma_i) * e_i + m_i,  i = 1..3.
Training minimizes the reconstruction error plus the regularizer
  sum_i [ exp(sigma_i) - (1 + sigma_i) + (m_i)^2 ],
which constrains each sigma_i toward 0 (and each m_i toward 0), pushing the code distribution toward a standard normal. (Auto-Encoding Variational Bayes, https://arxiv.org/abs/1312.6114)
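A NumPy sketch of the sampling step and regularizer above (symbol names follow the slide; the seed and helper name are assumptions):

import numpy as np

def vae_code_and_penalty(m, sigma, rng=None):
    # c_i = exp(sigma_i) * e_i + m_i, with e_i drawn from N(0, 1);
    # penalty = sum_i exp(sigma_i) - (1 + sigma_i) + m_i^2, which is
    # minimized at sigma_i = 0 and m_i = 0 (code pushed toward N(0, I)).
    rng = rng or np.random.default_rng(0)
    e = rng.standard_normal(np.shape(m))
    c = np.exp(sigma) * e + m
    penalty = np.sum(np.exp(sigma) - (1.0 + sigma) + np.square(m))
    return c, penalty

c, penalty = vae_code_and_penalty(np.zeros(3), np.zeros(3))   # penalty == 0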

The problem with the VAE
It does not really try to model realistic images. Feeding a code to the decoder, the output is only kept "as close as possible" to a target: one output a single pixel off from the target can look realistic, while another output that is also a single pixel off looks fake, yet the VAE treats the two the same.

Generating step by step
Generator v1 is paired with discriminator v1, which compares the generated images against real images; then generator v2 with discriminator v2, generator v3 with discriminator v3, and so on. The discriminators are binary classifiers.

GAN: learning the discriminator
Generator v1 (something like the decoder in a VAE) takes a randomly sampled vector and produces images, labeled 0; real images sampled from the database are labeled 1. Discriminator v1 is trained to map an image to 1/0 (real or fake).

GAN: learning the generator
Feed a randomly sampled vector through generator v1 and then discriminator v1 (which might output, say, 0.13). Update the parameters of the generator so that its output is classified as "real" (as close to 1 as possible). Generator + discriminator = one network: use gradient descent to update the parameters of the generator while keeping the discriminator fixed; train the generator part, do not train the discriminator part. The two have opposite objectives.
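A minimal Keras sketch of "generator + discriminator = one network" with the discriminator fixed while the generator trains (the tiny layer sizes and z_dim = 100 are illustrative assumptions):

import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

z_dim = 100
generator = keras.Sequential([
    keras.Input(shape=(z_dim,)),
    layers.Dense(784, activation="sigmoid"),   # fake "image"
])
discriminator = keras.Sequential([
    keras.Input(shape=(784,)),
    layers.Dense(1, activation="sigmoid"),     # outputs 1/0: real or fake
])
discriminator.compile(optimizer="adam", loss="binary_crossentropy")

discriminator.trainable = False                # fix D, train only G
gan = keras.Sequential([generator, discriminator])
gan.compile(optimizer="adam", loss="binary_crossentropy")

# Push D(G(z)) toward the "real" label 1 through the stacked network:
z = np.random.normal(size=(16, z_dim))
gan.train_on_batch(z, np.ones((16, 1)))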

Generating anime-style ("2nd element") figures
Source of images: Dr. HY Lee's notes. DCGAN: https:/ You can use the following to start a project (but it is in Chinese): https:/
Results of a GAN generating anime figures after 100 rounds (this is fast; I think you can use your CPU), then 1,000, 2,000, 5,000, 10,000, 20,000, and 50,000 rounds.

The next few images are from the Goodfellow lecture: the traditional mean-squared-error result is averaged and blurry; the last two are produced by deep learning approaches. Vector arithmetic on the codes behaves similarly to word embeddings (DCGAN paper). 256 x 256 high-resolution pictures are generated by the Plug and Play generative network. Conditional models can also go from natural language descriptions to pictures.

GAN derivation: maximum likelihood estimation
- Given a data distribution P_data(x).
- We use a distribution P_G(x; theta), parameterized by theta, to approximate it. E.g. P_G(x; theta) is a Gaussian mixture model, where theta contains the means and variances of the Gaussians. We wish to find theta such that P_G(x; theta) is close to P_data(x).
- To do this, we can sample x1, x2, ..., xm from P_data(x).
- The likelihood of generating these xi under P_G is L = prod_{i=1..m} P_G(xi; theta).
- Then we find the theta* maximizing L.

KL (Kullback-Leibler) divergence
- Discrete: D_KL(P || Q) = sum_i P(i) log [P(i) / Q(i)]
- Continuous: D_KL(P || Q) = integral p(x) log [p(x) / q(x)] dx
- Explanation: the entropy -sum_i P(i) log P(i) is the expected code length under an optimal code for P; the cross entropy -sum_i P(i) log Q(i) is the expected code length when using the code that is optimal for Q. Hence D_KL(P || Q) = sum_i P(i) [log P(i) - log Q(i)] is the expected number of extra bits.
- JSD(P || Q) = (1/2) D_KL(P || M) + (1/2) D_KL(Q || M), with M = (P + Q)/2, is the Jensen-Shannon divergence, a symmetrized KL.
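Both divergences are easy to compute for discrete distributions; a small NumPy sketch (the disjoint-support example at the end matches the log 2 bound used later in the deck):

import numpy as np

def kl(p, q):
    # Discrete KL divergence: sum_i p_i * log(p_i / q_i), with 0 log 0 = 0.
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

def jsd(p, q):
    # Jensen-Shannon divergence with M = (P + Q) / 2.
    m = (np.asarray(p, float) + np.asarray(q, float)) / 2
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

print(jsd([1.0, 0.0], [0.0, 1.0]))   # log 2 ~ 0.693 for disjoint supports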

Maximum likelihood, rewritten
theta* = arg max_theta prod_{i=1..m} P_G(xi; theta)
       = arg max_theta log prod_{i=1..m} P_G(xi; theta)
       = arg max_theta sum_{i=1..m} log P_G(xi; theta), with x1, ..., xm sampled from P_data(x)
       ~ arg max_theta integral P_data(x) log P_G(x; theta) dx   (a cross entropy)
       = arg max_theta [ integral P_data(x) log P_G(x; theta) dx - integral P_data(x) log P_data(x) dx ]
       = arg min_theta KL(P_data(x) || P_G(x; theta))   (a KL divergence)
(Subtracting the theta-independent entropy term does not change the maximizer.)
Note: if P_G is a Gaussian mixture model, the best theta still yields Gaussians, which can only generate a few blobs; this maximum likelihood approach therefore does not work well. Next we introduce the GAN, which changes P_G itself rather than just estimating P_G's parameters: we will find the best P_G, which can be far more complicated and structured, to approximate P_data (https:/ ). But how do we compute the likelihood?

Using a neural network as P_G(x; theta)
A generator G maps a prior distribution over z (in a smaller dimension) to x = G(z) (in a larger dimension), inducing
  P_G(x) = integral_z P_prior(z) I[G(z) = x] dz,
which approximates P_data(x) without an explicit likelihood.

The basic idea of GAN
- Generator G: G is a function; input z, output x. Given a prior distribution P_prior(z), a probability distribution P_G(x) is defined by the function G.
- Discriminator D: D is a function; input x, output a scalar.

It evaluates the "difference" between P_G(x) and P_data(x). For D to measure the difference between P_data and P_G, we need a cost function V(G, D) and solve
  G* = arg min_G max_D V(G, D).
Note that we are changing the distribution G itself, not just updating its parameters (as in the maximum likelihood case); P_G is hard to learn by maximum likelihood.

The basic idea
G* = arg min_G max_D V(G, D). Given a generator G, max_D V(G, D) evaluates the "difference" between P_G and P_data; we then pick the G whose P_G is most similar to P_data. Choosing V so that this difference becomes the JSD:
  V = E_{x~P_data} [log D(x)] + E_{x~P_G} [log(1 - D(x))].

Given G, what is the optimal D* maximizing V?
  V = integral [ P_data(x) log D(x) + P_G(x) log(1 - D(x)) ] dx.
For a fixed x, write f(D) = a log D + b log(1 - D) with a = P_data(x) and b = P_G(x); assuming D(x) can take any value here, setting f'(D) = a/D - b/(1 - D) = 0 gives D* = a/(a + b). Thus
  D*(x) = P_data(x) / (P_data(x) + P_G(x)),
and V(G1, D1*) measures the "difference" between P_G1 and P_data.

For each candidate generator, D_1*(x) = P_data(x) / (P_data(x) + P_G_1(x)), D_2*(x) = P_data(x) / (P_data(x) + P_G_2(x)), and so on. Evaluating max_D V(G, D) at the optimum, with D*(x) = P_data / (P_data + P_G) and 1 - D*(x) = P_G / (P_data + P_G):
  max_D V(G, D) = V(G, D*)
                = E_{x~P_data} [log D*(x)] + E_{x~P_G} [log(1 - D*(x))]
                = integral [ P_data(x) log D*(x) + P_G(x) log(1 - D*(x)) ] dx
                = -2 log 2 + 2 JSD(P_data || P_G),
where JSD(P || Q) = (1/2) D_KL(P || M) + (1/2) D_KL(Q || M) is the Jensen-Shannon divergence, M = (P + Q)/2, and D_KL(P || Q) = integral P(x) log [P(x)/Q(x)] dx.

Summary
- Generator G, discriminator D; we look for G* = arg min_G max_D V(G, D), with V = E_{x~P_data} [log D(x)] + E_{x~P_G} [log(1 - D(x))].
- Given G, max_D V(G, D) = -2 log 2 + 2 JSD(P_data(x) || P_G(x)).
- What is the optimal G? The G that makes the JSD smallest, namely 0: P_G(x) = P_data(x).

Algorithm
To find the best G minimizing the loss function L(G) = max_D V(G, D):
  theta_G <- theta_G - eta * dL(G)/d(theta_G), where theta_G defines G,
solved by gradient descent. Having a max inside the loss is fine: consider the simple case f(x) = max{D1(x), D2(x), D3(x)}; in each region, if Di(x) is the maximum there, just take dDi(x)/dx.

Algorithm (iterating G and D), with G* = arg min_G max_D V(G, D) and L(G) the loss function:
- Given G0, find D0* maximizing V(G0, D); V(G0, D0*) gives the JS divergence between P_data(x) and P_G0(x).
- theta_G <- theta_G - eta * dV(G, D0*)/d(theta_G), obtaining G1 (this decreases the JSD).
- Find D1* maximizing V(G1, D); V(G1, D1*) gives the JS divergence between P_data(x) and P_G1(x).
- theta_G <- theta_G - eta * dV(G, D1*)/d(theta_G), obtaining G2 (decreasing the JSD).
- And so on.

In practice
V = E_{x~P_data} [log D(x)] + E_{x~P_G} [log(1 - D(x))]. Maximizing this is what a binary classifier does when it minimizes cross-entropy: with output D(x), minimize -log D(x) if x is a positive example and -log(1 - D(x)) if x is a negative example.
Given G, how do we compute max_D V(G, D)? Sample x1, ..., xm from P_data and x*1, ..., x*m from the generator P_G, then maximize
  V~ = (1/m) sum_{i=1..m} log D(xi) + (1/m) sum_{i=1..m} log(1 - D(x*i)).
The positive examples D must accept; the negative examples D must reject. D is a binary classifier (it can be deep) with parameters theta_d: maximizing V~ is minimizing the cross-entropy loss L = -V~.
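The sampled objective in code, with a sanity check (a blind discriminator D(x) = 1/2 gives V~ = -2 log 2, the minimum of max_D V(G, D); the function name is a hypothetical helper):

import numpy as np

def v_tilde(D, x_real, x_gen):
    # V~ = (1/m) sum_i log D(x_i) + (1/m) sum_i log(1 - D(x*_i)).
    # Maximizing V~ over D is minimizing binary cross-entropy with
    # label 1 for real samples and label 0 for generated samples.
    return np.mean(np.log(D(x_real))) + np.mean(np.log(1.0 - D(x_gen)))

D_blind = lambda x: np.full(len(x), 0.5)   # cannot tell real from fake
print(v_tilde(D_blind, np.zeros((8, 2)), np.ones((8, 2))))   # -2 log 2 ~ -1.386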

Equivalently, for a binary classifier with output f(x) that minimizes cross-entropy (minimize -log f(x) if x is a positive example, -log(1 - f(x)) if x is a negative example), take the positive examples to be x1, x2, ..., xm from P_data(x) and the negative examples to be x*1, x*2, ..., x*m from P_G(x).

Algorithm (note: we can only find a lower bound of the JSD, i.e. of max_D V(G, D))
Initialize theta_d for D and theta_g for G. In each training iteration:
Learning D (repeat k times):
- Sample m examples x1, x2, ..., xm from the data distribution P_data(x).
- Sample m noise samples z1, ..., zm from a simple prior P_prior(z).
- Obtain the generated data x*1, ..., x*m, where x*i = G(zi).
- Update the discriminator parameters theta_d to maximize
    V~ = (1/m) sum_{i=1..m} log D(xi) + (1/m) sum_{i=1..m} log(1 - D(x*i)),
    theta_d <- theta_d + eta * grad V~(theta_d)   (gradient ascent).
Learning G (only once):
- Sample another m noise samples z1, z2, ..., zm from the prior P_prior(z), with G(zi) = x*i.
- Update the generator parameters theta_g to minimize
    V~ = (1/m) sum_{i=1..m} log D(xi) + (1/m) sum_{i=1..m} log(1 - D(x*i)),
    theta_g <- theta_g - eta * grad V~(theta_g)   (gradient descent; the first sum does not depend on theta_g).
Ian Goodfellow's comment: this generator step is also done once.
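A sketch of this loop, reusing the compiled generator, discriminator, and stacked gan models from the earlier sketch (m, k, and the normal prior follow the boxed algorithm; the concrete values are assumptions):

import numpy as np

def train_gan(X_data, iterations, m=64, k=1, z_dim=100):
    ones, zeros = np.ones((m, 1)), np.zeros((m, 1))
    for _ in range(iterations):
        for _ in range(k):                              # learning D, k times
            x = X_data[np.random.randint(0, len(X_data), m)]
            z = np.random.normal(size=(m, z_dim))       # z_i ~ P_prior(z)
            x_star = generator.predict(z, verbose=0)    # x*_i = G(z_i)
            discriminator.train_on_batch(x, ones)       # push D(x) toward 1
            discriminator.train_on_batch(x_star, zeros) # push D(x*) toward 0
        z = np.random.normal(size=(m, z_dim))           # learning G, only once
        gan.train_on_batch(z, ones)                     # move D(G(z)) toward real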

The generator objective function in practice
In a real implementation, samples x from P_G are labeled as positive for the generator update. With the original objective V = E_{x~P_data} [log D(x)] + E_{x~P_G} [log(1 - D(x))], training is slow at the beginning; the fix is to have the generator instead minimize V = E_{x~P_G} [-log D(x)].
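The gradients make the difference plain; a two-line numeric check (the values of d = D(G(z)) are chosen for illustration):

import numpy as np

# d/dd [log(1 - d)] = -1/(1 - d): about -1 near d = 0 (weak signal).
# d/dd [-log d]     = -1/d:       huge near d = 0 (strong signal).
for d in [0.01, 0.1, 0.5]:
    print(d, -1.0 / (1.0 - d), -1.0 / d)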

Some problems in training GANs
M. Arjovsky and L. Bottou, Towards Principled Methods for Training Generative Adversarial Networks, arXiv preprint, 2017.

Evaluating the JS divergence
When the discriminator is too strong, the estimate saturates for all three generators, and the JS divergence estimated by the discriminator tells us little (https://arxiv.org/abs/1701.07875): it cannot distinguish a weak generator from a strong one.

Reason 1: we approximate by sampling. A too-strong discriminator outputs 1 for all positive examples and 0 for all negative examples, so the sampled objective
  V~ = (1/m) sum_{i=1..m} log D(xi) + (1/m) sum_{i=1..m} log(1 - D(x*i)) = 0,
and max_D V(G, D) = -2 log 2 + 2 JSD(P_data || P_G) then reports JSD = log 2, its value when P_data and P_G differ completely, no matter how close they actually are. Weaken your discriminator? But can a weak discriminator compute the JS divergence?

Reason 2: the nature of the data. P_data(x) and P_G(x) have very little overlap in a high-dimensional space, so while the theoretical estimation varies, the GAN implementation's estimated overlap is about 0 and the reported divergence stays at log 2.

Evolution of the generator (http:/ )
Even when P_G_50 is visibly closer to P_data than P_G_0, JSD(P_G_0 || P_data) = log 2 and JSD(P_G_50 || P_data) = log 2 are identical; only JSD(P_G_100 || P_data) = 0 differs, so the training signal does not reward the intermediate improvement.

A simple solution: add noise
- Add some artificial noise to the inputs of the discriminator.
- Make the labels noisy for the discriminator.
Then P_data(x) and P_G(x) have some overlap and the discriminator cannot perfectly separate real from generated data. The noise needs to decay over time.
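A sketch of both fixes (the noise scale, decay schedule, and label jitter are assumed hyperparameters, not from the slides):

import numpy as np

def noisy_batch(x, labels, step, total_steps, rng=None):
    # Decaying Gaussian noise on the discriminator's inputs, plus noisy
    # labels, so D cannot perfectly separate real from generated data.
    rng = rng or np.random.default_rng(0)
    scale = 0.1 * (1.0 - step / total_steps)           # decays over time
    x_noisy = x + rng.normal(0.0, scale, x.shape)
    labels_noisy = np.clip(labels + rng.uniform(-0.1, 0.1, labels.shape), 0.0, 1.0)
    return x_noisy, labels_noisy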

Mode collapse
The generated distribution covers only part of the data distribution. Sometimes this is hard to tell, since one sees only what is generated, not what is missed; e.g. the generator converges to the same faces. Mode collapse example: with a mixture of 8 Gaussian distributions as the target P_data, what we want is all 8 modes covered, but in reality the generated samples cover only one or a few of them.

Text to image, by conditional GAN
Text-to-image results, e.g. "red flower with black center". Project topic: the code and data are all on the web; many possibilities! (From HY Lee's lecture.)

Algorithm (WGAN-style variant)
The training loop is the same as the boxed GAN algorithm above (repeat the D step k times per iteration; the G step, as Ian Goodfellow comments, is also done only once), with one change: the discriminator update theta_d <- theta_d + eta * grad V~(theta_d) uses gradient ascent plus weight clipping.
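A sketch of the clipping step (Keras-style; the clip value c = 0.01 is an assumed hyperparameter):

import numpy as np

def clip_weights(model, c=0.01):
    # Clip every weight of the critic/discriminator into [-c, c]
    # after each gradient-ascent update.
    for layer in model.layers:
        layer.set_weights([np.clip(w, -c, c) for w in layer.get_weights()])

# after each discriminator update:
# discriminator.train_on_batch(...); clip_weights(discriminator, c=0.01)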

WGAN experimental results
Approximating a mixture of Gaussians by a single mixture model.

WGAN background
- We have seen that the JSD does not give a GAN a smooth and continuous improvement curve.
- We would like to find another distance that gives that.
- This is the Wasserstein distance, or earth mover's distance.

Earth mover's distance
- Consider one distribution P as a pile of earth (the total amount of earth is 1), and another distribution Q (another pile of earth) as the target.
- The "earth mover's distance" or "Wasserstein distance" is the average distance the earth mover has to move the earth under an optimal plan (the best plan to move P to Q, at overall cost d).

JS vs. earth mover's distance
For generators P_G_0, P_G_50, P_G_100 at distances d0, d50, 0 from P_data:
  JS(P_G_0, P_data) = log 2, JS(P_G_50, P_data) = log 2, JS(P_G_100, P_data) = 0;
  W(P_G_0, P_data) = d0, W(P_G_50, P_data) = d50, W(P_G_100, P_data) = 0.
The JSD jumps from log 2 straight to 0, while W decreases smoothly as the generator improves.

Explaining WGAN
Let W be the Wasserstein distance.
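For one-dimensional samples this distance is available directly in SciPy; a quick check that it tracks how far apart two non-overlapping "piles of earth" are (the positions are assumed for illustration), unlike the JSD, which stays at log 2 until they overlap:

import numpy as np
from scipy.stats import wasserstein_distance

p = np.zeros(1000)                 # all mass at position 0
for d in [50.0, 25.0, 5.0, 0.0]:
    q = np.full(1000, d)           # all mass at position d
    print(d, wasserstein_distance(p, q))   # prints d: the optimal plan moves mass distance d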
