Deep Learning Tutorial
(Thanks to Hung-yi Lee)

Deep learning attracts lots of attention (see Google Trends) and obtains many exciting results. This talk focuses on the technical part.

Outline
Part I: Introduction of Deep Learning
Part II: Why Deep?
Part III: Tips for Training Deep Neural Network
Part IV: Neural Network with Memory

Part I: Introduction of Deep Learning
(What people already knew in the 1980s)

Example Application: Handwriting Digit Recognition
The machine takes an image and outputs "2". The input image has 16 x 16 = 256 pixels, represented as a vector x1, x2, ..., x256 (ink = 1, no ink = 0). The output is a vector y1, y2, ..., y10, where each dimension represents the confidence that the image is a particular digit, e.g. "is 1": 0.1, "is 2": 0.7, "is 0": 0.2, so the image is recognized as "2".

Element of Neural Network: a neuron takes inputs a1, ..., aK with weights w1, ..., wK and a bias b, computes z = a1 w1 + a2 w2 + ... + aK wK + b, and outputs a = σ(z), where σ is the activation function.

Neural Network: the neurons are organized into an input layer, hidden layers, and an output layer.
The input x1, x2, ..., xN is fed into Layer 1, Layer 2, ..., Layer L, which produce the outputs y1, y2, ..., yM; each layer is made of the neurons described above. "Deep" means many hidden layers.

Example of Neural Network: take the sigmoid function σ(z) = 1 / (1 + e^(-z)) as the activation function. For the input (1, -1), a first layer with weights (1, -2) and (-1, 1) and biases (1, 0) computes z = (4, -2) and outputs (σ(4), σ(-2)) = (0.98, 0.12); the following layers transform these values in the same way until the final outputs are produced. Feeding a different input, e.g. (0, 0), through the same network gives different values at every layer (the first layer now outputs (0.73, 0.5)) and different final outputs.
A network structure with a given set of parameters therefore defines a function; different parameters define different functions.

Matrix Operation: the computation of one layer can be written as a matrix operation. For the example above,
σ( [[1, -2], [-1, 1]] [1, -1]^T + [1, 0]^T ) = σ( [4, -2]^T ) = [0.98, 0.12]^T.
Stacking the layers, with weight matrices W^1, ..., W^L and bias vectors b^1, ..., b^L:
a^1 = σ(W^1 x + b^1), a^2 = σ(W^2 a^1 + b^2), ..., y = σ(W^L a^(L-1) + b^L),
so the whole network is y = f(x) = σ(W^L ... σ(W^2 σ(W^1 x + b^1) + b^2) ... + b^L).
Writing the computation as matrix operations lets us use parallel computing techniques (such as GPUs) to speed it up.
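As a concrete illustration, here is a minimal NumPy sketch of this layer-by-layer forward pass (not from the original slides); the first layer's weights and biases are taken from the example above, and the function names are just illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, weights, biases):
    """Compute y = sigmoid(W^L ... sigmoid(W^1 x + b^1) ... + b^L)."""
    a = x
    for W, b in zip(weights, biases):
        a = sigmoid(W @ a + b)
    return a

# First layer from the slide's example: input (1, -1) -> output (0.98, 0.12).
W1 = np.array([[1.0, -2.0],
               [-1.0, 1.0]])
b1 = np.array([1.0, 0.0])
x = np.array([1.0, -1.0])

print(forward(x, [W1], [b1]))   # ~[0.98 0.12]
```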
Softmax: use a softmax layer as the output layer.
An ordinary output layer computes y_i = σ(z_i); in general the output of the network can then be any value, which may not be easy to interpret.
A softmax layer instead computes y_i = e^(z_i) / Σ_j e^(z_j). For example, z = (3, 1, -3) gives e^z ≈ (20, 2.7, 0.05) and y ≈ (0.88, 0.12, 0), so the outputs are positive and sum to 1.
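A minimal NumPy sketch of the softmax layer above; subtracting the maximum before exponentiating is a standard numerical-stability trick not mentioned in the slides.

```python
import numpy as np

def softmax(z):
    """y_i = exp(z_i) / sum_j exp(z_j), computed in a numerically stable way."""
    e = np.exp(z - np.max(z))
    return e / e.sum()

z = np.array([3.0, 1.0, -3.0])
print(softmax(z))   # ~[0.88 0.12 0.00], sums to 1
```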
How to set the network parameters?
Given a 16 x 16 = 256-pixel image as input (ink = 1, no ink = 0) and a softmax output y1, ..., y10: if the input is an image of "1", y1 should have the maximum value; if the input is an image of "2", y2 should have the maximum value. How do we let the neural network achieve this?

Training Data: prepare training data, i.e. images and their labels (e.g. "5", "0", "4", "1", "3", "1", "2", "9"), and use the training data to find the network parameters.

Cost: for an image of "1" the target is (1, 0, ..., 0); if the network outputs (0.2, 0.3, ..., 0.5), the cost measures how far the output is from the target. The cost can be the Euclidean distance or the cross entropy between the network output and the target.

Total Cost: for all R training examples x^1, ..., x^R, the total cost is C = Σ_r C_r, the sum of the costs over all training data.
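A minimal sketch of the cross-entropy cost for one training example, using a shortened three-class version of the slide's target and output; the clipping epsilon is only there to avoid log(0).

```python
import numpy as np

def cross_entropy(y, target, eps=1e-12):
    """C = -sum_i target_i * ln(y_i) for a one-hot target."""
    y = np.clip(y, eps, 1.0)
    return -np.sum(target * np.log(y))

target = np.array([1.0, 0.0, 0.0])   # image of "1"
y      = np.array([0.2, 0.3, 0.5])   # network output
print(cross_entropy(y, target))      # -ln(0.2) ~ 1.61
```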
Gradient Descent: assume there are only two parameters w1 and w2 in the network. Pick a starting point, compute the gradient of C, and move in the negative gradient direction over the error surface (the value of C over the parameter space); eventually we reach a minimum.

Local Minima: gradient descent never guarantees the global minimum; starting from different points we may reach different minima and get different results ("Who is Afraid of Non-Convex Loss Functions?"). Besides local minima, the cost surface also has plateaus, where learning is very slow, and saddle points, where we can get stuck.

Momentum: in the physical world, a ball rolling down the error surface has momentum that can carry it past plateaus and small local minima. How about putting this phenomenon into gradient descent? With momentum, the real movement is the negative of the gradient plus the momentum (the previous movement), so even where the gradient is 0 the parameters keep moving. This still does not guarantee reaching the global minimum, but it gives some hope.
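A minimal sketch of one common formulation of gradient descent with momentum; the gradient function, learning rate, and momentum coefficient here are illustrative placeholders.

```python
import numpy as np

def gd_with_momentum(grad_fn, w, lr=0.01, momentum=0.9, steps=100):
    """Movement = momentum * previous movement - lr * gradient."""
    movement = np.zeros_like(w)
    for _ in range(steps):
        g = grad_fn(w)
        movement = momentum * movement - lr * g
        w = w + movement          # keeps moving even where g == 0
    return w

# Toy quadratic cost C(w) = sum(w**2), whose gradient is 2*w.
w0 = np.array([2.0, -3.0])
print(gd_with_momentum(lambda w: 2 * w, w0))   # approaches [0, 0]
```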
Mini-batch: instead of computing the cost on all training data, randomly partition the training data into mini-batches (e.g. x^1 and x^31 in the 1st batch, x^2 and x^16 in the 2nd batch). Pick the 1st batch and update the parameters using the gradient of the cost on that batch only, then pick the 2nd batch, and so on. The cost C we minimize is therefore different each time we update the parameters, so compared with the original gradient descent (where the error surface shows the total C on all training data) the trajectory looks unstable; in practice, however, mini-batch training is both faster and better. When all mini-batches have been picked once, that is one epoch; repeat the above process for many epochs.

Backpropagation: a network can have millions of parameters, and backpropagation is the way to compute all the gradients efficiently (not covered today). Ref: http://speech.ee.ntu.edu.tw/tlkagk/courses/MLDS_2015_2/Lecture/DNN%20backprop.ecm.mp4/index.html. Many toolkits can compute the gradients automatically. Ref: http://speech.ee.ntu.edu.tw/tlkagk/courses/MLDS_2015_2/Lecture/Theano%20DNN.ecm.mp4/index.html
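A minimal sketch of the mini-batch loop described above; compute_gradient is a hypothetical stand-in for whatever backpropagation (or a toolkit) would return for one batch, and the learning rate, batch size, and epoch count are illustrative.

```python
import numpy as np

def minibatch_sgd(params, data, compute_gradient, lr=0.1,
                  batch_size=16, epochs=10):
    """Update the parameters once per mini-batch; one pass over all
    mini-batches is one epoch."""
    n = len(data)
    for _ in range(epochs):
        order = np.random.permutation(n)          # reshuffle every epoch
        for start in range(0, n, batch_size):
            batch = [data[i] for i in order[start:start + batch_size]]
            # compute_gradient is hypothetical: gradient of C on this batch
            g = compute_gradient(params, batch)
            params = params - lr * g
    return params
```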
Part II: Why Deep?

Deeper is Better? Word error rates (%) of speech recognition networks of different depths and sizes, from Seide, Frank, Gang Li, and Dong Yu, "Conversational Speech Transcription Using Context-Dependent Deep Neural Networks," Interspeech 2011:

Layer X Size   WER (%)    Layer X Size   WER (%)
1 X 2k         24.2
2 X 2k         20.4
3 X 2k         18.4
4 X 2k         17.8
5 X 2k         17.2       1 X 3772       22.5
7 X 2k         17.1       1 X 4634       22.6
                          1 X 16k        22.1

Going deeper helps, but this is not surprising: more parameters, better performance.

Universality Theorem: any continuous function f: R^N → R^M can be realized by a network with one hidden layer (given enough hidden neurons). So why a "deep" neural network and not a "fat" neural network?

Fat + Short v.s. Thin + Tall: with the same number of parameters, which one is better? The right-hand column of the table above answers this: a shallow network matched in parameter count to a deep one (1 X 3772 v.s. 5 X 2k, or 1 X 4634 v.s. 7 X 2k) has a clearly worse word error rate, so thin and tall wins.
Why Deep? Deep → Modularization.
Suppose we want four image classifiers: Classifier 1 for girls with long hair, Classifier 2 for boys with long hair, Classifier 3 for girls with short hair, and Classifier 4 for boys with short hair. There are plenty of examples of girls with long hair, girls with short hair, and boys with short hair, but only a few examples of boys with long hair, so Classifier 2 trained directly on its little data is weak.

With deep modularization, we first train basic classifiers for the attributes: long or short hair? boy or girl? Each basic classifier can have sufficient training examples. The four final classifiers then share these basic classifiers as modules, so each of them can be trained with little data and still work fine.

In a deep network, the most basic classifiers are the 1st layer; the 1st layer is used as a module to build the classifiers of the 2nd layer, the 2nd layer is used as a module to build the next, and so on. The modularization is automatically learned from data. Less training data? Deep learning also works on small data sets like TIMIT.

SVM: a hand-crafted kernel function followed by a simple classifier (source of image: http://www.gipsa-lab.grenoble-inp.fr/transfert/seminaire/455_Kadri2013Gipsa-lab.pdf).
Deep Learning: the hidden layers act as a learnable kernel, followed by a simple classifier (the output layer).
Hard to get the power of Deep: before 2006, deeper usually did not imply better.

Part III: Tips for Training DNN

Recipe for Learning (ref: http:/.au/2015/04/the-basic-recipe-for-machine-learning-explained-in-a-single-powerpoint-slide/): if the results on the training data are not good, modify the network or use a better optimization strategy; if the training results are good but the test results are not, prevent overfitting. Don't forget about overfitting! The tips below follow this recipe:
- New activation functions, for example ReLU or Maxout (modify the network).
- Adaptive learning rates (better optimization strategy).
- Dropout (preventing overfitting); only use this approach when you have already obtained good results on the training data.
New Activation Function

ReLU: the Rectified Linear Unit, a = z if z > 0 and a = 0 otherwise. Reasons to use it: 1. fast to compute; 2. biological reason; 3. it behaves like an infinite number of sigmoids with different biases; 4. it helps with the vanishing gradient problem. [Xavier Glorot, AISTATS'11] [Andrew L. Maas, ICML'13] [Kaiming He, arXiv'15]

Vanishing Gradient Problem: in a deep sigmoid network, the layers near the input have smaller gradients and learn very slowly, so they stay almost random, while the layers near the output have larger gradients, learn very fast, and converge quickly, based on the nearly random features below them. Intuitively, the gradient of a weight near the input is the effect of a small change of that weight on the output; each sigmoid it passes through squashes a large input change into a small output change, so after many layers the effect, and hence the gradient, is small. In 2006 people used RBM pre-training to get around this; in 2015 people use ReLU.

With ReLU, a neuron whose output is 0 can be removed from the network, and the remaining active neurons are linear, so the network becomes a thinner linear network in which the gradients do not vanish.
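A minimal NumPy sketch of the ReLU activation and of the "thinner linear network" view, where inactive neurons simply output 0; the z values are illustrative.

```python
import numpy as np

def relu(z):
    """a = z if z > 0, otherwise a = 0."""
    return np.maximum(0.0, z)

z = np.array([4.0, -2.0, 0.5, -0.1])
a = relu(z)
print(a)       # [4.  0.  0.5 0. ]
print(a > 0)   # active neurons; the rest can be removed,
               # leaving a thinner linear network
```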
Maxout: a learnable activation function [Ian J. Goodfellow, ICML'13]. The z values feeding a layer are divided into groups, and each group outputs the maximum of its elements, e.g. max(z1, z2). ReLU is a special case of Maxout, and you can have more than 2 elements in a group. The activation function of a maxout network can be any piecewise linear convex function, where the number of pieces depends on the number of elements in a group (2 elements in a group give 2 pieces, 3 elements give 3 pieces).
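A minimal sketch of a maxout unit under the grouping just described; the group size and z values are illustrative.

```python
import numpy as np

def maxout(z, group_size=2):
    """Split the z values into groups and output the max of each group."""
    z = np.asarray(z).reshape(-1, group_size)
    return z.max(axis=1)

z = np.array([1.0, -0.5, 7.0, 2.0])   # two groups of two elements
print(maxout(z))                       # [1. 7.]
```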
Adaptive Learning Rate

Learning Rate: if the learning rate is too large, the cost may not decrease after each update; if the learning rate is too small, training is too slow. So set the learning rate carefully. Can we give different parameters different learning rates?

Adagrad: a parameter-dependent learning rate. In the original gradient descent each parameter w is updated as w ← w − η g_t, where g_t is the current derivative ∂C/∂w and η is a constant learning rate. Adagrad instead uses w ← w − (η / sqrt(Σ_{i=0}^{t} g_i^2)) g_t, dividing the learning rate by the square root of the summation of the squares of the previous derivatives; each parameter w is considered separately. For example, a parameter with derivatives g0 = 0.1, g1 = 0.2 ends up with a much larger effective learning rate than one with g0 = 20.0, g1 = 10.0. Observations: 1. the learning rate becomes smaller and smaller for all parameters; 2. parameters with smaller derivatives get larger learning rates, and vice versa. Why? Intuitively, a parameter whose derivatives are consistently small sits on a gentle direction of the error surface and needs larger steps, while one with large derivatives sits on a steep direction and needs smaller steps.
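A minimal sketch of the Adagrad update above, run on the slide's two example parameters to show that the one with smaller derivatives keeps a larger effective learning rate; the learning rate and epsilon are illustrative.

```python
import numpy as np

def adagrad_step(w, g, sum_sq, lr=1.0, eps=1e-8):
    """w <- w - lr / sqrt(sum of squared past derivatives) * g."""
    sum_sq = sum_sq + g ** 2
    w = w - lr / (np.sqrt(sum_sq) + eps) * g
    return w, sum_sq

w = np.array([0.0, 0.0])       # two parameters, considered separately
sum_sq = np.zeros(2)
for g in [np.array([0.1, 20.0]), np.array([0.2, 10.0])]:
    w, sum_sq = adagrad_step(w, g, sum_sq)

print(1.0 / np.sqrt(sum_sq))   # effective learning rates: much larger for
                               # the parameter with the smaller derivatives
```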
Adagrad is not the whole story; there is a family of adaptive learning rate methods: Adagrad [John Duchi, JMLR'11], RMSprop, Adadelta [Matthew D. Zeiler, arXiv'12], Adam [Diederik P. Kingma, ICLR'15], AdaSecant [Caglar Gulcehre, arXiv'14], and "No more pesky learning rates" [Tom Schaul, arXiv'12].

Dropout
Training: each time before computing the gradients (i.e. for each mini-batch):
- Each neuron has a probability of p% to drop out, so the structure of the network changes and the network becomes thinner.
- Use the new, thinner network for training on that mini-batch.
For each mini-batch, we resample the dropped-out neurons.

Testing: no dropout. If the dropout rate at training is p%, all the weights are multiplied by (1 − p)%.
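A minimal NumPy sketch of these two rules, train-time masking of neurons and test-time scaling of the weights; the layer here is a single fully connected sigmoid layer, and p is the dropout rate.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def layer_train(x, W, b, p=0.5):
    """Training: each incoming neuron is dropped with probability p,
    and the gradients would be computed on this thinner network."""
    mask = (np.random.rand(x.shape[0]) >= p).astype(float)
    return sigmoid(W @ (x * mask) + b)

def layer_test(x, W, b, p=0.5):
    """Testing: no dropout, but all the weights are multiplied by (1 - p)."""
    return sigmoid((W * (1.0 - p)) @ x + b)
```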
Dropout, Intuitive Reason: when people team up, if everyone expects the partner to do the work, nothing gets done in the end; however, if you know your partner may drop out, you will do better ("my partner may slack off, so I have to work hard"). When testing, no one actually drops out, so we eventually obtain good results.

Why should the weights be multiplied by (1 − p)% when testing? Assume the dropout rate is 50%: during training, about half of the inputs to each neuron are dropped, so if we test with no dropout using the weights from training, z would be roughly twice as large as what the neurons saw during training; multiplying the weights by (1 − p)% keeps z at about the same scale.

Dropout is a kind of ensemble. In an ensemble, we train a bunch of networks with different structures on different subsets of the training set, and at test time we feed the testing data x to all of them and average their outputs y1, y2, y3, y4. With dropout, each mini-batch trains one of the thinned networks; a network with M neurons has 2^M possible thinned networks, and their parameters are shared. At testing time we cannot explicitly average the outputs of all 2^M networks, but using the full network with all the weights multiplied by (1 − p)% approximates that average.
More about dropout: further references are [Nitish Srivastava, JMLR'14], [Pierre Baldi, NIPS'13], and [Geoffrey E. Hinton, arXiv'12]. Dropout works better with Maxout [Ian J. Goodfellow, ICML'13]. Dropconnect [Li Wan, ICML'13]: dropout deletes neurons, while dropconnect deletes the connections between neurons. Annealed dropout [S.J. Rennie, SLT'14]: the dropout rate decreases over epochs. Standout [J. Ba, NIPS'13]: each neuron has a different dropout rate.

Part IV: Neural Network with Memory

Neural Network needs Memory. Named Entity Recognition: detecting named entities like names of people, locations, organizations, etc. in a sentence. A DNN that sees only the word "apple" in isolation might output people 0.5, location 0.3, organization 0.1, none 0.1. But in the sentence "the president of apple eats an apple", the first "apple" (x4) should be tagged ORG while the second "apple" (x7) should be tagged NONE; a DNN applied word by word produces the same output for both, so the DNN needs memory!
Recurrent Neural Network (RNN): the outputs of the hidden layer are stored in a memory, and the memory can be considered as another input at the next time step. The same network is used again and again: with input weights Wi, recurrent weights Wh, and output weights Wo, the hidden activations a1, a2, a3 are computed from x1, x2, x3 together with the previous activations, and the outputs y1, y2, y3 are computed from them, so the output yi depends on x1, x2, ..., xi.

How to train? Given target outputs at each time step, find the network parameters that minimize the total cost (the sum of the per-step costs L1, L2, L3, ...); the gradients are computed with backpropagation through time (BPTT). Of course the network can be deep (several hidden layers at each time step). A Bidirectional RNN runs one RNN forward over xt, xt+1, xt+2 and another backward, and combines both hidden states to produce yt, yt+1, yt+2.
Many to Many (Output is shorter): both input and output are sequences, but the output is shorter, e.g. speech recognition. The input is a vector sequence (one acoustic feature vector per frame) and the output is a character sequence. If the network outputs one character per frame, e.g. "好 好 好 棒 棒 棒 棒 棒", trimming the repeated characters gives "好棒" ("great"), but then it can never output "好棒棒" (which genuinely repeats a character). Connectionist Temporal Classification (CTC) solves this: add an extra "null" symbol (written φ here), so that "好 φ φ 棒 φ φ φ φ" decodes to "好棒" while "好 φ φ 棒 φ 棒 φ φ" decodes to "好棒棒". [Alex Graves, ICML'06] [Alex Graves, ICML'14] [Haim Sak, Interspeech'15] [Jie Li, Interspeech'15] [Andrew Senior, ASRU'15]
Many to Many (No Limitation): both input and output are sequences with different lengths, i.e. sequence-to-sequence learning, e.g. machine translation ("machine learning" → 機器學習). The RNN reads "machine", then "learning"; its final hidden state contains all the information about the input sequence, and the decoder then generates 機, 器, 學, 習, ... The problem is that it does not know when to stop: by inertia it keeps generating characters. The fix is to add a stop symbol "=" (斷, "break"), which the decoder learns to generate when the output is complete. [Ilya Sutskever, NIPS'14] [Dzmitry Bahdanau, arXiv'15]
Unfortunately, RNN-based networks are not always easy to learn. In real experiments on language modeling (thanks to 曾柏翔 for providing the experimental results), sometimes the total loss decreases smoothly, if you are lucky, and sometimes it fluctuates wildly. The error surface is rough: plotted over w1 and w2, the cost is either very flat or very steep, so the gradient is either tiny or enormous.

Clipping [Razvan Pascanu, ICML'13]: when the gradient is too large, clip it to a threshold. Why is the surface like this? Toy example: a network with a single recurrent neuron with weight w, run for 1000 time steps with input 1 at the first step and 0 afterwards, so y1000 = w^999. With w = 1, y1000 = 1; with w = 1.01, y1000 ≈ 20000; with w = 0.99 or w = 0.01, y1000 ≈ 0. Around w = 1 the gradient is huge (suggesting a small learning rate?), while just below it the gradient is almost zero (suggesting a large learning rate?); the same parameter needs both, which is why clipping the occasional huge gradient helps.
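A minimal sketch of the toy example and of norm-based gradient clipping as just described; the clipping threshold is arbitrary.

```python
import numpy as np

# Toy example: y_1000 = w ** 999 for a single recurrent weight w.
for w in [1.0, 1.01, 0.99, 0.01]:
    print(w, w ** 999)        # 1.0, ~2e4, ~4e-5, 0.0

def clip_gradient(g, threshold=5.0):
    """Rescale the gradient if its norm exceeds the threshold."""
    norm = np.linalg.norm(g)
    if norm > threshold:
        g = g * (threshold / norm)
    return g

print(clip_gradient(np.array([300.0, -400.0])))   # norm 500 -> [3, -4]
```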
Helpful Techniques:
- Nesterov's Accelerated Gradient (NAG): an advanced momentum method.
- RMSProp: an advanced approach that gives each parameter a different learning rate, considering the change of the second derivatives.
- Long Short-term Memory (LSTM): can deal with gradient vanishing (though not gradient explosion).

Long Short-term Memory (LSTM): a memory cell with an input gate, an output gate, and a forget gate. Signals from other parts of the network control the input gate (whether the input is written into the cell), the forget gate (whether the cell keeps its stored value), and the output gate (whether the stored value is read out). An LSTM unit is thus a special neuron with 4 inputs (the candidate input z and the three gate signals z_i, z_f, z_o) and 1 output. The gate activation function f is usually a sigmoid: its value lies between 0 and 1, mimicking an open or closed gate.

In the original network, we simply replace the neurons with LSTM units, which needs 4 times as many parameters. Unrolled over time, at step t the inputs x_t produce z, z_i, z_f, z_o, the cell state goes from c_{t-1} to c_t, and the output y_t is produced. Extension: "peephole" connections also feed h_{t-1} and c_{t-1} into the gates.
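A minimal NumPy sketch of one LSTM step under the standard formulation (sigmoid input, forget, and output gates, tanh for the cell candidate); peephole connections are omitted, and all weights and sizes are illustrative.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step: 4 inputs (z, z_i, z_f, z_o) computed from [x_t, h_prev],
    1 output h_t, plus the updated cell state c_t."""
    v = np.concatenate([x_t, h_prev])
    z   = np.tanh(W["z"] @ v + b["z"])        # candidate input
    z_i = sigmoid(W["i"] @ v + b["i"])        # input gate
    z_f = sigmoid(W["f"] @ v + b["f"])        # forget gate
    z_o = sigmoid(W["o"] @ v + b["o"])        # output gate
    c_t = z_f * c_prev + z_i * z              # memory cell update
    h_t = z_o * np.tanh(c_t)                  # output
    return h_t, c_t

# Tiny usage example with random weights (4 gates -> 4x the parameters).
rng = np.random.default_rng(1)
n_in, n_hid = 3, 4
W = {k: rng.normal(size=(n_hid, n_in + n_hid)) for k in "zifo"}
b = {k: np.zeros(n_hid) for k in "zifo"}
h, c = np.zeros(n_hid), np.zeros(n_hid)
h, c = lstm_step(rng.normal(size=n_in), h, c, W, b)
```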
Other Simpler Alternatives: the Gated Recurrent Unit (GRU) [Cho, EMNLP'14] and the Structurally Constrained Recurrent Network (SCRN) [Tomas Mikolov, ICLR'15]. A vanilla RNN initialized with the identity matrix and the ReLU activation function [Quoc V. Le, arXiv'15] can outperform or be comparable with LSTM on 4 different tasks.

What is the next wave? Attention-based models: a DNN/LSTM equipped with a reading head controller and a writing head controller that read from and write to an internal memory (or information from the output). Attention has already been applied to speech recognition, caption generation, QA, and visual QA.