Review: Gradient Descent

In step 3 we have to solve the following optimization problem:

$\theta^* = \arg\min_\theta L(\theta)$

where $L$ is the loss function and $\theta$ is the set of parameters. Suppose $\theta$ has two variables $\{\theta_1, \theta_2\}$. Starting from a random point $\theta^0$, we repeatedly update

$\theta^i = \theta^{i-1} - \eta\,\nabla L(\theta^{i-1})$, with $\nabla L(\theta) = \begin{bmatrix} \partial L/\partial\theta_1 \\ \partial L/\partial\theta_2 \end{bmatrix}$

where $\eta$ is the learning rate. Movement: each step moves by $-\eta\,\nabla L(\theta)$. The gradient points along the normal direction of the contour lines of the loss, so every update moves perpendicular to the contours, toward lower loss.

Tip 1: Tuning your learning rates

Plot the loss against the number of parameter updates. If the learning rate is very large, the loss explodes; if it is somewhat too large, the loss bounces around and may never decrease; if it is too small, training is very slow; only a well-chosen learning rate makes the loss drop quickly and smoothly. Set the learning rate carefully. If there are more than three parameters, you cannot visualize the loss surface, but you can always visualize the loss versus the number of updates.
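As a concrete illustration of the update rule and of Tip 1, here is a minimal sketch that runs vanilla gradient descent on a toy two-parameter quadratic loss with several learning rates. The loss function, starting point, and learning-rate values are assumptions chosen for illustration, not from the lecture.

import numpy as np

def L(theta):
    # example quadratic loss (a placeholder, not from the lecture)
    return (theta[0] - 1.0)**2 + 10.0 * (theta[1] + 3.0)**2

def grad_L(theta):
    return np.array([2.0 * (theta[0] - 1.0), 20.0 * (theta[1] + 3.0)])

for eta in (0.001, 0.01, 0.09, 0.15):          # candidate learning rates
    theta = np.array([0.0, 0.0])               # starting point theta^0
    losses = []
    for t in range(100):
        theta = theta - eta * grad_L(theta)    # theta^t = theta^(t-1) - eta * grad L(theta^(t-1))
        losses.append(L(theta))
    print(f"eta = {eta:5.3f}: final loss = {losses[-1]:.3e}")

# Too small: the loss is still far from the minimum after 100 updates.
# Well chosen: the loss is near zero. Too large: the loss blows up.
# Plotting `losses` against t reproduces the loss-vs-updates curves described above.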
Adaptive Learning Rates

A simple popular idea is to reduce the learning rate over time, e.g. a $1/t$ decay: $\eta^t = \eta/\sqrt{t+1}$. But one learning rate cannot fit all parameters; it is better to give each parameter its own learning rate.

Adagrad: divide the learning rate of each parameter by the root mean square of its previous derivatives. Consider a single parameter $w$.

Vanilla gradient descent: $w^{t+1} = w^t - \eta^t g^t$, where $g^t = \dfrac{\partial L(\theta^t)}{\partial w}$ and $\eta^t = \dfrac{\eta}{\sqrt{t+1}}$.

Adagrad: $w^{t+1} = w^t - \dfrac{\eta^t}{\sigma^t} g^t$, where $\sigma^t$ is the root mean square of the previous derivatives of $w$ (so it is parameter dependent):

$\sigma^t = \sqrt{\dfrac{1}{t+1}\sum_{i=0}^{t}(g^i)^2}$

Since $\eta^t$ and $\sigma^t$ both contain a $1/\sqrt{t+1}$ factor, the factors cancel:

$w^{t+1} = w^t - \dfrac{\eta}{\sqrt{\sum_{i=0}^{t}(g^i)^2}}\, g^t$

Contradiction? In vanilla gradient descent, a larger gradient means a larger step. In Adagrad, a larger gradient in the numerator means a larger step, while a larger gradient in the denominator means a smaller step.
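A minimal sketch of the Adagrad update above for a single parameter; the toy loss, learning rate, and the small epsilon added for numerical safety are illustrative assumptions.

import numpy as np

def grad(w):
    # toy 1-D loss L(w) = (w - 2)^2, so dL/dw = 2*(w - 2)   (placeholder, not from the lecture)
    return 2.0 * (w - 2.0)

eta = 1.0
eps = 1e-8            # avoids division by zero on the first step
w = 0.0
sum_sq_grad = 0.0
for t in range(100):
    g = grad(w)
    sum_sq_grad += g ** 2
    # w^{t+1} = w^t - eta / sqrt(sum of squared past gradients) * g^t
    w = w - eta / (np.sqrt(sum_sq_grad) + eps) * g
print(w)              # approaches the minimum at w = 2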
Intuitive Reason

The Adagrad step measures how surprising a gradient is: it creates a contrast effect against the past gradients.

$g^0, g^1, g^2, g^3, g^4 = 0.001,\ 0.001,\ 0.003,\ 0.002,\ 0.1$ → the last gradient is surprisingly large (a strong contrast), so the relative step grows.
$g^0, g^1, g^2, g^3, g^4 = 10.8,\ 20.9,\ 31.7,\ 12.1,\ 0.1$ → the last gradient is surprisingly small, so the relative step shrinks.

Larger gradient, larger steps?

For a single parameter, a larger first derivative does suggest we are farther from the minimum. Consider $y = ax^2 + bx + c$ with its minimum at $x = -\dfrac{b}{2a}$. Starting from $x_0$, the best step is

$\left|x_0 + \dfrac{b}{2a}\right| = \dfrac{|2ax_0 + b|}{2a} = \dfrac{|\text{first derivative}|}{\text{second derivative}}$

Comparison between different parameters

"Larger first derivative means farther from the minimum" does not hold across parameters. Comparing points a and b on one parameter's loss curve with points c and d on another, flatter one: a point such as c can have a larger first derivative than a point such as a and still be closer to its own minimum. Do not cross parameters. The best step is $\dfrac{|\text{first derivative}|}{\text{second derivative}}$: a sharp direction has a larger second derivative, a flat direction a smaller one, so curvature must be taken into account.

Use first derivatives to estimate the second derivative

Computing second derivatives exactly is expensive. In Adagrad, the denominator $\sqrt{\sum_{i=0}^{t}(g^i)^2}$ uses the first derivatives collected over time to estimate the size of the second derivative: a direction with a larger second derivative tends to produce larger first derivatives, so their root mean square reflects the curvature without computing it directly.
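To make the "first derivatives estimate the second derivative" idea concrete, here is a small illustrative sketch; the specific quadratics and the sampling range are assumptions. For a quadratic, the second derivative is constant, and the root mean square of first derivatives sampled over a range grows in proportion to it.

import numpy as np

rng = np.random.default_rng(0)
xs = rng.uniform(-3.0, 3.0, size=10000)    # points sampled around the minimum

for a in (0.5, 1.0, 4.0):                  # L(w) = a * w^2 has second derivative 2a
    first_derivs = 2.0 * a * xs            # dL/dw at the sampled points
    rms = np.sqrt(np.mean(first_derivs ** 2))
    print(f"second derivative = {2*a:4.1f}, RMS of first derivatives = {rms:5.2f}")

# The RMS of the sampled first derivatives scales with the second derivative,
# which is what Adagrad's denominator exploits.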
Tip 2: Stochastic Gradient Descent

Stochastic gradient descent makes training faster.

Gradient descent: the loss is the summation over all training examples,

$L = \sum_n \left(\hat{y}^n - \left(b + \sum_i w_i x_i^n\right)\right)^2$, $\quad \theta^i = \theta^{i-1} - \eta\,\nabla L(\theta^{i-1})$

Stochastic gradient descent: pick one example $x^n$ and use the loss for that example only,

$L^n = \left(\hat{y}^n - \left(b + \sum_i w_i x_i^n\right)\right)^2$, $\quad \theta^i = \theta^{i-1} - \eta\,\nabla L^n(\theta^{i-1})$

Faster! Gradient descent sees all examples and updates once after seeing all of them; stochastic gradient descent sees only one example and updates after each example. If there are 20 examples, that is 20 updates in the time of one, i.e. 20 times faster per pass over the data.
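A minimal sketch of the stochastic update on a linear model; the synthetic data, learning rate, and number of epochs are assumptions made for illustration.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(20, 2))                 # 20 examples, 2 features
y_hat = X @ np.array([3.0, -2.0]) + 1.0      # targets from a known linear rule

w = np.zeros(2); b = 0.0; eta = 0.05

for epoch in range(50):
    # stochastic gradient descent: one update per example
    for n in rng.permutation(len(X)):
        err = y_hat[n] - (b + X[n] @ w)      # residual of a single example
        w += eta * 2 * err * X[n]            # gradient of err^2 w.r.t. w is -2*err*x
        b += eta * 2 * err
print(w, b)                                  # approaches (3, -2) and 1

# Plain gradient descent would instead sum the gradient over all 20 examples
# before making a single update per epoch.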
Tip 3: Feature Scaling

Make different features have the same scaling. (Source of figure: http://cs231n.github.io/neural-networks-2/)

Consider $y = b + w_1 x_1 + w_2 x_2$. If $x_1$ takes values around 1, 2 while $x_2$ takes values around 100, 200, then a small change in $w_2$ changes the loss much more than the same change in $w_1$: the loss contours are elongated ellipses, the gradient does not point straight at the minimum, and the learning rate is hard to tune. After scaling the two features to similar ranges, the contours become close to circles, the gradient points toward the minimum, and the updates are more efficient.

A common way to scale: for each dimension $i$, compute the mean $m_i$ and standard deviation $\sigma_i$ over the training examples, then for every example $r$ replace

$x_i^r \leftarrow \dfrac{x_i^r - m_i}{\sigma_i}$

After this, the means of all dimensions are 0 and the variances are all 1.
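A short sketch of the standardization above; the random data is a stand-in for real training examples, mimicking the two ranges used in the lecture's example.

import numpy as np

rng = np.random.default_rng(0)
# x1 on the order of 1-2, x2 on the order of 100-200
X = np.column_stack([rng.uniform(1, 2, 1000), rng.uniform(100, 200, 1000)])

mean = X.mean(axis=0)            # m_i for each dimension
std = X.std(axis=0)              # sigma_i for each dimension
X_scaled = (X - mean) / std      # x_i^r <- (x_i^r - m_i) / sigma_i

print(X_scaled.mean(axis=0))     # ~ [0, 0]
print(X_scaled.var(axis=0))      # ~ [1, 1]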
Gradient Descent Theory

Question: when solving $\theta^* = \arg\min_\theta L(\theta)$ by gradient descent, each update is supposed to give parameters with a smaller loss, i.e. $L(\theta^0) > L(\theta^1) > L(\theta^2) > \cdots$. Is this statement correct?
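The statement is not always true. A minimal counterexample sketch (the quadratic loss and the learning rate below are assumptions chosen to make the effect obvious): if the learning rate is too large for the steepest direction, a single gradient descent update increases the loss.

import numpy as np

def L(theta):
    # ill-conditioned quadratic: very steep along theta_2 (illustrative, not from the lecture)
    return theta[0]**2 + 100.0 * theta[1]**2

def grad_L(theta):
    return np.array([2.0 * theta[0], 200.0 * theta[1]])

theta0 = np.array([1.0, 1.0])
eta = 0.05                                # larger than the stable range (< 2/200) for theta_2
theta1 = theta0 - eta * grad_L(theta0)    # one gradient descent update
print(L(theta0), L(theta1))               # 101.0 vs 8100.81: the loss increased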
Warning of Math

Formal Derivation

Suppose $\theta$ has two variables $\{\theta_1, \theta_2\}$, and gradient descent produces a sequence $\theta^0 \to \theta^1 \to \theta^2 \to \cdots$. Given a point, we can easily find the point with the smallest value of $L(\theta)$ nearby. How? With the Taylor series.

Taylor Series

Taylor series: let $h(x)$ be any function infinitely differentiable around $x = x_0$. Then

$h(x) = \sum_{k=0}^{\infty} \dfrac{h^{(k)}(x_0)}{k!}(x - x_0)^k = h(x_0) + h'(x_0)(x - x_0) + \dfrac{h''(x_0)}{2!}(x - x_0)^2 + \cdots$

When $x$ is close to $x_0$:

$h(x) \approx h(x_0) + h'(x_0)(x - x_0)$

E.g. for $h(x) = \sin(x)$ around $x_0 = \pi/4$, the first-order approximation is $\sin(x) \approx \sin(\pi/4) + \cos(\pi/4)\,(x - \pi/4)$, which is good around $\pi/4$.
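A quick numerical check of the sin(x) example; the evaluation points are arbitrary.

import numpy as np

x0 = np.pi / 4
for x in (x0 + 0.01, x0 + 0.1, x0 + 1.0):
    approx = np.sin(x0) + np.cos(x0) * (x - x0)   # first-order Taylor approximation around pi/4
    print(f"x - x0 = {x - x0:4.2f}: sin(x) = {np.sin(x):.4f}, approx = {approx:.4f}")
# The approximation is accurate when x is close to pi/4 and degrades as x moves away.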
Multivariable Taylor Series

When $x$ and $y$ are close to $x_0$ and $y_0$:

$h(x, y) \approx h(x_0, y_0) + \dfrac{\partial h(x_0, y_0)}{\partial x}(x - x_0) + \dfrac{\partial h(x_0, y_0)}{\partial y}(y - y_0)$

plus something related to $(x - x_0)^2$ and $(y - y_0)^2$, which is ignored when the point is close enough.

Back to Formal Derivation

Let the current point be $(a, b)$ and draw a red circle around it. Based on the Taylor series, if the red circle is small enough, then inside the red circle

$L(\theta) \approx L(a, b) + \dfrac{\partial L(a, b)}{\partial \theta_1}(\theta_1 - a) + \dfrac{\partial L(a, b)}{\partial \theta_2}(\theta_2 - b)$

Writing $s = L(a, b)$, $u = \dfrac{\partial L(a, b)}{\partial \theta_1}$, $v = \dfrac{\partial L(a, b)}{\partial \theta_2}$, this becomes

$L(\theta) \approx s + u(\theta_1 - a) + v(\theta_2 - b)$
Find $\theta_1$ and $\theta_2$ inside the red circle that minimize $L(\theta)$, where the red circle is

$(\theta_1 - a)^2 + (\theta_2 - b)^2 \le d^2$

Since $s$ is a constant, we only need to minimize $u(\theta_1 - a) + v(\theta_2 - b)$. Let $\Delta\theta_1 = \theta_1 - a$ and $\Delta\theta_2 = \theta_2 - b$. Minimizing the inner product of $(\Delta\theta_1, \Delta\theta_2)$ and $(u, v)$ within the circle means choosing $(\Delta\theta_1, \Delta\theta_2)$ opposite to $(u, v)$ and scaled to reach the boundary of the circle:

$\begin{bmatrix}\Delta\theta_1 \\ \Delta\theta_2\end{bmatrix} = -\eta\begin{bmatrix}u \\ v\end{bmatrix}
\quad\Rightarrow\quad
\begin{bmatrix}\theta_1 \\ \theta_2\end{bmatrix} = \begin{bmatrix}a \\ b\end{bmatrix} - \eta\begin{bmatrix}u \\ v\end{bmatrix} = \begin{bmatrix}a \\ b\end{bmatrix} - \eta\begin{bmatrix}\partial L(a, b)/\partial\theta_1 \\ \partial L(a, b)/\partial\theta_2\end{bmatrix}$

This is gradient descent. Simple, right? The approximation is not satisfied if the red circle, and hence the learning rate that plays the role of its radius, is not small enough; this is why a too-large learning rate can make the loss increase. You can also consider the second-order term, e.g. Newton's method.

End of Warning

More Limitation of Gradient Descent

Plotting the loss against the value of a parameter $w$: gradient descent can be very slow at a plateau (gradient close to zero), can stop at a saddle point (gradient exactly zero), and can get stuck at a local minimum.
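To see these failure modes concretely, here is a small sketch; the function, starting point, and learning rate are assumptions chosen to exhibit the behaviour. Gradient descent on a 1-D loss with a local minimum stops there and never reaches the lower minimum.

import numpy as np

def L(w):
    # 1-D loss with a local minimum near w = +1 and a lower (global) minimum near w = -1
    # (made up for illustration, not from the lecture)
    return (w**2 - 1)**2 + 0.3 * w

def dL(w):
    return 4 * w * (w**2 - 1) + 0.3

w, eta = 2.0, 0.01
for t in range(2000):
    w = w - eta * dL(w)             # plain gradient descent update

print(f"stopped at w = {w:.3f}, L(w) = {L(w):.3f}")
# Starting from w = 2, gradient descent settles in the local minimum near w ~ 0.96
# even though the global minimum near w ~ -1.04 has lower loss; plateaus and saddle
# points cause similar trouble because the gradient there is approximately zero.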