Hidden Markov Model (HMM)

Hidden Markov Model
- The problems with the template method
- HMM is a popular statistical tool
- Discrete-time Markov process
- Theory of HMM: the three basic problems

Review: the template method
- Key idea: derive typical sequences of speech frames for a pattern via some averaging procedure
- Rely on local spectral distance measures to compare patterns
- Use dynamic programming to temporally align patterns

Problems of the template method
- Speech is a random signal
- The template method is not a statistical method in the strict sense
- Statistical techniques have been widely used in clustering to create reference patterns
- The statistical signal characterization inherent in the template representation is only implicit and often inadequate: it neglects the second-order statistics
- Lack of robustness

HMM: a popular tool
- The basic theory of HMM was published in a series of classic papers by Baum and his colleagues in the late 1960s and early 1970s
- HMM was implemented for speech-processing applications by Baker at CMU, and by Jelinek and his colleagues at IBM in the 1970s
- HMM provides a natural and highly reliable way of recognizing speech for a wide range of applications

HMM: a popular tool
- The underlying assumptions of the HMM:
  - the speech signal can be well characterized as a parametric random process
  - the parameters of the stochastic process can be determined in a precise, well-defined manner

Discrete-Time Markov Process
- A system with N discrete states indexed by 1, 2, ..., N; q_t is the state at time t
[Figure: a 5-state Markov chain with transition probabilities a_{ij} between states 1-5]

Discrete-Time Markov Process
- Markov property:
  P(q_t = j \mid q_{t-1} = i, q_{t-2} = k, \dots) = P(q_t = j \mid q_{t-1} = i)
- State-transition probabilities:
  a_{ij} = P(q_t = j \mid q_{t-1} = i), \quad 1 \le i, j \le N
  with a_{ij} \ge 0 and \sum_{j=1}^{N} a_{ij} = 1 for every i
- A time-invariant system?

Observable Markov Model
- Each state corresponds to an observable event
- Example: weather
  State 1: rain or snow
  State 2: cloudy
  State 3: sunny
- The weather is observed once a day
- For what cases could such a model be used?
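To make the observable Markov model concrete, here is a minimal Python sketch of the three-state weather chain. The transition probabilities, initial distribution, and the example sequence are illustrative numbers chosen for this sketch, not values from the text.

```python
import numpy as np

# Illustrative (made-up) parameters for the 3-state weather chain:
# state 0 = rain/snow, state 1 = cloudy, state 2 = sunny.
P = np.array([[0.4, 0.3, 0.3],
              [0.2, 0.6, 0.2],
              [0.1, 0.1, 0.8]])   # P[i, j] = P(q_t = j | q_{t-1} = i), rows sum to 1
pi = np.array([0.3, 0.3, 0.4])    # initial state distribution

def sequence_probability(states):
    """P(q_1, q_2, ..., q_T) for an observable Markov chain (0-based state indices)."""
    p = pi[states[0]]
    for prev, cur in zip(states[:-1], states[1:]):
        p *= P[prev, cur]
    return p

# Probability of observing sunny, sunny, rain/snow, cloudy on four consecutive days.
print(sequence_probability([2, 2, 0, 1]))
```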
Extensions to Hidden Markov Models: the Urn-and-Ball Model
- N glass urns, each containing balls of M distinct colors
- An urn is randomly selected first, and then a ball is chosen at random; its color is recorded as the observation
- The ball is then replaced in the urn from which it was selected
- The procedure is repeated

2.2 The Urn-and-Ball Model
[Figure: the urn-and-ball model]

HMM for weather forecasting
- What operations would you design to carry out the ball selection?
- How would you extend the Markov process to an HMM to give a more precise weather forecast?

Theory of HMM
- Topology
- Elements
- Bi-hidden processes
- Three basic problems

HMM Topology: Ergodic
[Figure: a fully connected (ergodic) 4-state HMM]

HMM Topology: Left-right
[Figure: a 4-state left-right HMM]
Parallel-path left-right HMM
[Figure: a 6-state left-right HMM with parallel paths]

Elements of HMM
1. N: the number of states in the model
2. M: the number of distinct observation symbols per state
3. The state-transition probability distribution A = \{a_{ij}\}, where
   a_{ij} = P(q_{t+1} = j \mid q_t = i), \quad 1 \le i, j \le N
4. The observation probability distribution B = \{b_j(o)\}
5. The initial state distribution \pi = \{\pi_i\}, where
   \pi_i = P(q_1 = i), \quad 1 \le i \le N

Elements of HMM
- We use the compact notation \lambda = (A, B, \pi) to indicate the complete parameter set of the model. This parameter set, of course, defines a probability measure P(O \mid \lambda) for O, which we discuss later. We use the terminology HMM to indicate the parameter set \lambda and the associated probability measure interchangeably, without ambiguity.

Bi-hidden processes
- The states (hidden)
- The observations

The Three Basic Problems
- Evaluation: the forward procedure
- Optimal path: the Viterbi algorithm
- Training: the Baum-Welch algorithm
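The elements above map directly onto arrays. The following sketch (all numbers are made up for illustration) encodes a toy urn-and-ball HMM as \lambda = (A, B, \pi) and generates an observation sequence exactly as the urn-and-ball procedure describes; the variable names are our own, and this toy model is reused in the later sketches.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy urn-and-ball HMM: N = 3 urns (hidden states), M = 3 ball colors (symbols).
N, M = 3, 3
A  = np.array([[0.6, 0.3, 0.1],     # A[i, j] = P(q_{t+1} = j | q_t = i)
               [0.2, 0.5, 0.3],
               [0.3, 0.2, 0.5]])
B  = np.array([[0.7, 0.2, 0.1],     # B[j, k] = b_j(k) = P(o_t = k | q_t = j)
               [0.1, 0.6, 0.3],
               [0.2, 0.3, 0.5]])
pi = np.array([0.5, 0.3, 0.2])      # pi[i] = P(q_1 = i)

def sample(T):
    """Generate (states, observations) following the urn-and-ball procedure:
    pick an urn, draw a ball, record its color, replace it, move to the next urn."""
    states, obs = [], []
    q = rng.choice(N, p=pi)
    for _ in range(T):
        states.append(q)
        obs.append(rng.choice(M, p=B[q]))
        q = rng.choice(N, p=A[q])
    return states, obs

states, obs = sample(10)
print("hidden states:", states)
print("observations :", obs)
```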
Problem 1 (Evaluation): Given the observation sequence O = (o_1, o_2, \dots, o_T) and a model \lambda = (A, B, \pi), how do we efficiently compute P(O \mid \lambda), the probability of the observation sequence given the model? We can also view this problem as one of scoring how well a given model matches a given observation sequence. Solving this problem allows us to choose the model that best matches the observations.

Problem 2: Given the observation sequence O = (o_1, o_2, \dots, o_T) and the model \lambda = (A, B, \pi), how do we choose a corresponding state sequence q = (q_1, q_2, \dots, q_T) that is optimal in some sense? In this problem we try to find the "correct" state sequence. We usually use an optimality criterion to solve this problem as well as possible.

Problem 3: How do we adjust the model parameters \lambda = (A, B, \pi) to maximize P(O \mid \lambda)? In this problem we attempt to optimize the model parameters to best describe how a given observation sequence comes about. The observation sequence used to adjust the model parameters is called a training sequence, because it is used to train the HMM.
Probability Evaluation
- We wish to calculate the probability of the observation sequence. Consider one such fixed state sequence
  q = (q_1, q_2, \dots, q_T)
  where q_1 is the initial state.
- The probability of the observation sequence O given the state sequence q is
  P(O \mid q, \lambda) = \prod_{t=1}^{T} P(o_t \mid q_t, \lambda) = b_{q_1}(o_1)\, b_{q_2}(o_2) \cdots b_{q_T}(o_T)
  where we have assumed statistical independence of the observations.

Probability Evaluation
- The probability of such a state sequence q can be written as
  P(q \mid \lambda) = \pi_{q_1} a_{q_1 q_2} a_{q_2 q_3} \cdots a_{q_{T-1} q_T}
- The joint probability of O and q occurring simultaneously is simply the product of the two terms above:
  P(O, q \mid \lambda) = P(O \mid q, \lambda)\, P(q \mid \lambda)

Probability Evaluation
- The probability of O is obtained by summing this joint probability over all possible state sequences q, giving
  P(O \mid \lambda) = \sum_{\text{all } q} P(O \mid q, \lambda)\, P(q \mid \lambda)
                    = \sum_{q_1, q_2, \dots, q_T} \pi_{q_1} b_{q_1}(o_1)\, a_{q_1 q_2} b_{q_2}(o_2) \cdots a_{q_{T-1} q_T} b_{q_T}(o_T)
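A direct way to read this summation is to enumerate every state sequence, as in the sketch below. It reuses the toy A, B, pi from the earlier sketch together with a short made-up observation sequence obs, and mainly illustrates why the roughly T * N^T cost of direct evaluation motivates the forward procedure.

```python
from itertools import product

# Reuses the toy A, B, pi from the sketch above; obs is a short, made-up symbol sequence.
obs = [0, 1, 1, 2]
T, N = len(obs), len(pi)

total = 0.0
for q in product(range(N), repeat=T):       # all N**T state sequences
    p_q = pi[q[0]]                          # P(q | lambda)
    for t in range(1, T):
        p_q *= A[q[t - 1], q[t]]
    p_O_given_q = 1.0                       # P(O | q, lambda)
    for t in range(T):
        p_O_given_q *= B[q[t], obs[t]]
    total += p_O_given_q * p_q              # accumulate P(O, q | lambda)

print("P(O | lambda) by enumeration:", total)   # cost grows as roughly T * N**T
```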
A. The Forward Procedure
- Consider the forward variable \alpha_t(i) defined as
  \alpha_t(i) = P(o_1 o_2 \dots o_t, q_t = i \mid \lambda)
  that is, the probability of the partial observation sequence o_1 o_2 \dots o_t (until time t) and state i at time t, given the model \lambda. We can solve for \alpha_t(i) inductively, as follows.

Forward Procedure
1. Initialization:
   \alpha_1(i) = \pi_i\, b_i(o_1), \quad 1 \le i \le N
2. Induction:
   \alpha_{t+1}(j) = \Big[\sum_{i=1}^{N} \alpha_t(i)\, a_{ij}\Big] b_j(o_{t+1}), \quad 1 \le j \le N, \ 1 \le t \le T-1
3. Termination:
   P(O \mid \lambda) = \sum_{i=1}^{N} \alpha_T(i)
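A minimal NumPy sketch of the three steps, reusing the toy A, B, pi and obs from the sketches above; it should return the same P(O | lambda) as the enumeration, but in O(N^2 T) operations.

```python
import numpy as np

def forward(A, B, pi, obs):
    """Forward procedure: alpha[t, i] = P(o_1 ... o_{t+1}, q_{t+1} = i | lambda), with 0-based t."""
    T, N = len(obs), A.shape[0]
    alpha = np.zeros((T, N))
    alpha[0] = pi * B[:, obs[0]]                      # initialization
    for t in range(1, T):                             # induction
        alpha[t] = (alpha[t - 1] @ A) * B[:, obs[t]]
    return alpha

alpha = forward(A, B, pi, obs)                        # same toy model and obs as above
print("P(O | lambda) by forward:", alpha[-1].sum())   # termination: sum_i alpha_T(i)
```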
B. The Backward Procedure
- In a similar manner, we can consider a backward variable \beta_t(i) defined as
  \beta_t(i) = P(o_{t+1} o_{t+2} \dots o_T \mid q_t = i, \lambda)
  that is, the probability of the partial observation sequence from t+1 to the end, given state i at time t and the model \lambda. Again we can solve for \beta_t(i) inductively, as follows.

Backward Procedure
1. Initialization:
   \beta_T(i) = 1, \quad 1 \le i \le N
2. Induction:
   \beta_t(i) = \sum_{j=1}^{N} a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j), \quad t = T-1, T-2, \dots, 1, \ 1 \le i \le N

Backward procedure
- The initialization step 1 arbitrarily defines \beta_T(i) to be 1 for all i.
- Step 2, illustrated in the next figure, shows that in order to have been in state i at time t, and to account for the observation sequence from time t+1 on, you have to consider all possible states j at time t+1, accounting for the transition from i to j as well as the observation o_{t+1} in state j, and then account for the remaining partial observation sequence from state j.
- We will see later how the backward as well as the forward calculations are used to help solve fundamental problems 2 and 3 of HMMs.

Backward procedure
[Figure: computing \beta_t(i) from \beta_{t+1}(j): transitions a_{i1}, a_{i2}, \dots, a_{iN} from state s_i at time t to states s_1, s_2, \dots, s_N at time t+1]
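The induction translates into one vector recursion per time step. This sketch reuses the toy model and obs from above; the final print is a consistency check, since P(O | lambda) can also be written as \sum_i \pi_i b_i(o_1) \beta_1(i).

```python
import numpy as np

def backward(A, B, obs):
    """Backward procedure: beta[t, i] = P(o_{t+2} ... o_T | q_{t+1} = i, lambda), with 0-based t."""
    T, N = len(obs), A.shape[0]
    beta = np.zeros((T, N))
    beta[-1] = 1.0                                    # initialization: beta_T(i) = 1
    for t in range(T - 2, -1, -1):                    # induction, t = T-1, ..., 1
        beta[t] = A @ (B[:, obs[t + 1]] * beta[t + 1])
    return beta

beta = backward(A, B, obs)
# Consistency check: P(O | lambda) = sum_i pi_i * b_i(o_1) * beta_1(i)
print("P(O | lambda) by backward:", (pi * B[:, obs[0]] * beta[0]).sum())
```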
Backward procedure
- There are several possible ways of solving problem 2, finding the "optimal" state sequence associated with the given observation sequence. To approach problem 2, we can define the a posteriori probability variable
  \gamma_t(i) = P(q_t = i \mid O, \lambda)
  that is, the probability of being in state i at time t, given the observation sequence O and the model \lambda. We can express \gamma_t(i) in several forms, including
  \gamma_t(i) = P(q_t = i \mid O, \lambda) = \frac{P(O, q_t = i \mid \lambda)}{P(O \mid \lambda)} = \frac{P(O, q_t = i \mid \lambda)}{\sum_{i=1}^{N} P(O, q_t = i \mid \lambda)}

Backward procedure
- Since P(O, q_t = i \mid \lambda) is equal to \alpha_t(i)\, \beta_t(i), we can write \gamma_t(i) as
  \gamma_t(i) = \frac{\alpha_t(i)\, \beta_t(i)}{\sum_{i=1}^{N} \alpha_t(i)\, \beta_t(i)}
- Here \alpha_t(i) accounts for the partial observation sequence o_1 o_2 \dots o_t and state i at time t, while \beta_t(i) accounts for the remainder of the observation sequence o_{t+1} o_{t+2} \dots o_T, given state q_t = i. Using \gamma_t(i), we can solve for the individually most likely state q_t at time t as
  q_t = \arg\max_{1 \le i \le N} \gamma_t(i), \quad 1 \le t \le T
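Combining the two passes gives \gamma_t(i) directly. This short sketch assumes the alpha and beta arrays computed by the forward/backward sketches above are in scope.

```python
import numpy as np

# gamma_t(i) = alpha_t(i) * beta_t(i) / sum_i alpha_t(i) * beta_t(i),
# reusing alpha and beta from the forward/backward sketches above.
gamma = alpha * beta
gamma /= gamma.sum(axis=1, keepdims=True)

# Individually most likely state at each time t (note: these states, taken together,
# do not necessarily form a valid path through the model).
print("argmax_i gamma_t(i):", gamma.argmax(axis=1))
```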
A. The Viterbi Algorithm
- To find the single best state sequence q = (q_1, q_2, \dots, q_T) for the given observation sequence O = (o_1, o_2, \dots, o_T), we need to define the quantity
  \delta_t(i) = \max_{q_1, q_2, \dots, q_{t-1}} P(q_1 q_2 \dots q_{t-1}, q_t = i, o_1 o_2 \dots o_t \mid \lambda)

Viterbi Algorithm
- That is, \delta_t(i) is the best score along a single path at time t which accounts for the first t observations and ends in state i. By induction we have
  \delta_{t+1}(j) = \big[\max_i \delta_t(i)\, a_{ij}\big]\, b_j(o_{t+1})

Viterbi Algorithm
- The complete procedure for finding the best state sequence can now be stated as follows.
1. Initialization:
   \delta_1(i) = \pi_i\, b_i(o_1), \quad 1 \le i \le N
   \psi_1(i) = 0

Viterbi Algorithm
2. Recursion:
   \delta_t(j) = \max_{1 \le i \le N} \big[\delta_{t-1}(i)\, a_{ij}\big]\, b_j(o_t), \quad 2 \le t \le T, \ 1 \le j \le N
   \psi_t(j) = \arg\max_{1 \le i \le N} \big[\delta_{t-1}(i)\, a_{ij}\big], \quad 2 \le t \le T, \ 1 \le j \le N
3. Termination:
   P^* = \max_{1 \le i \le N} \delta_T(i)
   q_T^* = \arg\max_{1 \le i \le N} \delta_T(i)

Viterbi Algorithm
4. Path (state sequence) backtracking:
   q_t^* = \psi_{t+1}(q_{t+1}^*), \quad t = T-1, T-2, \dots, 1
- It should be noted that the Viterbi algorithm is similar in implementation to the forward calculation.
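A compact sketch of steps 1-4, again on the toy model and obs from the earlier sketches; `viterbi` is our own helper name, not notation from the text.

```python
import numpy as np

def viterbi(A, B, pi, obs):
    """Viterbi algorithm: best single state sequence q* and its probability P*."""
    T, N = len(obs), A.shape[0]
    delta = np.zeros((T, N))
    psi = np.zeros((T, N), dtype=int)
    delta[0] = pi * B[:, obs[0]]                      # 1. initialization
    for t in range(1, T):                             # 2. recursion
        scores = delta[t - 1, :, None] * A            # scores[i, j] = delta_{t-1}(i) * a_ij
        psi[t] = scores.argmax(axis=0)
        delta[t] = scores.max(axis=0) * B[:, obs[t]]
    q = np.zeros(T, dtype=int)                        # 3. termination, 4. backtracking
    q[-1] = delta[-1].argmax()
    for t in range(T - 2, -1, -1):
        q[t] = psi[t + 1, q[t + 1]]
    return q, delta[-1].max()

path, p_star = viterbi(A, B, pi, obs)
print("best state sequence:", path, "P*:", p_star)
```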
B. Alternative Viterbi Implementation
- By taking logarithms of the model parameters, the Viterbi algorithm of the preceding section can be implemented without the need for any multiplications, thus:

Viterbi Algorithm
0. Preprocessing:
   \tilde{\pi}_i = \log \pi_i, \quad 1 \le i \le N
   \tilde{b}_i(o_t) = \log b_i(o_t), \quad 1 \le i \le N, \ 1 \le t \le T
   \tilde{a}_{ij} = \log a_{ij}, \quad 1 \le i, j \le N

Viterbi Algorithm
1. Initialization:
   \tilde{\delta}_1(i) = \log \delta_1(i) = \tilde{\pi}_i + \tilde{b}_i(o_1), \quad 1 \le i \le N
   \psi_1(i) = 0
2. Recursion:
   \tilde{\delta}_t(j) = \max_{1 \le i \le N} \big[\tilde{\delta}_{t-1}(i) + \tilde{a}_{ij}\big] + \tilde{b}_j(o_t), \quad 2 \le t \le T, \ 1 \le j \le N
   \psi_t(j) = \arg\max_{1 \le i \le N} \big[\tilde{\delta}_{t-1}(i) + \tilde{a}_{ij}\big], \quad 2 \le t \le T, \ 1 \le j \le N

Viterbi Algorithm
3. Termination:
   \tilde{P}^* = \max_{1 \le i \le N} \tilde{\delta}_T(i)
   q_T^* = \arg\max_{1 \le i \le N} \tilde{\delta}_T(i)
4. Backtracking:
   q_t^* = \psi_{t+1}(q_{t+1}^*), \quad t = T-1, T-2, \dots, 1
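The same recursion in the log domain, as the preprocessing step suggests; this sketch only computes \tilde{P}^* (the backtracking pointers are tracked exactly as in the previous sketch), and the small constant guarding log(0) is our own addition for zero-probability entries.

```python
import numpy as np

# Log-domain Viterbi score, reusing the toy A, B, pi and obs from the earlier sketches.
eps = 1e-300                                   # guards log(0) for zero probabilities
logA, logB, logpi = np.log(A + eps), np.log(B + eps), np.log(pi + eps)

T, N = len(obs), A.shape[0]
phi = np.zeros((T, N))
phi[0] = logpi + logB[:, obs[0]]
for t in range(1, T):
    # phi[t, j] = max_i (phi[t-1, i] + log a_ij) + log b_j(o_t): additions and maxima only
    phi[t] = (phi[t - 1, :, None] + logA).max(axis=0) + logB[:, obs[t]]

print("log P*:", phi[-1].max())                # should match log(P*) from the sketch above
```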
Applications of HMM

1. HMM Application Domains
- Time-series modeling:
  - statistical acoustic models (speech recognition)
  - language models
  - communication systems
  - biological signal processing
  - handwritten character recognition
  - face recognition and feature extraction (Ferdinando Samaria et al. at Olivetti Research, Ltd)
  - gesture recognition

1.1 HMM applications in biological signal processing
- Protein and nucleic acid sequence analysis (Washington University)
- Recognition of human genes in DNA (University of California)
- Detecting remote protein homologies (UCSC)
- Estimating amino acid distributions

1.2 HMM applied to gesture recognition
- Hand motion is an effective means of human communication in the real world
[Figure: gesture recognition system diagram - joint and position sensors on the left and right hands feed fingertip/palm position and orientation calculation, quantization, and posture recognition; an HMM performs motion recognition/quantization and dynamic gesture recognition, producing gesture, motion, and position outputs from a gesture description file]

2. HMM Training Criteria
- ML: Maximum Likelihood
- MMI: Maximum Mutual Information
- MDI: Minimum Discrimination Information
- MMD: Maximum Model Distance
- CT: Corrective Training
- MCE: Minimum Classification Error
ML: Maximum Likelihood
- The standard ML design criterion is to use a training sequence of observations O to derive the set of model parameters \lambda, yielding
  \lambda_{ML} = \arg\max_{\lambda} P(O \mid \lambda)
- Any of the reestimation algorithms discussed previously provides a solution to this optimization problem.

MDI: Minimum Discrimination Information
- The minimum discrimination information (MDI) is a measure of closeness between two probability measures under the given constraint R, defined by
  \nu(R, P_{\lambda}) = \inf_{Q \in R} I(Q : P_{\lambda})
  where
  I(Q : P_{\lambda}) = \int q(O) \log \frac{q(O)}{p(O \mid \lambda)}\, dO

MMI: Maximum Mutual Information
- The standard ML criterion is to use O^v to estimate the model parameters \lambda_v, yielding
  \lambda_v^{ML} = \arg\max_{\lambda_v} P(O^v \mid \lambda_v)
- The mutual information between an observation sequence O^v and the word v, parameterized by \lambda = \{\lambda_v,\ v = 1, 2, \dots, V\}, is
  I_{\lambda}(O^v, v) = \log \frac{P(O^v, v \mid \lambda)}{P(O^v)\, P(v)}
- The MMI criterion is to find the entire model set \lambda such that the mutual information is maximized:
  \lambda_{MMI} = \arg\max_{\lambda} \sum_{v=1}^{V} I_{\lambda}(O^v, v)

3. Implementation Issues for HMMs
1. Scaling
2. Multiple observation sequences
3. Initial estimates of HMM parameters
4. Effects of insufficient training data
5. Choice of model
3.1 Scaling
- Initially, for t = 1, we set
  \ddot{\alpha}_1(i) = \alpha_1(i), \quad c_1 = \frac{1}{\sum_{i=1}^{N} \ddot{\alpha}_1(i)}, \quad \hat{\alpha}_1(i) = c_1\, \ddot{\alpha}_1(i)
- For each t, 2 \le t \le T, we first compute \ddot{\alpha}_t(i) in terms of the previously scaled \hat{\alpha}_{t-1}(j); that is,
  \ddot{\alpha}_t(i) = \sum_{j=1}^{N} \hat{\alpha}_{t-1}(j)\, a_{ji}\, b_i(o_t)
- We determine the scaling coefficient c_t as
  c_t = \frac{1}{\sum_{i=1}^{N} \ddot{\alpha}_t(i)}
  giving
  \hat{\alpha}_t(i) = c_t\, \ddot{\alpha}_t(i), \qquad \hat{\beta}_t(i) = c_t\, \beta_t(i)

3.1 Scaling
- Each \hat{\alpha}_t(i) = \Big(\prod_{s=1}^{t} c_s\Big) \alpha_t(i) = C_t\, \alpha_t(i)
- Each \hat{\beta}_t(j) = \Big(\prod_{s=t}^{T} c_s\Big) \beta_t(j) = D_t\, \beta_t(j)
- So, in terms of the scaled variables, we get
  \bar{a}_{ij} = \frac{\sum_{t=1}^{T-1} C_t\, \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, D_{t+1}\, \beta_{t+1}(j)}{\sum_{t=1}^{T-1} \sum_{j=1}^{N} C_t\, \alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, D_{t+1}\, \beta_{t+1}(j)}
- Finally, the term C_t D_{t+1} can be seen to be of the form
  C_t D_{t+1} = \prod_{s=1}^{t} c_s \prod_{s=t+1}^{T} c_s = \prod_{s=1}^{T} c_s = C_T
  which is independent of t and therefore cancels out of the ratio.

3.1 Scaling
- The only real change to the HMM procedure because of scaling is the procedure for computing P(O \mid \lambda). We cannot merely sum up the \hat{\alpha}_T(i) terms, because these are scaled already. However, we can use the property that
  \prod_{t=1}^{T} c_t \sum_{i=1}^{N} \alpha_T(i) = C_T \sum_{i=1}^{N} \alpha_T(i) = 1
- Thus we have
  \prod_{t=1}^{T} c_t \cdot P(O \mid \lambda) = 1
  or
  P(O \mid \lambda) = \frac{1}{\prod_{t=1}^{T} c_t}
  or
  \log P(O \mid \lambda) = -\sum_{t=1}^{T} \log c_t
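The scaled forward pass in code, following the c_t definition above; `forward_scaled` is our own helper name, and the toy A, B, pi, obs from the earlier sketches are reused.

```python
import numpy as np

def forward_scaled(A, B, pi, obs):
    """Scaled forward pass: returns the scaled alphas and log P(O | lambda) = -sum_t log c_t."""
    T, N = len(obs), A.shape[0]
    alpha_hat = np.zeros((T, N))
    c = np.zeros(T)
    a = pi * B[:, obs[0]]
    c[0] = 1.0 / a.sum()                      # c_t = 1 / sum_i alpha_t(i)
    alpha_hat[0] = c[0] * a
    for t in range(1, T):
        a = (alpha_hat[t - 1] @ A) * B[:, obs[t]]
        c[t] = 1.0 / a.sum()
        alpha_hat[t] = c[t] * a
    return alpha_hat, -np.log(c).sum()        # log P(O | lambda) = -sum_t log c_t

_, loglik = forward_scaled(A, B, pi, obs)
print("log P(O | lambda):", loglik)
```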
3.2 Multiple Observation Sequences
- The major problem with left-right models is that one cannot use a single observation sequence to train the model, because the transient nature of the states within the model allows only a small number of observations for any state.
- Hence, to have sufficient data to make reliable estimates of all model parameters, one has to use multiple observation sequences:
  P(\mathbf{O} \mid \lambda) = \prod_{k=1}^{K} P(O^{(k)} \mid \lambda) = \prod_{k=1}^{K} P_k
- The reestimation formula for a_{ij} then becomes
  \bar{a}_{ij} = \frac{\sum_{k=1}^{K} \frac{1}{P_k} \sum_{t=1}^{T_k - 1} \alpha_t^{k}(i)\, a_{ij}\, b_j\big(o_{t+1}^{(k)}\big)\, \beta_{t+1}^{k}(j)}{\sum_{k=1}^{K} \frac{1}{P_k} \sum_{t=1}^{T_k - 1} \alpha_t^{k}(i)\, \beta_t^{k}(i)}
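Because the K sequences are assumed independent, the total log-likelihood is just a sum of per-sequence scaled-forward scores; the sequences below are illustrative, and `forward_scaled` is the sketch from above.

```python
# Total log-likelihood over K independent observation sequences:
# log P(O | lambda) = sum_k log P(O^(k) | lambda), using forward_scaled from above.
sequences = [[0, 1, 1, 2], [2, 2, 0], [1, 0, 2, 2, 1]]   # illustrative symbol sequences
total_loglik = sum(forward_scaled(A, B, pi, o)[1] for o in sequences)
print("log P over all K sequences:", total_loglik)
```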
3.3 Initial Estimates of HMM Parameters
- How do we choose initial estimates of the HMM parameters so that the local maximum is equal to, or as close as possible to, the global maximum of the likelihood function?
- Experience has shown that either random or uniform initial estimates of the \pi and A parameters are adequate for giving useful reestimates of these parameters in almost all cases.
- However, for the B parameters, experience has shown that good initial estimates are helpful in the discrete-symbol case and essential in the continuous-distribution case.
4. HMM System for Isolated Word Recognition
1. Choice of model parameters
2. Segmental k-means segmentation with clustering
3. Incorporation of state duration into the HMM
4. HMM isolated-digit performance

4.1 HMM Recognizer of Isolated Words
To do isolated word speech recognition, we must perform the following:
1. For each word v in the vocabulary, we must build an HMM \lambda^v; that is, we must estimate the model parameters (A, B, \pi) that optimize the likelihood of the training-set observation vectors for the v-th word.
2. For each unknown word to be recognized, the processing shown in Figure 4.1 must be carried out, namely: measurement of the observation sequence O = (o_1, o_2, \dots, o_T) via a feature analysis of the speech corresponding to the word; followed by calculation of the model likelihoods for all possible models, P(O \mid \lambda^v), 1 \le v \le V; followed by selection of the word whose model likelihood is highest, that is,
  v^* = \arg\max_{1 \le v \le V} P(O \mid \lambda^v)

[Figure: block diagram of an isolated word HMM recognizer]
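Step 2 reduces to an argmax over per-word model scores. This sketch uses the scaled forward pass from the earlier sketch as the scorer; the two "word" models are placeholders (the toy model and a row-permuted copy), not trained parameters.

```python
# Word scoring for isolated-word recognition: pick the word whose model gives the
# highest (log-)likelihood. `models` maps each word v to a parameter set (A_v, B_v, pi_v);
# forward_scaled is the scaled forward pass sketched above.
def recognize(obs, models):
    scores = {v: forward_scaled(Av, Bv, piv, obs)[1] for v, (Av, Bv, piv) in models.items()}
    return max(scores, key=scores.get), scores

# Example with two toy "word" models (placeholder parameters, not trained values):
models = {
    "zero": (A, B, pi),                      # toy model from the sketches above
    "one":  (A[::-1], B[::-1], pi[::-1]),    # a second placeholder model (rows permuted)
}
word, scores = recognize(obs, models)
print("recognized word:", word)
```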
4.2 Choice of Model Parameters
- The figure shows a plot of average word error rate versus N, for the case of recognition of isolated digits. It can be seen that the error is somewhat insensitive to N, achieving a local minimum at N = 6; however, differences in error rate for values of N close to 6 are small.

[Figure: average word error rate (for a digits vocabulary) versus the number of states N in the HMM (after Rabiner et al. [18])]

4.2 Choice of Model Parameters
- The figure shows a comparison of the marginal distributions b_j(o) against a histogram of the actual observations within a state. The observation vectors are ninth order, and the model density uses M = 5 mixtures. The covariance matrices are constrained to be diagonal for each individual mixture. The results of the figure are for the first model state of the word "zero."
4.2 Choice of Model Parameters
- The figure shows a curve of average word error rate versus the parameter \varepsilon (on a log scale) for a standard word-recognition experiment. It can be seen that over a very broad range (10^{-10} \le \varepsilon \le 10^{-3}) the average error rate remains at about a constant value; however, when \varepsilon is set to 0, the error rate increases sharply.
- Similarly, for continuous densities it is important to constrain the mixture gains c_{jm} as well as the diagonal covariance coefficients U_{jm}(r, r) to be greater than or equal to some minimum values.

- The figure (next page) shows a log-energy plot, an accumulated log-likelihood plot, and a state segmentation for one occurrence of the word "six." The states correspond