Digital Speech Processing
Lin-shan Lee (李琳山)

Speech Signal Processing
Major Application Areas
1. Speech Coding: Digitization and Compression
   Considerations:
   1) bit rate (bps)
   2) recovered quality
   3) computation complexity/feasibility
2. Voice-based Network Access
   User Interface, Content Analysis, User-Content Interaction
[Figure labels: x(t), LPF, x[n], Processing, x[k], 110101, Storage/transmission, Inverse Processing, x[n], output; Processing Algorithms]

Speech Signals
- Carrying linguistic knowledge and human information: characters, words, phrases, sentences, concepts, etc.
- Double levels of information: acoustic signal level / symbolic or linguistic level
- Processing and interaction of the double-level information

Speech Signal Processing: Processing of Double-Level Information
[Figure labels: Speech Signal, Sampling, Processing, Linguistic Structure, Linguistic Knowledge, Lexicon, Grammar, Algorithm, Chips or Computers. Example sentence: 今天的天氣非常好 ("The weather today is very nice"), shown as single characters and grouped into words and phrases (今天 / 的; 今天的 / 天氣非常 / 好)]

Voice-based Network Access
[Figure labels: User Interface, Content Analysis, User-Content Interaction, Internet]
- User Interface: when keyboards/mice are inadequate
- Content Analysis: helps in browsing/retrieval of multimedia content
- User-Content Interaction: all text-based interaction can be accomplished by spoken language

User Interface
- Wireless communications technologies are creating a whole variety of user terminals
  - at any time, from anywhere
  - handsets, hand-held devices, PDAs, personal notebooks, vehicular electronics, hands-free interfaces, home appliances, wearable devices
  - small in size, light in weight, ubiquitous, invisible
- Evolving towards a "Post-PC Era"
- Keyboard/mouse, most convenient for PCs, are no longer convenient: human fingers never shrink, and the application environment has changed
- Service requirements are growing exponentially
- Voice is the only interface convenient for ALL user terminals at any time, from anywhere
[Figure labels: Internet, Networks, Text Content, Multimedia Content]

Content Analysis
- Multimedia technologies are creating a new world of multimedia content
- The most attractive form of network content will be multimedia, which usually includes speech information (but probably not text)
- Multimedia content is difficult to summarize and show on the screen, and thus difficult to browse
- The speech information, if included, usually tells the subjects, topics and concepts of the multimedia content, and thus becomes the key for browsing and retrieval
- Multimedia content analysis based on speech information
[Figure: Future Integrated Networks
  Realtime Information: weather, traffic, flight schedule, stock price, sports scores
  Electronic Commerce: virtual banking, online transactions, online investments
  Knowledge Archives: digital libraries, virtual museums
  Intelligent Working Environment: email processors, intelligent agents, teleconferencing, distance learning
  Private Services: personal notebook, business databases, home appliances, network entertainment]

User-Content Interaction
- Wireless and multimedia technologies are creating an era of network access by spoken language processing
[Figure labels: voice information, Multimedia Content, Internet, voice input/output, text information]
- Network access is primarily text-based today, but
  almost all roles of text can be accomplished by speech
- User-content interaction can be accomplished by spoken and multi-modal dialogues
- Many hand-held devices with multimedia functionalities are commercially available today
- Using speech instructions to access multimedia content whose key concepts are specified by speech information
[Figure labels: Multimedia Content Analysis, Text Information Retrieval, Text Content, Voice-based Information Retrieval, Text-to-Speech Synthesis, Spoken and Multi-modal Dialogue]

Voice-based Information Retrieval
[Figure labels: Voice Instructions (example: 我想找有關紐約受到恐怖攻擊的新聞?, "I would like to find news about the terrorist attack on New York"), Text Instructions, Text Information (documents d1, d2, d3), Voice Information (documents d1, d2, d3; example: 美國總統布希今天早上..., "US President Bush this morning...")]
- Speech may become a new data type
- Both the user instructions and the network content can be in the form of speech

Spoken and Multi-modal Dialogues
- Almost all user-content interaction can be accomplished by spoken or multi-modal dialogues
- An example of a client-server computing environment
[Figure labels: Users, Wireless Networks, Internet, Dialogue Server, Input Speech, Speech Recognition and Understanding, User's Intention, Dialogue Manager, Discourse Context, Databases, Response to the user, Sentence Generation and Speech Synthesis, Output Speech]

Convergence of PSTN and Internet
- PSTN (for voice) and Internet (for data and multi-media contents) are converging
- Driving forces for the convergence
  - "anywhere, any time" of wireless services: voice provides the most convenient and natural interaction interface
  - attractive contents over the Internet: contents (human information) are why the Internet is attractive, while voice directly carries human information
- Speech-enabled access of Web-based applications
[Figure labels: handsets, telephones, PSTN, Internet, PCs, servers]

Wireless Access of Global Information
- As handset size shrinks while required functionalities grow and the user environment changes, voice interface will be useful for all different user terminals
- As more network content becomes multi-media, content analysis based on speech information will be essential
- Integration of many different technologies: information processing, networking, transmission, internet, wireless, speech processing
- Speech processing is the only major missing link in the semi-mature technology chain
[Figure labels: Web Server, Corporate Intranet, Intelligent Agent, WLAN AP, Core Network, Broadband Wireless Access, 3G Cellular Systems, EDGE/UWC-136, ATM or IP Backbone, The Internet, PSTN]

Future World of Communications and Computing
[Figure labels: Speech Processing Technologies, Wireless Technologies, Communications and Networking Technologies, Multi-media Technologies, Information Processing Technologies; Networks (satellites, radio, fiber, cable); servers; bit streams (.0110. / .1101.); Global Knowledge, Information and Services]

Outline
- Both theoretical issues and practical problems will be discussed
- Starting with fundamentals, but entering research topics gradually
Part I: Fundamental Topics
  1.0 Introduction to Digital Speech Processing
  2.0 Fundamentals of Speech Recognition
  3.0 Map of Subject Areas
  4.0 More about Hidden Markov Models
  5.0 Acoustic Modeling
  6.0 Language Modeling
  7.0 Speech Signals and Front-end Processing
  8.0 Search Algorithms for Speech Recognition
Part II: Advanced Topics
  9.0 Speaker Variabilities: Adaptation and Recognition
  10.0 Latent Semantic Analysis for Linguistic Processing
  11.0 Spoken Document Understanding and Organization
  12.0 Voice-based Information Retrieval
  13.0 Robustness for Acoustic Environment
  14.0 Some Fundamental Problem-solving Approaches
  15.0 Utterance Verification and Keyword/Key Phrase Spotting
  16.0 Spoken Dialogues
  17.0 Distributed Speech Recognition and Wireless Environment
  18.0 Some Recent Developments in NTU
  19.0 Conclusion

Outline
- Textbook: none
- Main reference books:
  1. X. Huang, A. Acero, H. Hon, "Spoken Language Processing", Prentice Hall, 2001 (松瑞)
  2. F. Jelinek, "Statistical Methods for Speech Recognition", MIT Press, 1999
  3. L. Rabiner, B. H. Juang, "Fundamentals of Speech Recognition", Prentice Hall, 1993 (民全)
  4. C. Becchetti, L. Prina Ricotti, "Speech Recognition: Theory and C++ Implementation", John Wiley and Sons, 1999 (民全)
  5. Other references will be provided in class
- Course materials: available on the web before the day of class (http://speech.ee.ntu.edu.tw)
- Suitable for: third- and fourth-year students (Electrical Engineering, Computer Science and Information Engineering)
- Course objectives: to provide students with the basic knowledge needed to enter this new field full of opportunities and challenges; to experience how mathematical models and software programs complement each other; to learn the process of entering a new field, moving from fundamentals into research; and to gain experience in absorbing unstructured knowledge
- Grading:
  Midterm Exam: 25%
  Homeworks (I)(II)(III): 15%, 5%, 15%
  Final Exam: 10%
  Term Project: 30%

1.0 Introduction: A Brief Summary of Core Technologies and Current Status

References for 1.0
1. "Speech and Language Processing over the Web", IEEE Signal Processing Magazine, May 2008
2. "Voice Access of Global Information for Broadband Wireless: Technologies of Today and Challenges of Tomorrow", Proceedings of the IEEE, Jan 2001
3. "Conversational Interfaces: Advances and Challenges", Proceedings of the IEEE, Aug 2000

Speech Recognition as a Pattern Recognition Problem
[Figure: unknown speech signal x(t) -> Feature Extraction -> feature vector sequence X -> Pattern Matching -> Decision Making -> output word W; training speech y(t) -> Feature Extraction -> Y -> Reference Patterns, used by Pattern Matching]

A Simplified Block Diagram
- Example input sentence: this is speech
- Acoustic Models: (th-ih-s-ih-z-s-p-iy-ch)
- Lexicon: (th-ih-s) -> this, (ih-z) -> is, (s-p-iy-ch) -> speech
- Language Model: (this) - (is) - (speech)
  P(this) P(is | this) P(speech | this is)
  P(w_i | w_{i-1}): bi-gram language model
  P(w_i | w_{i-1}, w_{i-2}): tri-gram language model, etc.

Basic Approach for Large Vocabulary Speech Recognition
[Figure labels: Input Speech, Front-end Signal Processing, Feature Vectors, Linguistic Decoding and Search Algorithm, Output Sentence; Speech Corpora, Acoustic Model Training, Acoustic Models; Lexical Knowledge-base, Lexicon; Text Corpora, Language Model Construction, Language Model; Grammar]

Speech Recognition Technologies, Applications and Problems
- Word Recognition: voice command/instructions
- Keyword Spotting: identifying the keywords out of a pre-defined keyword set from input voice utterances
- Large Vocabulary Continuous Speech Recognition: entering longer texts, remote dictation/automatic transcription
- Speaker Dependent/Independent/Adaptive
- Acoustic Reception/Background Noise/Channel Distortion
- Read/Spontaneous/Conversational Speech

Text-to-speech Synthesis
[Figure: Input Text -> Text Analysis and Letter-to-sound Conversion -> Prosody Generation -> Signal Processing and Concatenation -> Output Speech Signal; resources: Lexicon and Rules, Prosodic Model, Voice Unit Database]
- Transforming any input text into corresponding speech signals
- E-mail/Web page reading
- Prosodic modeling
- Basic voice units/rule-based, non-uniform units/corpus-based

Speech Understanding
- Understanding the speaker's intention rather than transcribing into word strings
- Limited domains/finite tasks
- Grammatical approaches (e.g. partial parsing)/statistical approaches (e.g. corpus-based by training)
- Semantic concepts/key phrases
[Figure labels: input utterance, Syllable Recognition, acoustic models, syllable lattice, Key Phrase Matching, phrase lexicon, phrase graph, Semantic Decoding, phrase/concept language model, concept graph, concept set, understanding results; Prob(C_i | C_{i-1}, C_{i-2}), Prob(ph_j | C_i)]
- An example
  utterance: 請幫我查一下 台灣銀行 的 電話號碼 是幾號? ("Could you please look up the phone number of the Bank of Taiwan?")
  key phrases: (查一下) - (台灣銀行) - (電話號碼), i.e. (look up) - (Bank of Taiwan) - (phone number)
  concept: (inquiry) - (target) - (phone number)

Speaker Verification
[Figure: input speech -> Feature Extraction -> Verification -> yes/no; Speaker Models]
- Verifying the speaker as claimed
- Applications requiring verification
- Text dependent/independent
- Integrated with other verification schemes

Voice-based Information Retrieval
- Speech instructions
- Speech documents (or multi-media documents including speech information)
- Indexing features/relevance evaluation
- Recall/precision rates
[Figure labels: speech instruction (example: 我想找有關新政府組成的新聞?, "I would like to find news about the formation of the new government"), text instruction, text documents (d1, d2, d3), speech documents (d1, d2, d3; example: 總統當選人陳水扁今天早上..., "President-elect Chen Shui-bian this morning...")]

Spoken Dialogue Systems
- Almost all human-network interactions can be made by spoken dialogue
- Speech understanding, speech synthesis, dialogue management
- System/user/mixed initiatives
- Reliability/efficiency, dialogue modeling/flow control
- Transaction success rate/average number of dialogue turns
[Figure labels: Users, Networks, Internet, Dialogue Server, Input Speech, Speech Recognition and Understanding, User's Intention, Dialogue Manager, Discourse Context, Databases, Response to the user, Sentence Generation and Speech Synthesis, Output Speech]

Spoken Document Understanding and Organization
- Unlike written documents, which are better structured and easier to index and browse, spoken documents are just audio signals, or a sequence of words if transcribed
  - the user can't listen to (or read carefully) each one from the beginning to the end during browsing
  - better approaches for understanding/organization of spoken documents become necessary
- Spoken Document Segmentation: automatically segmenting a spoken document into short paragraphs, each with a central topic
- Spoken Document Summarization: automatically generating a summary (in text or speech form) for each short paragraph
- Title Generation for Spoken Documents: automatically generating a title (in text or speech form) for each short paragraph
- Semantic Structuring of Spoken Documents: construction of the semantic structure of spoken documents into graphical hierarchies

Multi-lingual Functionalities
- Code-Switching Problem
  - English words/phrases inserted in spoken Chinese sentences, as an example: 人人都用Computers,家家都上Internet ("Everyone uses computers, every household is on the Internet")
  - the whole sentence switched from Chinese to English, as an example: 準備好了嗎?Let's go! ("Are you ready? Let's go!")
- Cross-language Network Information Processing
  - globalized network with multi-lingual content/users
  - cross-language network information processing with a certain input language
- Dialects/Accents
  - hundreds of Chinese dialects, as an example
  - code-switching problem: Chinese dialects mixed with Mandarin (or plus English), as an example
  - Mandarin with a variety of strong accents, as an example
- Global/Local Languages
- Language Dependent/Independent Technologies
  - Shared Acoustic Units/Integrated Linguistic Structures

Distributed Speech Recognition (DSR) and Wireless Environment
- An example partition of speech recognition processes into client/server
[Figure labels: Input Speech, Front-end Signal Processing, Feature Vectors, Linguistic Decoding and Search Algorithm, Output Sentence; Speech Corpora, Acoustic Model Training, Acoustic Models; Lexical Knowledge-base, Lexicon; Text Corpora, Language Model Construction, Language Model; Grammar; encoded feature parameters transmitted in packets]
- Client/Server Structure
[Figure labels: Clients, Network, Servers]

Distributed Speech Recognition (DSR) and Wireless Environment
- Wireless environment examples: Personal Area Networks (Bluetooth, etc.), Wireless LAN (IEEE 802.11), Cellular (GSM, GPRS, 3G), etc.
- Link Level
  - time-varying fading and noise characteristics
  - time-varying signal level and signal-to-noise ratios
  - bursty errors with much higher error rates
  - much smaller and dynamic bandwidth, much lower and changing bit rates
- Transport Level
  - TCP/IP: errors lead to retransmission and delay
  - UDP/IP: real-time/no delay, but errors lead to packet loss and packets out of sequence
[Figure: Application Level: Core Technologies; Transport Level: Transport Layer, Network Layer (IP); Link Level: Data Link Layer, Physical Layer]
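
The UDP/IP case in the transport-level list can be illustrated with a small sketch: when encoded feature parameters are sent in packets without retransmission, the server side has to put out-of-sequence packets back in order and decide what to do about lost ones before decoding. The code below is a minimal sketch under stated assumptions, not a method taken from the slides: it assumes each packet carries a sequence number and one feature vector, the helper name `reassemble_features` is hypothetical, and repeating the last received frame is only one simple concealment choice among many.

```python
from typing import Dict, List, Optional, Sequence, Tuple

FeatureVector = Tuple[float, ...]


def reassemble_features(
    packets: Sequence[Tuple[int, FeatureVector]],
    total_frames: int,
) -> Tuple[List[FeatureVector], List[int]]:
    """Reorder (seq, features) packets and conceal lost frames.

    Hypothetical helper (an assumption for illustration): a lost frame is
    filled by repeating the nearest earlier received frame, or the first
    received frame if the loss is at the very start.
    Returns the ordered frames and the list of lost sequence numbers.
    """
    received: Dict[int, FeatureVector] = {}
    for seq, features in packets:
        if 0 <= seq < total_frames:
            # Keep the first copy if a packet was duplicated in transit.
            received.setdefault(seq, features)

    if not received:
        raise ValueError("no usable packets received")

    lost = [seq for seq in range(total_frames) if seq not in received]

    first_received = received[min(received)]
    last: Optional[FeatureVector] = None
    frames: List[FeatureVector] = []
    for seq in range(total_frames):
        if seq in received:
            last = received[seq]
            frames.append(last)
        else:
            # Concealment: repeat the most recent frame we actually have.
            frames.append(last if last is not None else first_received)
    return frames, lost


if __name__ == "__main__":
    # Packets 0..4 were sent; packet 2 was lost and 3 arrived before 1.
    arrived = [(0, (0.1, 0.2)), (3, (0.7, 0.8)), (1, (0.3, 0.4)), (4, (0.9, 1.0))]
    frames, lost = reassemble_features(arrived, total_frames=5)
    print("lost:", lost)
    print("frames:", frames)
```

With TCP/IP the reordering and loss handling would be done by the transport layer at the cost of retransmission delay; the sketch shows the work that moves to the application when UDP/IP is chosen for real-time operation.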