
Speech Processing 15-492/18-492: Using Speech with Computers, Alan W Black


  1. Speech Processing 15-492/18-492: Using Speech with Computers. Alan W Black, August 2008

  2. Overview
     - Practical and theory: understand concepts, implement solutions
     - Speech recognition: speech to text
     - Speech synthesis: text to speech
     - Spoken dialog systems: interaction with machines

  3. Course Schedule
     - MWF 3:30-4:20
     - DH 1117
     - Lecturer: Alan W Black (awb@cs.cmu.edu)
     - TA: David Huggins (dhuggins@cs.cmu.edu)
     - http://www.speech.cs.cmu.edu/15-492/

  4. Course Details
     - Three lectures a week
     - 4 homeworks
        - Speech recognition
        - Speech synthesis
        - Spoken dialog system
        - Other
     - Final exam

  5. Homeworks
     - (Mostly) practical
     - Build something that talks or can be spoken to
     - Software and speech data will be provided
        - Will run on Windows, Linux or OS X
        - Access to Linux servers if required
     - Written description of what you did

  6. Schedule Details
     - Week 1 (Aug 15th): Applications, human and computer speech processing
     - Weeks 2-4 (Sep 3rd): Speech recognition
        - Signal representation, acoustic modeling
        - Language modeling, applications
        - Tuning, evaluation, expectations

  7. Course Details
     - Weeks 5-7 (22nd Sep): Speech synthesis
        - Text processing, prosody, waveform synthesis
        - Building voices, evaluations, voice conversion
     - Week 8 (13th Oct): Multilinguality
        - Supporting new languages efficiently
     - Weeks 9-11 (20th Oct): Dialog systems
        - VoiceXML, mixed initiative, barge-in
        - Design, installation and tuning

  8. Course Details
     - Week 12 (10th Nov): Speech-to-speech translation
        - Language support, tight integration
     - Week 13 (17th Nov): Evaluation and expectations
     - Week 14 (24th)
        - Speaker ID, silent speech, conversion
        - What still needs to be done
     - Week 15 (1st Dec): Exam

  9. Why Speech
     - Most natural way to communicate (for humans)
     - Not ideal for everything
        - Graphics and text can be better (sometimes)
     - Doesn't compress well
     - Hard to search

  10. Compression
     - Alice in Wonderland
        - Text: 150K uncompressed, 43K compressed
        - Speech (2 hrs 20 mins): 270M uncompressed, 600K compressed (mp3, 24KBS)
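The 270M figure for the uncompressed speech is consistent with raw 16 kHz, 16-bit, mono audio. The slide does not state the recording format, so that format is an assumption here, and `pcm_bytes` is a hypothetical helper written only to check the arithmetic. The sketch also compares the slide's own text and speech sizes.

```python
# Hypothetical helper: size in bytes of raw (uncompressed) PCM audio.
# ASSUMPTION: 16 kHz sampling, 16-bit (2-byte) samples, mono; the slide does not say.
def pcm_bytes(seconds: int, sample_rate: int = 16_000,
              bytes_per_sample: int = 2, channels: int = 1) -> int:
    return seconds * sample_rate * bytes_per_sample * channels


duration_s = 2 * 3600 + 20 * 60   # 2 hrs 20 mins = 8400 seconds
raw = pcm_bytes(duration_s)
print(raw)                         # 268800000 bytes, i.e. roughly the slide's 270M

# Ratios computed from the slide's own figures (in K)
text_ratio = round(150 / 43, 1)    # text compresses about 3.5x
speech_vs_text = round(600 / 43)   # compressed speech is still about 14x the compressed text
print(text_ratio, speech_vs_text)
```

Even after mp3 compression, the spoken version of the book stays more than an order of magnitude larger than the compressed text, which is the point the "doesn't compress well" bullet makes.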

  11. Searching
     - Find all NPR broadcasts mentioning Obama
        - Listen to them all
     - From lecture recordings
        - Find all occurrences of "this will be in the exam"
     - So listen to it faster …
        - Normal, 2x speed
        - 2x, 4x, 8x

  12. Eyes/Hands Free
     - Interaction when driving
        - Look at screen to see next turnoff
        - "In 200 yards turn right onto Murray Ave."
     - Blind users / assistive technology
        - Text isn't very useful
     - Alerts
        - "Will self-destruct in 10 seconds" vs. a blinking light
     - Telephone dialog systems

  13. Speech Applications
     - Command and control
     - Information agents
     - Speech-to-speech translation
     - Speech summarization
     - Lecture or meeting summarization
     - Transcription/dictation
     - Speaker identification
     - Emotion/dialect/language identification
     - Language learning

  14. "Hot" Commercial Applications
     - Location-based services:
        - Yahoo GO
        - Google Maps
        - Microsoft Live Search
     - All phone/PDA based
        - Use speech-in
        - Directions speech-out

  15. Other Speech Uses
     - Spoken dialog systems
        - Let's Go Public: 412 268 3526 (evenings: 412 442 2000)
        - Pittsburgh bus timetables by phone
     - Assistive technologies
        - Screen readers
        - Augmentative and assistive communication devices
     - On-line personalization
        - Blogcasts (your voice, or an appropriate voice)
        - Game character customization
     - Talking heads
        - CMU's roboceptionist
     - Singing synthesis
        - XML interface for song specification
