Speech Processing 15-492/18-492
Speech Synthesis: Talking Heads, Singing Synthesis

More Information is Better
- Voice + text is easier to understand
- Voice + face is easier too

Talking Heads
- Adds novelty/character/personification
- Experiments show better understanding
  - Lip synching
  - Facial movements
- Listeners swear it's better synthesis

Talking heads

Talking Heads
- Synthesize text
  - Output phone position in audio stream
- Map phones to lip/tongue positions
- Build visual stream
  - Choose appropriate frames
  - Aligned with audio
- How many facial positions?
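The steps on this slide (phone timings from the synthesizer, a phone-to-mouth-shape map, then frame selection aligned with the audio) can be sketched roughly as follows. This is a minimal illustration, not the method of any particular talking head: the `Phone` record, the `PHONE_TO_VISEME` table, the three shape names, and the 25 fps frame rate are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class Phone:
    name: str     # phone label as output by the synthesizer
    start: float  # start time in the audio stream, in seconds
    end: float    # end time in the audio stream, in seconds


# Hypothetical phone-to-viseme table using three mouth shapes.
PHONE_TO_VISEME = {
    "p": "closed", "b": "closed", "m": "closed", "pau": "closed",
    "aa": "open", "ae": "open", "iy": "open",
    "uw": "rounded", "ow": "rounded", "w": "rounded",
}


def visemes_for_frames(phones, fps=25.0, default="open"):
    """Pick one viseme per video frame, aligned with the audio.

    Assumes the phones are sorted by time and contiguous. Each frame
    takes the viseme of the phone that covers the frame's midpoint.
    """
    if not phones:
        return []
    n_frames = int(round(phones[-1].end * fps))
    frames = []
    i = 0
    for f in range(n_frames):
        t = (f + 0.5) / fps  # midpoint of frame f, in seconds
        while i < len(phones) - 1 and t >= phones[i].end:
            i += 1
        frames.append(PHONE_TO_VISEME.get(phones[i].name, default))
    return frames


phones = [Phone("m", 0.00, 0.08), Phone("aa", 0.08, 0.20), Phone("uw", 0.20, 0.32)]
print(visemes_for_frames(phones))
```

Switching shape exactly at phone boundaries is the simplest possible choice; it ignores the question of how many facial positions are needed and how smoothly to move between them.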

Visemes
- Baphy: three positions
  - Closed, open and rounded
- Rho
  - 10 lip positions
  - Eyelid: 4
  - Eyes: 2
- When should they align?
  - Follow trajectories, not just at time instant
  - Shape for syllables, not just phones
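The point about following trajectories rather than setting a shape at one time instant can be illustrated with simple keyframe interpolation. This is a sketch under stated assumptions: the single "mouth opening" parameter (0.0 = closed, 1.0 = fully open), the keyframe times, and the linear interpolation are illustrative choices, not taken from Baphy or Rho.

```python
def value_at(keyframes, t):
    """Linearly interpolate a facial parameter at time t.

    keyframes is a list of (time_sec, value) pairs with strictly
    increasing times. Times outside the keyframe range are clamped
    to the first or last value.
    """
    if t <= keyframes[0][0]:
        return keyframes[0][1]
    for (t0, v0), (t1, v1) in zip(keyframes, keyframes[1:]):
        if t <= t1:
            w = (t - t0) / (t1 - t0)  # 0 at t0, 1 at t1
            return v0 + w * (v1 - v0)
    return keyframes[-1][1]


def sample_trajectory(keyframes, duration, fps=25.0):
    """Sample the interpolated parameter once per video frame."""
    n_frames = int(round(duration * fps))
    return [value_at(keyframes, f / fps) for f in range(n_frames)]


# Hypothetical targets: closed lips, open vowel at 0.1 s, closed again at 0.2 s.
opening = [(0.0, 0.0), (0.1, 1.0), (0.2, 0.0)]
print(sample_trajectory(opening, duration=0.2))
```

With trajectories, the frames between two targets show intermediate mouth shapes instead of jumping from one viseme to the next.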

Synthesis Analogies
- Articulatory synthesis
  - Modeling the vocal tract
  - Baldi: movement of muscles
- Formant
  - Modeling of signal synthetically
  - Cartoon-based faces (Baphy)
- Concatenative
  - Joining natural segments
  - JPL example
  - Interval's Video Rewrite
- Unit size
  - Baphy == uniphone
  - JPL == diphone
  - Video Rewrite == unit selection

Talking Heads
- Personalization
  - Can look like a mask put on a dummy
- Uncanny valley
  - The more human-like, the more critical we are
- 3-D movement (in real time)
  - Second Life-type characters
  - Gesture generation too
- Off-line
  - (Gollum, Jabba the Hutt)
  - Usually actors do the voices

Singing Synthesis
- Simple pitch and duration control
  - But singing is more than that
- Proper singing synthesis
  - Recording a singing database
    - Phonetic, prosodic, and singing style coverage
  - Sung rather than spoken voice

Flinger (Festival Singer) (Macon)
- Sinusoidal modeling
  - More pitch control than just PSOLA
- MIDI interface
  - Allow mixing with music
  - Standard MIDI authoring techniques

Festival Singing Mode
- Dominic Mazzoni (11-752 project 2001)
- XML based song description
    <DURATION BEATS="1.0">
    <PITCH NOTE="C4">Oh</PITCH>
    </DURATION>
- But not just setting pitch at duration point
  - When do you move it (based on syllable and voicing)
  - How quickly do you move pitch
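To make the pitch questions on this slide concrete, here is a minimal sketch that converts a note name such as C4 to a target frequency and generates a gradual glide between two notes instead of an instant jump. The equal-temperament tuning with A4 = 440 Hz, the function names, and the glide that is linear in log frequency are assumptions for illustration; the slide does not say how Festival's singing mode actually moves pitch.

```python
import re

# Semitone offsets of the natural notes within an octave.
SEMITONES = {"C": 0, "D": 2, "E": 4, "F": 5, "G": 7, "A": 9, "B": 11}


def note_to_hz(note):
    """Convert a note name like 'C4' or 'F#4' to Hz.

    Assumes equal temperament with A4 = 440 Hz.
    """
    m = re.fullmatch(r"([A-G])([#b]?)(-?\d+)", note)
    if m is None:
        raise ValueError(f"unrecognized note name: {note!r}")
    letter, accidental, octave = m.groups()
    semitone = SEMITONES[letter] + {"#": 1, "b": -1, "": 0}[accidental]
    midi = 12 * (int(octave) + 1) + semitone  # A4 -> 69
    return 440.0 * 2.0 ** ((midi - 69) / 12.0)


def glide(f_start, f_end, n_steps):
    """Return n_steps + 1 pitch targets moving from f_start to f_end.

    Interpolates linearly in log frequency, so each step covers the
    same musical interval.
    """
    ratio = f_end / f_start
    return [f_start * ratio ** (i / n_steps) for i in range(n_steps + 1)]


print(note_to_hz("C4"))
print(glide(note_to_hz("A3"), note_to_hz("A4"), 4))
```

In this sketch, when the glide starts and how many steps it takes are free parameters; the slide suggests tying the first to syllable and voicing boundaries.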

Singing Example
<?xml version="1.0"?>
<!DOCTYPE SINGING PUBLIC "-//SINGING//DTD SINGING mark up//EN"
      "Singing.v0_1.dtd"
[]>
<SINGING BPM="30">
<PITCH NOTE="G3"><DURATION BEATS="0.3">doe</DURATION></PITCH>
<PITCH NOTE="A3"><DURATION BEATS="0.3">ray</DURATION></PITCH>
<PITCH NOTE="B3"><DURATION BEATS="0.3">me</DURATION></PITCH>
<PITCH NOTE="C4"><DURATION BEATS="0.3">fah</DURATION></PITCH>
<PITCH NOTE="D4"><DURATION BEATS="0.3">sew</DURATION></PITCH>
<PITCH NOTE="E4"><DURATION BEATS="0.3">lah</DURATION></PITCH>
<PITCH NOTE="F#4"><DURATION BEATS="0.3">tee</DURATION></PITCH>
<PITCH NOTE="G4"><DURATION BEATS="0.3">doe</DURATION></PITCH>
</SINGING>
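As a rough illustration of how such a song description might be read, the sketch below parses the PITCH and DURATION elements with Python's standard XML parser and converts beats to seconds using the BPM attribute. It is not Festival's own parser; the `parse_song` function is hypothetical, and the embedded sample omits the XML declaration and DOCTYPE and keeps only the first three notes for brevity.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the slide's example (prolog and DOCTYPE omitted).
SONG = """<SINGING BPM="30">
<PITCH NOTE="G3"><DURATION BEATS="0.3">doe</DURATION></PITCH>
<PITCH NOTE="A3"><DURATION BEATS="0.3">ray</DURATION></PITCH>
<PITCH NOTE="B3"><DURATION BEATS="0.3">me</DURATION></PITCH>
</SINGING>"""


def parse_song(xml_text):
    """Return a list of (note, lyric, seconds) tuples.

    Assumes each PITCH element wraps one or more DURATION elements,
    as in the example, and that BPM is set on the root element.
    """
    root = ET.fromstring(xml_text)
    seconds_per_beat = 60.0 / float(root.get("BPM"))
    events = []
    for pitch in root.iter("PITCH"):
        for dur in pitch.iter("DURATION"):
            seconds = float(dur.get("BEATS")) * seconds_per_beat
            lyric = (dur.text or "").strip()
            events.append((pitch.get("NOTE"), lyric, seconds))
    return events


for note, lyric, seconds in parse_song(SONG):
    print(f"{lyric:>4}  {note:<4} {seconds:.2f} s")
```

At 30 beats per minute one beat lasts two seconds, so each 0.3-beat note in the example is held for about 0.6 seconds.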

Future in TTS
- More natural voices
  - Sound human
  - Interact in a human way (not just words)
- More personalization
  - Sound like a particular person
  - Cross-lingual synthesis
- More flexible
  - Say it with more feeling
  - Real-time voice transformation
  - Have an American accent while you speak

Text to speech process
- Text analysis
  - From characters to words
- Linguistic analysis
  - From words to pronunciations
- Waveform analysis
  - From pronunciations to noises
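The first two stages of this process can be sketched as a toy pipeline. Everything here is invented for illustration: the tiny number table, the three-word lexicon, the phone labels, and the letter-by-letter fallback are assumptions, and the waveform stage is left out because it needs a recorded voice or a signal model.

```python
import re

# Toy data, invented for illustration only.
NUMBER_WORDS = {"1": "one", "2": "two", "3": "three"}
LEXICON = {
    "one": ["w", "ah", "n"],
    "two": ["t", "uw"],
    "cats": ["k", "ae", "t", "s"],
}


def text_analysis(text):
    """From characters to words: tokenize and expand single digits."""
    tokens = re.findall(r"[a-z]+|\d", text.lower())
    return [NUMBER_WORDS.get(tok, tok) for tok in tokens]


def linguistic_analysis(words):
    """From words to pronunciations: lexicon lookup.

    Unknown words fall back to one symbol per letter, a crude
    stand-in for letter-to-sound rules.
    """
    phones = []
    for word in words:
        phones.extend(LEXICON.get(word, list(word)))
    return phones


words = text_analysis("2 cats")
print(words)
print(linguistic_analysis(words))
```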

HW2: TTS
- Due 3:30pm Monday October 20th
- Install Festival and Festvox
- Find 10 errors in each of two different synthesizers
- Build a voice
  - A Talking Clock
  - A general voice
  - (or both)
