Articulatory inversion of American English /ɹ/ by conditional density modes
Chao Qin and Miguel Á. Carreira-Perpiñán
Electrical Engineering and Computer Science, University of California, Merced
http://eecs.ucmerced.edu
Articulatory inversion
The problem of recovering the sequence of vocal tract shapes that produced a given acoustic utterance (we consider the instantaneous inverse mapping). It is difficult because the inverse mapping is multivalued (and nonlinear).
[Diagram: articulatory vector x and acoustic vector y, linked by the forward mapping f: x → y and the inverse mapping g: y → x, from articulatory configurations to the acoustic signal.]
Applications: speech coding, ASR, visualisation, speech therapy, etc.
The evidence for nonuniqueness
Indirect evidence from articulatory models (Atal et al. 1978), though not realistic, and from bite-block experiments (Lindblom et al. 1979), though not natural speech. Direct evidence based on real data for normal speech:
❖ At least one well-documented case (e.g. Espy-Wilson et al. 2000): the American English /ɹ/, produced with both retroflex and bunched tongue shapes.
❖ A large-scale statistical study of vocal-tract shapes for normal speech (American English) in the Wisconsin X-ray microbeam database (Qin and Carreira-Perpiñán 2007, 2009): ∼15% of the acoustic frames show multiple articulations, including /ɹ/ and other sounds.
⇒ The nonuniqueness is definitely there, but not very frequent. Since previous work on articulatory inversion has used data containing little nonuniqueness, the differences between algorithms are small.
The evidence for nonuniqueness (cont.)
[Figure: spectral envelopes (dB/Hz) and midsagittal vocal-tract shapes (mm), showing pairs of distinct articulations with near-identical acoustics for /ɹ/, /l/, /w/, /æ/, /u:/ and /y/.]
From XRMB: /ɹ/, tp009 “row”; /l/, tp037 “long”; /w/, tp044 “work”; /æ/, tp001 “has”; /u:/, tp001 “school”; /y/, tp040 “you”.
Representative approaches to articulatory inversion
❖ Analysis-by-synthesis inverts a nonlinear forward mapping f (assumed univalued) from articulators x to acoustics y by minimising the acoustic error ‖y − f(x)‖. It is slow and returns invalid shapes unless the initial x is close to the solution.
❖ Other methods directly learn a nonlinear inverse mapping g: y → x: neural nets (Shirai and Kobayashi 1991), MDNs (Richmond et al. 2003), etc. With multiple inverses, this results in their average, which can be an invalid shape.
❖ The distal teacher approach (Jordan and Rumelhart 1992) learns a valid inverse, but only one of them.
❖ Other methods incorporate some sequential information, e.g. by using a time-delay neural net or trajectory smoothing (Toda et al. 2008).
❖ Few methods have directly attempted to represent the multivalued inverse explicitly: the codebook method (Schroeter and Sondhi 1994) and the conditional modes method (Carreira-Perpiñán 2000, and this work).
The problem with univalued mappings
❖ The best univalued mapping g: y → x in the least-squares sense, min_g E_{p(x,y)}{‖x − g(y)‖²}, is the conditional mean g(y) = E{x | y}. But the mean of two valid inverses may not itself be a valid inverse.
❖ A univalued mapping will learn the conditional mean or possibly a single inverse, but cannot learn more than one inverse.
❖ We can define a multivalued mapping by the modes of the conditional distribution p(x | y). The number of inverses (modes) g(y) = f⁻¹(y) naturally varies as a function of y.
[Diagram: the joint density p(x, y) and the conditional density p(x | y = /ɹ/), whose retroflex and bunched modes lie on either side of the conditional mean.]
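The conditional-mean failure can be seen on a toy example. Below is a minimal numerical sketch (not the paper's data): p(x | y) is a two-component Gaussian mixture standing in for the "retroflex" and "bunched" inverses; the conditional mean lands between them, while mean-shift iterations recover both modes.

```python
import numpy as np

# Toy bimodal conditional density p(x|y): two equally likely articulations
# ("retroflex" at x = -2, "bunched" at x = +2), both valid inverses of one y.
centers = np.array([-2.0, 2.0])
weights = np.array([0.5, 0.5])
sigma = 0.5  # kernel width

# The conditional mean averages the two valid inverses -> an invalid shape.
cond_mean = float(np.sum(weights * centers))   # = 0.0, far from both modes

def mean_shift(x, iters=100):
    """Gaussian mean-shift: x <- sum_m r_m(x) mu_m / sum_m r_m(x), where
    r_m are the (unnormalised) posterior responsibilities. Each fixed
    point is a mode of the mixture."""
    for _ in range(iters):
        r = weights * np.exp(-0.5 * ((x - centers) / sigma) ** 2)
        x = np.sum(r * centers) / np.sum(r)
    return x

# Starting mean-shift from each component centre finds both modes.
modes = sorted({round(mean_shift(x0), 3) for x0 in centers})
print(cond_mean)  # 0.0
print(modes)      # [-2.0, 2.0]
```

With the components this well separated the modes sit essentially at the centres; as they move closer the mixture eventually becomes unimodal, which is why the number of recovered inverses varies with y.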
Inversion by conditional density modes (Carreira-Perpiñán 2000)
❖ Offline, given a training set of articulatory-acoustic vector pairs {(x_n, y_n)}, learn a conditional density p(x | y): Gaussian mixture, kernel density estimate, mixture of experts (e.g. MDN), etc.
❖ At runtime, given a test acoustic sequence y_1, …, y_T:
✦ For each frame y_t, find all possible inverses x_t^1, …, x_t^{K_t}: find all modes of the conditional density p(x | y_t), using e.g. the mean-shift algorithm (Carreira-Perpiñán 2000). This is the computational bottleneck.
✦ Find a unique vocal-tract shape sequence x_1, …, x_T by minimising over the set of modes at all frames the objective
E(x_1, …, x_T) = Σ_{t=1}^{T−1} ‖x_{t+1} − x_t‖ (continuity) + λ Σ_{t=1}^{T} ‖y_t − f(x_t)‖ (validity).
Exact solution by dynamic programming in O(Tν²) (ν = average number of modes per frame).
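The dynamic-programming step above can be sketched as follows. This is a minimal Python illustration using only the continuity term (as in the `dpmode` variant later in the slides); adding the validity term would just add λ‖y_t − f(x_t)‖ to each node's cost, given a forward model f.

```python
import numpy as np

def dp_select(modes_per_frame):
    """Pick one mode per frame minimising sum_t ||x_{t+1} - x_t|| by
    dynamic programming (Viterbi-style), O(T * nu^2) for nu modes/frame.
    modes_per_frame[t] is a (K_t, D) array of conditional-density modes."""
    T = len(modes_per_frame)
    cost = [np.zeros(len(m)) for m in modes_per_frame]
    back = [np.zeros(len(m), dtype=int) for m in modes_per_frame]
    for t in range(1, T):
        # transition cost from every mode at t-1 to every mode at t
        d = np.linalg.norm(modes_per_frame[t][:, None, :] -
                           modes_per_frame[t - 1][None, :, :], axis=2)
        total = cost[t - 1][None, :] + d       # total[k_t, k_{t-1}]
        back[t] = np.argmin(total, axis=1)     # best predecessor per mode
        cost[t] = np.min(total, axis=1)
    # backtrack the optimal path from the cheapest final mode
    path = [int(np.argmin(cost[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    path.reverse()
    return [modes_per_frame[t][k] for t, k in enumerate(path)]

# Example: 3 frames, 2 candidate modes each (1D articulators for brevity).
# The smooth path stays on the branch near 0 rather than jumping to 5.
frames = [np.array([[0.0], [5.0]]),
          np.array([[0.2], [4.8]]),
          np.array([[0.1], [5.1]])]
traj = dp_select(frames)
print([float(x[0]) for x in traj])  # [0.0, 0.2, 0.1]
```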
Inversion by conditional density modes (cont.)
In the codebook method (Schroeter and Sondhi 1994):
❖ A large (10⁵+ entries) table of articulatory-acoustic pairs (x_n, y_n) is constructed by sampling an articulatory model and possibly doing vector quantisation.
❖ It is difficult to remove unrealistic/atypical shapes and to achieve a uniform, comprehensive sampling (nonlinear manifold).
❖ The dynamic programming search is very slow.
❖ The resulting trajectory shows discretisation artifacts.
In the conditional density modes method:
❖ The joint density p(x, y) replaces the codebook and is far smaller in memory.
❖ By learning it from real articulatory data of normal speech we ensure it represents feasible and typical shapes.
❖ The mode finding is slow (but it can be accelerated); the dynamic programming is very fast.
❖ The resulting trajectory is smooth.
A dataset of American English /ɹ/ sequences
❖ To test the ability of inversion methods to deal with nonuniqueness, we create a dataset of several sequences of American English /ɹ/, extracted by hand from the Wisconsin X-ray microbeam database (XRMB) (Westbury 1994).
❖ The XRMB simultaneously records the audio and the 2D positions of several pellets on the tongue, lips, etc. in the midsagittal plane.
❖ This gives an incomplete representation of the vocal tract (no data beyond the velum).
[Figure: midsagittal pellet layout (X and Y in mm) showing the palate outline, pharyngeal wall, tongue pellets T1-T4, upper and lower lips (UL, LL), and mandible pellets (MNI, MNM).]
A dataset of American English /ɹ/ sequences (cont.)
❖ Speaker jw11.
❖ American English /ɹ/: 402 training frames and 6 test trajectories (3 retroflex, e.g. “right”, “roll”, + 3 bunched, e.g. “rag”, “row”), chosen manually.
❖ Acoustic features: 20th-order LPC.
[Figure: LPC spectral envelopes (amplitude in dB vs frequency in Hz) and pellet positions for the training and testing data.]
Experimental setup: methods
❖ Joint articulatory-acoustic density p(x, y): Gaussian kernel density estimate with σ = 11 mm on the training set, from which we derive the conditional density p(x | y_t) of articulators given acoustics.
✦ mean: the conditional mean.
✦ dpmode: the conditional modes picked by the dynamic programming (we used only the continuity constraint).
✦ mode: the modes closest to the ground-truth articulator sequence x_1, …, x_T (oracle).
❖ rbf: a radial basis function network that learns the inverse mapping g (asymptotically equivalent to mean).
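A convenient property of the Gaussian KDE used here is that conditioning is closed-form: with an isotropic kernel on the joint pairs, p(x | y_t) is again a Gaussian mixture whose components sit at the training x_n, reweighted by how close each y_n is to y_t. The sketch below illustrates this on synthetic data (hypothetical toy values, not the XRMB features or the σ = 11 mm bandwidth), and shows the `mean` baseline falling between two articulatory branches.

```python
import numpy as np

def conditional_weights(Y, y_t, sigma):
    """Mixture weights of p(x | y_t) for a Gaussian KDE with isotropic
    bandwidth sigma on joint pairs (x_n, y_n): component n is centred at
    x_n with weight proportional to exp(-||y_t - y_n||^2 / (2 sigma^2))."""
    d2 = np.sum((Y - y_t) ** 2, axis=1)
    w = np.exp(-0.5 * d2 / sigma ** 2)
    return w / w.sum()

# Toy data: two articulations (x near -1 and +1) sharing one acoustic region.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 0.05, (50, 1)),   # "retroflex" branch
               rng.normal(+1, 0.05, (50, 1))])  # "bunched" branch
Y = rng.normal(0, 0.05, (100, 1))               # near-identical acoustics

w = conditional_weights(Y, np.array([0.0]), sigma=0.2)

# The 'mean' method: E{x | y_t} = sum_n w_n x_n -- it averages the two
# branches and lands between them, an articulation neither branch uses.
cond_mean = float(w @ X[:, 0])
print(abs(cond_mean) < 0.5)   # True: near 0, between the two valid inverses
```

Running mean-shift on this reweighted mixture (instead of taking its mean) would return the two branch centres separately, which is what `dpmode` feeds to the dynamic programming.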
Reconstructed sequence for retroflex /ɹ/ “roll” (tp096)
[Figure, top: per-frame conditional density modes, conditional mean, and ground truth for frames F1-F9, with the number of modes per frame (5, 5, 5, 5, 4, 4, 3, 3, 3).]
[Figure, bottom: reconstructions by dpmode, mean, and rbf against the ground truth.]