Articulatory inversion of American English /ɹ/ by conditional density modes
Chao Qin and Miguel Á. Carreira-Perpiñán
Electrical Engineering and Computer Science, University of California, Merced
http://eecs.ucmerced.edu
Articulatory inversion
The problem of recovering the sequence of vocal tract shapes that produced a given acoustic utterance (we consider the instantaneous inverse mapping). It is difficult because the inverse mapping is multivalued (and nonlinear).
[Diagram: articulatory vector x and acoustic vector y, linked by the forward mapping f: x → y and the inverse mapping g: y → x, from articulatory configurations to the acoustic signal.]
Applications: speech coding, ASR, visualisation, speech therapy, etc.
The evidence for nonuniqueness
Indirect evidence from articulatory models (Atal et al. 1978), though not realistic, and from bite-block experiments (Lindblom et al. 1979), though not natural speech. Direct evidence based on real data for normal speech:
❖ At least one well-documented case (e.g. Espy-Wilson et al. 2000): the American English /ɹ/, produced with both retroflex and bunched tongue shapes.
❖ A large-scale statistical study of vocal-tract shapes for normal speech (American English) in the Wisconsin X-ray microbeam database (Qin and Carreira-Perpiñán 2007, 2009): ∼15% of the acoustic frames show multiple articulations, including /ɹ/ and other sounds.
⇒ The nonuniqueness is definitely there, but not very frequent. Since previous work on articulatory inversion has used data containing little nonuniqueness, the differences between algorithms are small.
The evidence for nonuniqueness (cont.)
[Figure: spectral envelopes (dB/Hz) and midsagittal vocal-tract shapes (mm), showing pairs of distinct articulations with near-identical acoustics for /ɹ/, /l/, /w/, /æ/, /u:/ and /y/.]
From XRMB: /ɹ/, tp009 “row”; /l/, tp037 “long”; /w/, tp044 “work”; /æ/, tp001 “has”; /u:/, tp001 “school”; /y/, tp040 “you”.
Representative approaches to articulatory inversion
❖ Analysis-by-synthesis inverts a nonlinear forward mapping f (assumed univalued) from articulators x to acoustics y by minimising the acoustic error ‖y − f(x)‖. It is slow and returns invalid shapes unless the initial x is close to the solution.
❖ Other methods directly learn a nonlinear inverse mapping g: y → x: neural nets (Shirai and Kobayashi 1991), MDNs (Richmond et al. 2003), etc. With multiple inverses, this results in their average, which can be an invalid shape.
❖ The distal teacher approach (Jordan and Rumelhart 1992) learns a valid inverse, but only one of them.
❖ Other methods incorporate some sequential information, e.g. by using a time-delay neural net or trajectory smoothing (Toda et al. 2008).
❖ Few methods have directly attempted to represent the multivalued inverse explicitly: the codebook method (Schroeter and Sondhi 1994) and the conditional modes method (Carreira-Perpiñán 2000, and this work).
The problem with univalued mappings
❖ The best univalued mapping g: y → x in the least-squares sense, min_g E_{p(x,y)}{‖x − g(y)‖²}, is the conditional mean g(y) = E{x | y}. But the mean of two valid inverses may not itself be a valid inverse.
❖ A univalued mapping will learn the conditional mean or possibly a single inverse, but cannot learn more than one inverse.
❖ We can define a multivalued mapping by the modes of the conditional distribution p(x | y). The number of inverses (modes) g(y) = f⁻¹(y) naturally varies as a function of y.
[Diagram: the joint density p(x, y) and the conditional density p(x | y = /ɹ/), whose retroflex and bunched modes lie on either side of the conditional mean.]
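The conditional-mean failure can be seen on a toy example. Below is a minimal numerical sketch (not the paper's data): p(x | y) is a two-component Gaussian mixture standing in for the "retroflex" and "bunched" inverses; the conditional mean lands between them, while mean-shift iterations recover both modes.

```python
import numpy as np

# Toy bimodal conditional density p(x|y): two equally likely articulations
# ("retroflex" at x = -2, "bunched" at x = +2), both valid inverses of one y.
centers = np.array([-2.0, 2.0])
weights = np.array([0.5, 0.5])
sigma = 0.5  # kernel width

# The conditional mean averages the two valid inverses -> an invalid shape.
cond_mean = float(np.sum(weights * centers))   # = 0.0, far from both modes

def mean_shift(x, iters=100):
    """Gaussian mean-shift: x <- sum_m r_m(x) mu_m / sum_m r_m(x), where
    r_m are the (unnormalised) posterior responsibilities. Each fixed
    point is a mode of the mixture."""
    for _ in range(iters):
        r = weights * np.exp(-0.5 * ((x - centers) / sigma) ** 2)
        x = np.sum(r * centers) / np.sum(r)
    return x

# Starting mean-shift from each component centre finds both modes.
modes = sorted({round(mean_shift(x0), 3) for x0 in centers})
print(cond_mean)  # 0.0
print(modes)      # [-2.0, 2.0]
```

With the components this well separated the modes sit essentially at the centres; as they move closer the mixture eventually becomes unimodal, which is why the number of recovered inverses varies with y.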
Inversion by conditional density modes (Carreira-Perpiñán 2000)
❖ Offline, given a training set of articulatory-acoustic vector pairs {(x_n, y_n)}, learn a conditional density p(x | y): Gaussian mixture, kernel density estimate, mixture of experts (e.g. MDN), etc.
❖ At runtime, given a test acoustic sequence y_1, …, y_T:
✦ For each frame y_t, find all possible inverses x_t^1, …, x_t^{K_t}: find all modes of the conditional density p(x | y_t), using e.g. the mean-shift algorithm (Carreira-Perpiñán 2000). This is the computational bottleneck.
✦ Find a unique vocal-tract shape sequence x_1, …, x_T by minimising over the set of modes at all frames the objective
E(x_1, …, x_T) = Σ_{t=1}^{T−1} ‖x_{t+1} − x_t‖ (continuity) + λ Σ_{t=1}^{T} ‖y_t − f(x_t)‖ (validity).
Exact solution by dynamic programming in O(Tν²) (ν = average number of modes per frame).
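The dynamic-programming step above can be sketched as follows. This is a minimal Python illustration using only the continuity term (as in the `dpmode` variant later in the slides); adding the validity term would just add λ‖y_t − f(x_t)‖ to each node's cost, given a forward model f.

```python
import numpy as np

def dp_select(modes_per_frame):
    """Pick one mode per frame minimising sum_t ||x_{t+1} - x_t|| by
    dynamic programming (Viterbi-style), O(T * nu^2) for nu modes/frame.
    modes_per_frame[t] is a (K_t, D) array of conditional-density modes."""
    T = len(modes_per_frame)
    cost = [np.zeros(len(m)) for m in modes_per_frame]
    back = [np.zeros(len(m), dtype=int) for m in modes_per_frame]
    for t in range(1, T):
        # transition cost from every mode at t-1 to every mode at t
        d = np.linalg.norm(modes_per_frame[t][:, None, :] -
                           modes_per_frame[t - 1][None, :, :], axis=2)
        total = cost[t - 1][None, :] + d       # total[k_t, k_{t-1}]
        back[t] = np.argmin(total, axis=1)     # best predecessor per mode
        cost[t] = np.min(total, axis=1)
    # backtrack the optimal path from the cheapest final mode
    path = [int(np.argmin(cost[-1]))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t][path[-1]]))
    path.reverse()
    return [modes_per_frame[t][k] for t, k in enumerate(path)]

# Example: 3 frames, 2 candidate modes each (1D articulators for brevity).
# The smooth path stays on the branch near 0 rather than jumping to 5.
frames = [np.array([[0.0], [5.0]]),
          np.array([[0.2], [4.8]]),
          np.array([[0.1], [5.1]])]
traj = dp_select(frames)
print([float(x[0]) for x in traj])  # [0.0, 0.2, 0.1]
```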
Inversion by conditional density modes (cont.)
In the codebook method (Schroeter and Sondhi 1994):
❖ A large (10⁵+ entries) table of articulatory-acoustic pairs (x_n, y_n) is constructed by sampling an articulatory model and possibly doing vector quantisation.
❖ It is difficult to remove unrealistic/atypical shapes and to achieve a uniform, comprehensive sampling (nonlinear manifold).
❖ The dynamic programming search is very slow.
❖ The resulting trajectory shows discretisation artifacts.
In the conditional density modes method:
❖ The joint density p(x, y) replaces the codebook and is far smaller in memory.
❖ By learning it from real articulatory data of normal speech we ensure it represents feasible and typical shapes.
❖ The mode finding is slow (but it can be accelerated); the dynamic programming is very fast.
❖ The resulting trajectory is smooth.
A dataset of American English /ɹ/ sequences
❖ To test the ability of inversion methods to deal with nonuniqueness, we create a dataset of several sequences of American English /ɹ/, extracted by hand from the Wisconsin X-ray microbeam database (XRMB) (Westbury 1994).
❖ The XRMB simultaneously records the audio and the 2D positions of several pellets on the tongue, lips, etc. in the midsagittal plane.
❖ This gives an incomplete representation of the vocal tract (no data beyond the velum).
[Figure: midsagittal pellet layout (X and Y in mm) showing the palate outline, pharyngeal wall, tongue pellets T1-T4, upper and lower lips (UL, LL), and mandible pellets (MNI, MNM).]
A dataset of American English /ɹ/ sequences (cont.)
❖ Speaker jw11.
❖ American English /ɹ/: 402 training frames and 6 test trajectories (3 retroflex, e.g. “right”, “roll”, + 3 bunched, e.g. “rag”, “row”), chosen manually.
❖ Acoustic features: 20th-order LPC.
[Figure: LPC spectral envelopes (amplitude in dB vs frequency in Hz) and pellet positions for the training and testing data.]
Experimental setup: methods
❖ Joint articulatory-acoustic density p(x, y): Gaussian kernel density estimate with σ = 11 mm on the training set, from which we derive the conditional density p(x | y_t) of articulators given acoustics.
✦ mean: the conditional mean.
✦ dpmode: the conditional modes picked by the dynamic programming (we used only the continuity constraint).
✦ mode: the modes closest to the ground-truth articulator sequence x_1, …, x_T (oracle).
❖ rbf: a radial basis function network that learns the inverse mapping g (asymptotically equivalent to mean).
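A convenient property of the Gaussian KDE used here is that conditioning is closed-form: with an isotropic kernel on the joint pairs, p(x | y_t) is again a Gaussian mixture whose components sit at the training x_n, reweighted by how close each y_n is to y_t. The sketch below illustrates this on synthetic data (hypothetical toy values, not the XRMB features or the σ = 11 mm bandwidth), and shows the `mean` baseline falling between two articulatory branches.

```python
import numpy as np

def conditional_weights(Y, y_t, sigma):
    """Mixture weights of p(x | y_t) for a Gaussian KDE with isotropic
    bandwidth sigma on joint pairs (x_n, y_n): component n is centred at
    x_n with weight proportional to exp(-||y_t - y_n||^2 / (2 sigma^2))."""
    d2 = np.sum((Y - y_t) ** 2, axis=1)
    w = np.exp(-0.5 * d2 / sigma ** 2)
    return w / w.sum()

# Toy data: two articulations (x near -1 and +1) sharing one acoustic region.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-1, 0.05, (50, 1)),   # "retroflex" branch
               rng.normal(+1, 0.05, (50, 1))])  # "bunched" branch
Y = rng.normal(0, 0.05, (100, 1))               # near-identical acoustics

w = conditional_weights(Y, np.array([0.0]), sigma=0.2)

# The 'mean' method: E{x | y_t} = sum_n w_n x_n -- it averages the two
# branches and lands between them, an articulation neither branch uses.
cond_mean = float(w @ X[:, 0])
print(abs(cond_mean) < 0.5)   # True: near 0, between the two valid inverses
```

Running mean-shift on this reweighted mixture (instead of taking its mean) would return the two branch centres separately, which is what `dpmode` feeds to the dynamic programming.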
Reconstructed sequence for retroflex /ɹ/ “roll” (tp096)
[Figure, top: per-frame conditional density modes, conditional mean, and ground truth for frames F1-F9, with the number of modes per frame (5, 5, 5, 5, 4, 4, 3, 3, 3).]
[Figure, bottom: reconstructions by dpmode, mean, and rbf against the ground truth.]