A perceptual investigation of wavelet-based decomposition of f0 for - PowerPoint PPT Presentation

A perceptual investigation of wavelet-based decomposition of f0 for text-to-speech synthesis
M. Sam Ribeiro, Junichi Yamagishi, Robert A. J. Clark
School of Informatics, The University of Edinburgh
m.f.s.ribeiro@sms.ed.ac.uk
8 September 2015

Overview
• Introduction
• Motivation
• Hypotheses
• Experiments
• Experiment 2
• Experiment 3
• Discussion
• Summary
• References

Introduction
• Wavelets in Speech Processing [Farouk, 2014]
  • Annotation of prominence [Vainio et al, 2013]
  • Pre-processing step for f0 modeling [Suni et al, 2013], [Ribeiro and Clark, 2015]
  • Voice Conversion [Sanchez et al, 2014]
• Modeling of f0
  • General conclusions indicate that signal decomposition is beneficial for f0 modeling
  • But it is assumed that all components are equally relevant to the reconstructed signal
The individual importance of each wavelet scale to the overall signal is not fully understood.

The Continuous Wavelet Transform
• The CWT decomposes an input signal into components at different scales, each covering a selected frequency band.
• 10-scale decomposition
• Each scale is approximately 1 octave apart.
• f0 reconstruction [Suni et al, 2013]:

$$f_0(x) = \sum_{i=1}^{10} C_i(x)\,(i + 2.5)^{-5/2}$$
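As a minimal sketch of the reconstruction formula on this slide, the function below sums the ten scales with the weight (i + 2.5)^(-5/2). The array layout (one row per scale, one column per frame) and the name `reconstruct_f0` are assumptions made for illustration; the wavelet analysis that produces the scales is not shown.

```python
import numpy as np

def reconstruct_f0(scales: np.ndarray) -> np.ndarray:
    """Sum the CWT scales C_i(x), each weighted by (i + 2.5) ** (-5 / 2).

    `scales` has shape (n_scales, T); row i - 1 holds C_i(x) for i = 1..n_scales.
    This layout is an assumption of the sketch, not something the slides state.
    """
    n_scales = scales.shape[0]
    i = np.arange(1, n_scales + 1, dtype=float)  # scale indices 1..10
    weights = (i + 2.5) ** (-5.0 / 2.0)
    return weights @ scales                      # weighted sum over scales

# Toy input: 10 scales, 4 frames, every coefficient equal to 1.
C = np.ones((10, 4))
f0 = reconstruct_f0(C)
```

With all coefficients equal to 1, every frame of `f0` is simply the sum of the ten weights, which makes the function easy to sanity-check.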

Hypotheses
• Middle frequencies (scales 5-8) are associated with higher levels of naturalness.
• Low frequencies (scales 1-4) don't contain much information and are comparable to HMM-generated f0.
• High frequencies (scales 9-10) are mostly noise and do not contribute much to the perceived naturalness.

Conditions and Reconstruction

Experimental conditions

Condition   Description                                               Freq. (Hz)
natural     Vocoded speech using natural parameters                   -
all         All f0 frequencies.                                       0.1-50
1-2         Low frequencies. Scales indexed at 1 and 2.               0.1-0.2
3-4         Low frequencies. Scales indexed at 3 and 4.               0.4-0.8
1-4         All low frequencies. Scales indexed at 1, 2, 3, and 4.    0.1-0.8
5-6         Middle frequencies. Scales indexed at 5 and 6.            1.6-3.2
7-8         Middle frequencies. Scales indexed at 7 and 8.            6.3-13
5-8         All middle frequencies. Scales indexed at 5, 6, 7, and 8. 1.6-13
9-10        High frequencies. Scales indexed at 9 and 10.             25-50
MSD-HMM     f0 signal predicted from an MSD-HMM.                      -

Table: Experimental conditions with approximate CWT frequency ranges.

F0 reconstruction

$$f_0(x) = \sum_{i=1}^{10} w_i\, C_i(x)\,(i + 2.5)^{-5/2}$$

where $w_i \in \{0, 1\}$ is the weight given to scale $i$.
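The binary weights can be illustrated with a short sketch that turns a condition name from the table into the vector w and applies the weighted reconstruction. The dictionary `CONDITIONS`, the function names, and the array layout are hypothetical. The frequency helper assumes scale 1 sits at about 0.1 Hz and that each scale doubles the previous one, which is only inferred from the approximate ranges in the table.

```python
import numpy as np

# Scale indices kept by each CWT condition, copied from the table.
CONDITIONS = {
    "all": range(1, 11),
    "1-2": (1, 2), "3-4": (3, 4), "1-4": range(1, 5),
    "5-6": (5, 6), "7-8": (7, 8), "5-8": range(5, 9),
    "9-10": (9, 10),
}

def scale_weights(condition: str, n_scales: int = 10) -> np.ndarray:
    """Return w with w_i = 1 for scales kept by the condition, else 0."""
    w = np.zeros(n_scales)
    for i in CONDITIONS[condition]:
        w[i - 1] = 1.0
    return w

def reconstruct_f0(scales: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Weighted sum over scales: sum_i w_i * C_i(x) * (i + 2.5) ** (-5 / 2)."""
    i = np.arange(1, scales.shape[0] + 1, dtype=float)
    return (w * (i + 2.5) ** (-5.0 / 2.0)) @ scales

def approx_freq_hz(i: int, lowest_hz: float = 0.1) -> float:
    """Assumed frequency of scale i: about 0.1 Hz at scale 1, one octave per step."""
    return lowest_hz * 2.0 ** (i - 1)

# Toy scales: 10 rows (scales) by 5 frames of random coefficients.
rng = np.random.default_rng(0)
C = rng.normal(size=(10, 5))
low = reconstruct_f0(C, scale_weights("1-4"))
mid = reconstruct_f0(C, scale_weights("5-8"))
high = reconstruct_f0(C, scale_weights("9-10"))
full = reconstruct_f0(C, scale_weights("all"))
```

Because the reconstruction is linear in the weights, the low, middle, and high conditions add up to the "all" condition.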

Perceptual Experiments
Experiment 1 - Prominence Detection
Participants are asked to judge which word appears more prominent in an utterance.
Experiment 2 - Similarity Experiment
Participants are asked to judge whether a pair of utterances is similar in terms of naturalness.
Experiment 3 - MUSHRA Test
Participants are asked to rate an utterance on a 100-point scale with respect to a reference and all other conditions.
Experiment 4 - MOS Test
Participants are asked to rate an utterance on a 5-point scale with no references.

Experiment 2 - Similarity
• Data
  • Expressive audiobook data
  • mel-cepstral, aperiodicity, voicing parameters
  • HMM system trained on roughly 5000 utterances
  • duration is force-aligned
  • natural (vocoded) condition uses original parameters
  • remaining conditions use natural f0 processed accordingly
• Design
  • 20 utterances synthesized for each of the 10 conditions
  • 10 native listeners, each rating 144 utterance pairs
  • Each pair consists of different utterances and different conditions
  • No repetitions (utterance or condition) within any three consecutive pairs
  • Participants asked to judge if the pair is similar or different in terms of naturalness

Experiment 2 - Similarity
• 45 distinct condition pairs, each pair judged at least 32 times
• Create a 10x10 dissimilarity matrix and embed it into a 2-dimensional space with MDS
• Kruskal's normalized stress-1 value of 0.086
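The embedding step can be sketched as follows. The slides do not say which MDS variant was used, so this example swaps in classical (Torgerson) MDS written with NumPy as a stand-in. The stress function uses the raw dissimilarities as disparities, which is also an assumption. All function names are hypothetical, and the toy matrix is synthetic rather than listener data.

```python
from math import comb
import numpy as np

def classical_mds(D: np.ndarray, n_components: int = 2) -> np.ndarray:
    """Classical (Torgerson) MDS on a symmetric dissimilarity matrix D."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n           # centring matrix
    B = -0.5 * J @ (D ** 2) @ J                   # double-centred Gram matrix
    eigvals, eigvecs = np.linalg.eigh(B)          # eigenvalues in ascending order
    top = np.argsort(eigvals)[::-1][:n_components]
    lam = np.clip(eigvals[top], 0.0, None)        # guard against negative eigenvalues
    return eigvecs[:, top] * np.sqrt(lam)

def stress1(D: np.ndarray, X: np.ndarray) -> float:
    """Kruskal's stress-1, taking the raw dissimilarities as disparities."""
    iu = np.triu_indices(D.shape[0], k=1)
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)[iu]
    return float(np.sqrt(np.sum((D[iu] - d) ** 2) / np.sum(d ** 2)))

# Design arithmetic from the slides: 10 conditions, 10 listeners, 144 pairs each.
n_condition_pairs = comb(10, 2)
judgements_per_pair = 10 * 144 // n_condition_pairs

# Toy check: distances between ten random 2-D points embed without loss.
rng = np.random.default_rng(0)
P = rng.normal(size=(10, 2))
D = np.linalg.norm(P[:, None, :] - P[None, :, :], axis=-1)
X = classical_mds(D)
toy_stress = stress1(D, X)
```

The arithmetic agrees with the slide: 10 conditions give 45 unordered pairs, and 1440 judgements spread evenly over them give 32 per pair. On real listener data the stress would not be near zero as it is for this toy matrix.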

Experiment 2 - Similarity
• Listeners naturally clustered low, middle, and high frequencies
• The all-frequencies condition seems to be similar to the middle frequencies
  • It is also farther from natural speech than the middle frequencies are
• Listeners tend to prefer the CWT middle frequencies
  • These have been previously associated with the word level

Experiment 3 - MUSHRA
• Data
  • Expressive audiobook data (same as Similarity Experiment).
• Design
  • Ask participants to judge all conditions simultaneously (1 to 100).
  • Reference is given as the natural condition.
  • 10 participants rate 20 sets of 10 stimuli.
  • From the 200 expected sets, 48 were discarded as the hidden reference was not judged as natural.
  • 152 sets were used for analysis.
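The screening of rating sets might look like the sketch below. The slides only say that sets were discarded when the hidden reference was not judged as natural. The criterion used here (the hidden reference must score at least 100) is a hypothetical threshold chosen for illustration, as are the function name and the toy ratings.

```python
def screen_sets(sets, reference="natural", min_reference_score=100):
    """Keep only rating sets whose hidden reference reaches the assumed threshold."""
    return [s for s in sets if s[reference] >= min_reference_score]

# Toy rating sets: condition name -> score on the 1-100 scale.
toy_sets = [
    {"natural": 100, "all": 80, "9-10": 20},
    {"natural": 60, "all": 90, "9-10": 30},   # hidden reference missed, discarded
    {"natural": 100, "all": 75, "9-10": 10},
]
kept = screen_sets(toy_sets)

# Set counts reported on the slide.
expected_sets = 10 * 20                  # participants x sets per participant
analysed_sets = expected_sets - 48       # sets remaining after screening
```

The count arithmetic matches the slide: 200 expected sets minus 48 discarded leaves 152 for analysis.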

Experiment 3 - MUSHRA [results figure]

Main Conclusions
• Mid-frequencies
  • Consistently achieve better results
  • Naturalness is almost comparable to all frequencies
  • Have been associated previously with the word level [Suni et al, 2013], [Ribeiro and Clark, 2015]
• Low-frequencies
  • Comparable to HMM-generated f0 (Prominence, MUSHRA, MOS tests)
  • Although not really the same (similarity test)
  • Previously associated with phrase and utterance levels
• High-frequencies
  • Consistently judged the most unnatural condition
  • Not really relevant to naturalness
  • Previously associated with the phone level

Earlier assumptions [Suni et al, 2013], [Ribeiro and Clark, 2015]
1. All wavelet components are equally relevant to the reconstructed signal
2. The association of wavelet components to linguistic levels is meaningful
• First assumption shown not to be true.
  • Middle frequencies carry most of the information
  • Low and high frequencies not so relevant
• How about their association with linguistic levels?

Unit and Peak Rates
• Compute unit and peak rates at utterance level for 5000 utterances
• Count units and peaks (local maxima) and divide by utterance duration in seconds
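A minimal sketch of the rate computation follows, assuming that peaks are strict local maxima of a single wavelet scale and that units are linguistic units (for example syllables or words) counted from an alignment. Both readings are assumptions, and the function names and toy values are hypothetical.

```python
import numpy as np

def count_peaks(signal) -> int:
    """Count strict local maxima: samples larger than both neighbours."""
    x = np.asarray(signal, dtype=float)
    if x.size < 3:
        return 0
    inner = x[1:-1]
    return int(np.sum((inner > x[:-2]) & (inner > x[2:])))

def rate_per_second(count: int, duration_s: float) -> float:
    """Divide a count by the utterance duration in seconds."""
    return count / duration_s

# Toy contour with local maxima at indices 1, 3, and 5.
toy = [0.0, 1.0, 0.0, 2.0, 0.0, 1.0, 0.5]
peaks = count_peaks(toy)
peak_rate = rate_per_second(peaks, duration_s=2.0)
unit_rate = rate_per_second(5, duration_s=2.0)   # e.g. 5 units in a 2 s utterance
```

Comparing `peak_rate` for a given scale with `unit_rate` for a given linguistic level is one way to check whether that scale varies at about the same rate as the level it has been associated with.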

Summary
• Main Findings
  • Wavelet components do not carry equal weights for the f0 signal
  • Middle frequencies convey most of the information
  • HMM-generated f0 is somewhat similar to low frequencies
  • Association with linguistic levels is not very good
• Speech Samples
  • http://homepages.inf.ed.ac.uk/s1250520/samples/interspeech15.html
• Future Work
  • Associate each scale with meaningful linguistic levels
  • Use middle frequencies to learn relevant syllable and word-level features
