Continuous Fundamental Frequency Prediction with Deep Neural - PowerPoint PPT Presentation

Continuous Fundamental Frequency Prediction with Deep Neural Networks Bálint Pál Tóth, Tamás Gábor Csapó csapot@tmit.bme.hu Budapest University of Technology and Economics http://smartlab.tmit.bme.hu

Introduction
Deep Learning: new era of machine learning
Feed forward deep neural networks
Speech research
 Speech recognition
 Speech coding
 Speech synthesis: using parametric vocoders
   spectral components,
   phone durations,
   fundamental frequency (= pitch = F0).

Fundamental frequency prediction
Rule based F0 prediction
Statistical / machine learning approach
 Hidden Markov Models (HMM)
 Feed forward deep neural networks (DNN)
Pitch tracking algorithm
 Vanilla: standard F0 tracking + voiced/unvoiced tagging
   Difficulty in modeling standard F0
   Discontinuity in unvoiced regions
   Multi-Space Distribution Hidden Markov Models (MSD-HMM)
 Proposed: continuous F0 + Maximum Voiced Frequency
   No discontinuity, less difficulty in modeling

Continuous Pitch Tracking
Example sentence: ’I saw it all myself, and it was splendid.’

Goal
Investigation of
1) the modeling power of feed forward deep neural networks,
2) the model complexity of vanilla and continuous F0 trajectories
Hypothesis
Perceptual quality of DNN-based prediction using continuous F0 will be superior to that using discontinuous F0

Vocoder methods I: Standard F0 (baseline)
Pulse-noise vocoder
SWIPE pitch tracking algorithm (Camacho & Harris 2008)
2 parameters for every 25 ms long (5 ms shift) window:
 F0 value for voiced regions
   For DNN, linear interpolation in unvoiced regions
 Voiced / unvoiced binary flag
Denoted by F0std
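The linear interpolation over unvoiced regions mentioned above can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes unvoiced frames are marked with 0.0, that leading and trailing unvoiced frames take the nearest voiced value, and that the function name `interpolate_unvoiced` is hypothetical. The slides do not say whether interpolation is done on F0 or on LogF0; the sketch works on whatever values it is given.

```python
def interpolate_unvoiced(f0):
    """Fill unvoiced frames (assumed to be 0.0) by linear interpolation.

    Returns the interpolated contour and the voiced/unvoiced flags.
    Edge frames before the first / after the last voiced frame are
    filled with the nearest voiced value (an assumption of this sketch).
    """
    voiced = [value > 0.0 for value in f0]
    idx = [i for i, is_voiced in enumerate(voiced) if is_voiced]
    if not idx:
        raise ValueError("no voiced frames to interpolate from")

    out = list(f0)
    # Fill the leading and trailing unvoiced frames.
    for i in range(idx[0]):
        out[i] = f0[idx[0]]
    for i in range(idx[-1] + 1, len(f0)):
        out[i] = f0[idx[-1]]
    # Fill each interior gap between two neighbouring voiced frames.
    for a, b in zip(idx, idx[1:]):
        for i in range(a + 1, b):
            t = (i - a) / (b - a)
            out[i] = f0[a] + t * (f0[b] - f0[a])
    return out, voiced
```

The returned flags correspond to the separate voiced / unvoiced binary parameter of F0std, so the original discontinuous contour can be recovered by zeroing the unvoiced frames.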

Vocoder methods II: Continuous F0
Residual-based continuous vocoder
SSP pitch tracking algorithm (Garner et al., 2013)
2 parameters for every 25 ms long (5 ms shift) window:
 F0 value for all regions
 Maximum Voiced Frequency (MVF)
   Voiced-unvoiced frequency boundary
Denoted by F0cont

Machine Learning Methods: HMM
Widespread statistical parametric speech synthesis approach
Vocoder I: F0std training and prediction (with MSD-HMM)
Vocoder II: F0cont & MVF training and prediction
HTS 2.3 with default settings

Machine Learning Methods: DNN
Feed forward deep neural networks
 Mean Square Error (MSE) cost function
 ADADELTA optimization with mini-batches
 Parametric Rectified Linear Units (PReLU) as activation function for hidden layers
 Sigmoid activation function for the outputs
 Weight initialization:
   Xavier initialization for the input-hidden and hidden-output layers
   Orthogonal in the hidden layers
 Dropout with 50% rate after each layer except the output layer
 Early stopping after 50 epochs
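The activation and cost functions named above can be written out in a few lines. This is only an illustrative sketch in plain Python, not the training code used for the experiments. In PReLU the negative slope `a` is a learned parameter; the value 0.25 below is just a commonly used starting value and is an assumption here.

```python
import math

def prelu(x, a=0.25):
    """Parametric ReLU: identity for x > 0, slope `a` otherwise.

    `a` is learned during training; 0.25 is an assumed initial value.
    """
    return x if x > 0 else a * x

def sigmoid(x):
    """Numerically stable logistic sigmoid, used on the output layer."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)

def mse(predicted, target):
    """Mean square error between two equally long sequences."""
    if len(predicted) != len(target) or not predicted:
        raise ValueError("sequences must be non-empty and of equal length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)
```

Because the sigmoid output lies strictly between 0 and 1, the training targets have to be scaled into that interval as well, which is why the outputs are normalized to 0.01…0.99.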

Proposed DNN network

DNN Inputs
Parameter-wise transformed to zero mean and unit variance

Feature name | # | Type
Quinphone | 5*68 | One-hot
Number of phonemes/syllables/words/phrases in the previous/current/next syllable/word/phrase/sentence | 4*3 | Numerical
Number of syllables/words in the current sentence | 2 | Numerical
Forward/backward position of the current phoneme/syllable/word/phrase in the syllable/word/phrase/sentence | 2*3 | Numerical
Phone boundaries | 2 | Numerical
Percentage position of the current frame within the phone | 1 | Numerical
Altogether | 363 |
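The input dimensionality and the per-parameter normalization can be checked with a short sketch. The group labels in `FEATURE_SIZES` are shortened paraphrases of the table rows, and the use of the population standard deviation is an assumption; the slides only state "zero mean and unit variance".

```python
import statistics

# Sizes of the input feature groups as listed in the table above.
FEATURE_SIZES = {
    "quinphone (one-hot)": 5 * 68,
    "unit counts in previous/current/next context": 4 * 3,
    "syllables/words in current sentence": 2,
    "forward/backward positions": 2 * 3,
    "phone boundaries": 2,
    "frame position within phone": 1,
}

def total_inputs(sizes):
    """Sum the group sizes to get the DNN input dimensionality."""
    return sum(sizes.values())

def standardize(column):
    """Transform one input parameter to zero mean and unit variance.

    Uses the population standard deviation (an assumption); a constant
    column is mapped to all zeros to avoid division by zero.
    """
    mean = statistics.fmean(column)
    std = statistics.pstdev(column)
    if std == 0.0:
        return [0.0 for _ in column]
    return [(x - mean) / std for x in column]
```

Summing the groups gives 340 + 12 + 2 + 6 + 2 + 1 = 363, which matches the "Altogether" row.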

DNN Outputs
Normalized to 0.01…0.99 for sigmoid activation

System | Feature name | # | Type
F0std | LogF0 | 1 | Continuous (interpolated)
F0std | V/UV flag | 1 | Binary
F0cont | LogF0 | 1 | Continuous
F0cont | MVF | 1 | Continuous
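One way to map an output parameter into 0.01…0.99 is plain min-max scaling, sketched below. The slides do not specify the exact mapping, so the min-max choice and the helper names `scale_to_range` and `unscale` are assumptions of this example.

```python
def scale_to_range(values, lo=0.01, hi=0.99):
    """Min-max scale `values` into [lo, hi].

    Returns the scaled values together with the (min, max) pair that is
    needed to invert the mapping at synthesis time.
    """
    vmin, vmax = min(values), max(values)
    if vmax == vmin:
        # A constant parameter carries no information; place it mid-range.
        return [(lo + hi) / 2.0 for _ in values], (vmin, vmax)
    span = vmax - vmin
    scaled = [lo + (hi - lo) * (v - vmin) / span for v in values]
    return scaled, (vmin, vmax)

def unscale(scaled, vmin, vmax, lo=0.01, hi=0.99):
    """Invert `scale_to_range` for network predictions."""
    return [vmin + (s - lo) * (vmax - vmin) / (hi - lo) for s in scaled]
```

Keeping the targets away from exactly 0 and 1 avoids asking the sigmoid output for values it can only reach asymptotically.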

Evaluation: hyperparameter optimization
One male and one female speaker from the Precisely Labelled Hungarian Database (PLHD)
1984 utterances / speaker (~2 hours)
Training-validation-test sets: 80-15-5%
Hyperparameter optimization with the male speaker:
 number of hidden layers: 1..7
 number of neurons / layer: 80..2048
 mini-batch size: 8..256
Validation loss was measured.
64 neural nets for F0std and 73 for F0cont
Top 5 were selected and run with the female speaker
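As a rough illustration of the 80-15-5% split, the sketch below computes set sizes for 1984 utterances. The rounding rule (floor for training and validation, remainder to test) is an assumption; the slides do not say how the fractional counts were handled, and `split_counts` is a hypothetical helper.

```python
def split_counts(n_utterances, train_pct=80, valid_pct=15):
    """Return (train, validation, test) utterance counts.

    Training and validation sizes are floored using integer arithmetic;
    the test set receives whatever remains, so the three counts always
    sum to `n_utterances`.
    """
    if train_pct + valid_pct > 100:
        raise ValueError("percentages must not exceed 100 in total")
    train = n_utterances * train_pct // 100
    valid = n_utterances * valid_pct // 100
    test = n_utterances - train - valid
    return train, valid, test

train, valid, test = split_counts(1984)
print(train, valid, test)
```

Under this rounding rule the split for one speaker is 1587 training, 297 validation and 100 test utterances.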

Hyperopt results: standard F0

ID | # Hidden layers | # Neurons | Epochs | Validation MSE
F0std-1 | 3 | 350 | 61 | 0.01076
F0std-2 | 3 | 650 | 32 | 0.01078
F0std-3 | 3 | 900 | 30 | 0.01089
F0std-4 | 3 | 950 | 36 | 0.01099
F0std-5 | 3 | 800 | 37 | 0.01103

Mini-batch size = 128

Hyperopt results: continuous F0

ID | # Hidden layers | # Neurons | Epochs | Validation MSE
F0cont-1 | 3 | 160 | 2 | 0.00239
F0cont-2 | 3 | 80 | 67 | 0.00346
F0cont-3 | 1 | 128 | 2 | 0.00349
F0cont-4 | 3 | 70 | 12 | 0.00352
F0cont-5 | 2 | 100 | 28 | 0.00356

Mini-batch size = 8

Objective evaluation I
Mean correlation between natural F0 and modeled F0
(higher value: larger similarity between compared F0 trajectories)
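A mean correlation of this kind could be computed as below. The sketch assumes the Pearson correlation coefficient, calculated per utterance on frame-aligned trajectories and then averaged; the slides do not state which correlation measure or averaging scheme was used.

```python
import math

def pearson(x, y):
    """Pearson correlation between two frame-aligned F0 trajectories."""
    n = len(x)
    if n != len(y) or n < 2:
        raise ValueError("trajectories must have equal length >= 2")
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("correlation is undefined for a constant trajectory")
    return sxy / math.sqrt(sxx * syy)

def mean_correlation(pairs):
    """Average the per-utterance correlation over (natural, modeled) pairs."""
    return sum(pearson(natural, modeled) for natural, modeled in pairs) / len(pairs)
```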

Objective evaluation II
Mean RMSE between natural F0 and modeled F0
(higher value: larger difference between compared F0 trajectories)
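The RMSE measure can be sketched in the same way. As before, this assumes frame-aligned trajectories of equal length and a simple average over utterances; the unit (Hz or log-Hz) depends on the values passed in and is not specified on the slide.

```python
import math

def rmse(natural, modeled):
    """Root mean square error between two frame-aligned F0 trajectories."""
    if len(natural) != len(modeled) or not natural:
        raise ValueError("trajectories must be non-empty and of equal length")
    squared = sum((n - m) ** 2 for n, m in zip(natural, modeled))
    return math.sqrt(squared / len(natural))

def mean_rmse(pairs):
    """Average the per-utterance RMSE over (natural, modeled) pairs."""
    return sum(rmse(natural, modeled) for natural, modeled in pairs) / len(pairs)
```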

Subjective evaluation I
Goal: measure the perceived intonation of sentences
Web-based MUSHRA test:
 Reference natural sentence,
 Vocoded sentence with F0 from
   Natural utterance
     F0std
     F0cont
   HMM
     F0std
     F0cont
   DNN
     F0std
     F0cont
 Benchmark: vocoded with F0=0

Subjective evaluation II
Sentences with highest RMSE were selected
2 speakers × 8 types × 5 sentences (altogether 80 sentences)
Randomized order
18 test subjects (9 females, 9 males)
13 minutes to complete the test (avg)

Subjective evaluation III
(higher value: more similar to natural)

Conclusions and discussion
1) F0cont can be predicted better than F0std with HMMs and DNNs
2) Simpler DNN models for F0cont (good for embedded systems)
3) F0cont has faster convergence (we measured approx. 7x faster than F0std)
4) Simple DNN approaches the F0 modeling capacity of state-of-the-art HMM
 The continuous representation of F0 forms a less complex system than the V/UV-based F0std

Thanks for listening! csapot@tmit.bme.hu http://smartlab.tmit.bme.hu
