

  1. Almost Unsupervised Text to Speech and Automatic Speech Recognition Yi Ren*, Xu Tan*, Tao Qin, Sheng Zhao, Zhou Zhao, Tie-Yan Liu Microsoft Research Zhejiang University

  2. Motivation
     • ASR and TTS can achieve good performance given a large amount of paired data. However, many low-resource languages in the world lack the supervised data needed to build TTS and ASR systems.
     • We propose a practical way to leverage a small amount of paired data plus additional unpaired speech and text data to build TTS and ASR systems.

  3. Model Architecture

  4. Denoising Auto-Encoder
     • We adopt a denoising auto-encoder (DAE) to build these capabilities (green and yellow lines in the architecture figure).
     • Representation extraction: how to understand the speech or text sequence.
     • Language modeling: how to model and generate sequences in the speech and text domains.
     [Figure: DAE (Speech) and DAE (Text), e.g. reconstructing the corrupted text "I xx a boy." back into "I am a boy."; a minimal sketch of the idea follows.]
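The slide does not spell out the corruption scheme, so the snippet below is only a minimal Python sketch of the denoising idea: randomly mask tokens (or frames) and train the model to reconstruct the original sequence. The `corrupt` function, the `<mask>` token, and the mask probability `p` are illustrative assumptions, not the paper's exact setup.

```python
import random

def corrupt(tokens, mask_token="<mask>", p=0.3, seed=None):
    """Randomly mask tokens so the auto-encoder must reconstruct them.
    A hypothetical corruption scheme for illustration only."""
    rng = random.Random(seed)
    return [mask_token if rng.random() < p else t for t in tokens]

# e.g. ["I", "am", "a", "boy", "."] -> ["I", "<mask>", "a", "boy", "."]
# The DAE is trained to map the corrupted sequence back to the original,
# which builds representation extraction and language modeling ability
# in each domain (speech and text) from unpaired data alone.
```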

  5. Dual Transformation
     • Dual transformation is the key component that leverages the dual nature of TTS and ASR and develops the capability of speech-text conversion.
     [Figure: TTS (inference) synthesizes speech for unpaired text such as "I am a boy." / "I love ASR", which is used to train ASR; ASR (inference) transcribes unpaired speech, which is used to train TTS. A sketch of one round follows.]
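As a rough illustration of dual transformation, the sketch below alternates between the two directions: the current TTS model synthesizes speech for unpaired text to create pseudo-paired training data for ASR, and the current ASR model transcribes unpaired speech to create pseudo-paired training data for TTS. The `tts`/`asr` objects and their `synthesize`, `transcribe`, and `train_step` methods are hypothetical interfaces used only for illustration.

```python
def dual_transformation_round(tts, asr, unpaired_text, unpaired_speech):
    """One round of dual transformation (a sketch with hypothetical models)."""
    # TTS -> ASR: synthesize speech for unpaired text, then update ASR
    # on the resulting pseudo-paired (speech, text) examples.
    for text in unpaired_text:
        speech = tts.synthesize(text)   # inference with the current TTS
        asr.train_step(speech, text)    # supervised update for ASR

    # ASR -> TTS: transcribe unpaired speech, then update TTS on the
    # pseudo-paired (text, speech) examples.
    for speech in unpaired_speech:
        text = asr.transcribe(speech)   # inference with the current ASR
        tts.train_step(text, speech)    # supervised update for TTS
```

As both models improve, the pseudo-paired data they produce for each other also improves, so the two directions bootstrap one another.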

  6. Bidirectional Sequence Modeling
     • Sequence generation suffers from the error-propagation problem, especially for speech sequences, which are usually longer than text.
     • Under dual transformation, the later part of the generated sequence is therefore always of lower quality.
     • We propose bidirectional sequence modeling (BSM), which generates the sequence in both left-to-right and right-to-left directions (see the sketch below).
     [Figure: TTS/ASR trained on both the original sequence "i am a boy" and its reversed counterpart "yob a ma i".]
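The sketch below only shows how a right-to-left training target can be obtained by reversing the sequence (at the token level here; the slide's example reverses characters). How the two directions share model parameters and are marked during training is not covered on this slide, so treat this as an assumption-level illustration.

```python
def make_bidirectional_targets(target_tokens):
    """Build left-to-right and right-to-left targets for BSM (a sketch)."""
    l2r = list(target_tokens)
    r2l = list(reversed(target_tokens))
    return l2r, r2l

# e.g. ["i", "am", "a", "boy"] ->
#   l2r: ["i", "am", "a", "boy"]
#   r2l: ["boy", "a", "am", "i"]
# Training on both directions means neither end of the sequence is always
# generated last, which mitigates error propagation in long outputs.
```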

  7. Audio Samples
     Text 1: "Printing then for our purpose may be considered as the art of making books by means of movable types."
     Text 2: "A further development of the Roman letter took place at Venice."
     [Audio samples for each text, comparing Pair-200 with Our method.]

  8. Results
     • Our Method: leverages 200 paired samples + 12,300 unpaired samples.
     • Pair-200: leverages only the 200 paired samples.
     • Supervised: leverages all 12,500 paired samples.
     • GT: the ground-truth audio.
     • GT (Griffin-Lim): audio generated from the ground-truth mel-spectrograms using the Griffin-Lim algorithm (illustrated below).
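For readers unfamiliar with the GT (Griffin-Lim) baseline, here is a hedged sketch of inverting a mel-spectrogram to a waveform with the Griffin-Lim algorithm using librosa. The sample rate, `n_fft`, and `hop_length` values are assumptions, not the paper's configuration, and the authors may use a different toolkit.

```python
import librosa

def mel_to_waveform(mel_spec, sr=22050, n_fft=1024, hop_length=256):
    """Invert a (power) mel-spectrogram to audio with Griffin-Lim (a sketch)."""
    # Map mel bins back to an approximate linear-magnitude spectrogram,
    # then iteratively estimate phase with Griffin-Lim.
    linear = librosa.feature.inverse.mel_to_stft(mel_spec, sr=sr, n_fft=n_fft)
    return librosa.griffinlim(linear, hop_length=hop_length)
```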

  9. Results
     [Charts: MOS (higher is better) and PER (lower is better).]
     • Our method leverages only 200 paired speech-and-text samples plus additional unpaired data.
     • It greatly outperforms the method that uses only the 200 paired samples.
     • It comes close to the performance of the supervised method (which uses all 12,500 paired samples).

  10. Thanks!

  11. Experiments
     • Training and evaluation setup
     • Datasets
       • LJSpeech contains 13,100 audio clips with transcripts, approximately 24 hours in total.
     • Evaluation
       • TTS: Intelligibility Rate and MOS (mean opinion score).
       • ASR: PER (phoneme error rate); a reference computation is sketched below.
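PER is the edit (Levenshtein) distance between the predicted and reference phoneme sequences divided by the reference length. The function below is a straightforward reference implementation of that standard definition; the paper's actual scoring script is not shown on the slide.

```python
def phoneme_error_rate(ref, hyp):
    """PER = Levenshtein distance over phonemes / length of the reference."""
    # Dynamic-programming edit distance over phoneme tokens.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i                               # deletions
    for j in range(len(hyp) + 1):
        d[0][j] = j                               # insertions
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution / match
    return d[len(ref)][len(hyp)] / max(1, len(ref))

# e.g. phoneme_error_rate(["AY", "AE", "M"], ["AY", "EH", "M"]) -> 0.333...
```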

  12. Analysis • Ablation Study on different components of our method

  13. Analysis
     [Charts: MOS (TTS, higher is better) and PER (%) (ASR, lower is better).]
