  1. Unsupervised neural feature learning for speech using weak top-down constraints Maties Machine Learning (MML), Oct. 2017 Herman Kamper Stellenbosch University http://www.kamperh.com/

  2. Success in speech recognition
     "i had to think of some example speech since speech recognition is really cool" (example recognition output)
     • Google Voice: English, Spanish, German, ..., Zulu (~50 languages)
     • An addiction to labels: 2000 hours of transcribed speech audio; ~350M/560M words of text
     • But there are around 7000 languages spoken in the world today
     [Xiong et al., arXiv'16]; [Saon et al., arXiv'17]

  3. Why learn without labels?
     • Get insight into human language acquisition [Räsänen and Rasilo, '15]
     • Language acquisition in robots [Roy, '99]; [Renkens and Van hamme, '15]
     • Analysis of audio for unwritten languages [Besacier et al., '14]
     • New insights and models for speech processing [Jansen et al., '13]

  4. Unsupervised term discovery (UTD) [Park and Glass, TASLP'08]
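
The UTD figures on this slide were lost in conversion. Conceptually, UTD scans unlabelled speech for pairs of segments that are acoustically similar; Park and Glass do this efficiently with segmental DTW. The sketch below illustrates the idea with plain DTW over pre-cut candidate segments; `dtw_cost`, `discover_pairs`, the brute-force pair loop and the match threshold are all illustrative, not the actual system.

```python
import numpy as np

def dtw_cost(x, y):
    """Length-normalised DTW alignment cost between two MFCC
    sequences (frames x dims). Plain O(len(x)*len(y)) DP."""
    nx, ny = len(x), len(y)
    dist = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    acc = np.full((nx + 1, ny + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, nx + 1):
        for j in range(1, ny + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    return acc[nx, ny] / (nx + ny)

def discover_pairs(segments, threshold=0.4):
    """Return index pairs of candidate segments whose DTW cost falls
    below a match threshold (hypothetical value; tuned per corpus)."""
    pairs = []
    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            if dtw_cost(segments[i], segments[j]) < threshold:
                pairs.append((i, j))
    return pairs
```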

  5. Example: Query-by-example search
     Spoken query: [audio example]
     A useful speech system that does not require any transcribed speech
     [Jansen and Van Durme, IS'12; Saeb et al., IS'17; Settle et al., IS'17]
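
Given any sequence distance such as `dtw_cost` above, query-by-example reduces to ranking the search collection by distance to the spoken query. A minimal sketch assuming numpy feature arrays; `rank_by_query` is a hypothetical helper, and practical systems add indexing to avoid a linear scan:

```python
import numpy as np

def rank_by_query(query_feats, segment_feats, dtw_cost):
    """Rank unlabelled search segments by alignment cost to a spoken
    query; lower cost = better match. `dtw_cost` is any sequence
    distance, e.g. the routine in the UTD sketch above."""
    costs = np.array([dtw_cost(query_feats, s) for s in segment_feats])
    order = np.argsort(costs)  # best matches first
    return order, costs[order]
```

No transcriptions enter anywhere: the ranking quality depends only on how speaker- and noise-invariant the frame-level features are, which is what the rest of the talk targets.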

  6. Unsupervised speech processing: Two problems
     1. Unsupervised frame-level representation learning: learn a feature extractor f_a(·) from unlabelled speech
     2. Unsupervised segmentation and clustering: how do we discover meaningful units in unlabelled speech?

  7. Unsupervised frame-level representation learning: The Correspondence Autoencoder
     (with Micha Elsner, Daniel Renshaw, Aren Jansen and Sharon Goldwater)

  8. Supervised representation learning using DNNs
     Input: speech frame(s), e.g. MFCCs, filterbanks
     Output: predict phone states (ay, ey, k, v, ...)
     The feature extractor f_a(·) is learned from data, jointly with the phone classifier.
     Unsupervised modelling: no phone class targets to train the network on.
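
A minimal sketch of this supervised setup, assuming PyTorch (the talk does not name a toolkit); the layer sizes and the four phone-state targets shown on the slide are placeholders:

```python
import torch.nn as nn

N_MFCC, N_HIDDEN, N_PHONE_STATES = 39, 100, 4  # placeholder sizes

# Feature extractor f_a(.): maps an input frame to a learned representation.
feature_extractor = nn.Sequential(
    nn.Linear(N_MFCC, N_HIDDEN), nn.Tanh(),
    nn.Linear(N_HIDDEN, N_HIDDEN), nn.Tanh(),
)
# Phone classifier, learned jointly on top of f_a(.).
classifier = nn.Linear(N_HIDDEN, N_PHONE_STATES)

def predict_phone_state(frames):  # frames: (batch, N_MFCC) tensor
    return classifier(feature_extractor(frames))

# Training this requires phone-state labels -- exactly what is missing
# in the unsupervised setting.
loss_fn = nn.CrossEntropyLoss()
```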

  9. Autoencoder (AE) neural network [Badino et al., ICASSP'14]
     Input: speech frame; output: reconstruct the input
     • Completely unsupervised
     • But purely bottom-up
     • Can we use top-down information?
     • Idea: unsupervised term discovery
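
A matching sketch of the autoencoder, again assuming PyTorch with illustrative sizes; the training target is the input frame itself, which is why the signal is purely bottom-up:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N_MFCC, N_HIDDEN = 39, 100  # illustrative sizes

autoencoder = nn.Sequential(
    nn.Linear(N_MFCC, N_HIDDEN), nn.Tanh(),  # encoder
    nn.Linear(N_HIDDEN, N_MFCC),             # decoder
)
ae_opt = torch.optim.Adam(autoencoder.parameters(), lr=1e-3)

def ae_step(frames):  # frames: (batch, N_MFCC) tensor
    """One training step: reconstruct the input frame itself."""
    ae_opt.zero_grad()
    loss = F.mse_loss(autoencoder(frames), frames)
    loss.backward()
    ae_opt.step()
    return loss.item()
```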

  10. Unsupervised term discovery (UTD)
      Can we use these discovered word pairs to give weak top-down supervision?

  11. Weak top-down supervision: Align frames [Jansen et al., ICASSP'13]
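
The alignment figure is lost; the mechanism is to align the two segments of each discovered word pair and read frame-level correspondences off the alignment path. A sketch assuming numpy MFCC arrays, using DTW (the standard choice here); the dynamic program is the same as in `dtw_cost` above, with backtracking added:

```python
import numpy as np

def dtw_align(x, y):
    """Return the DTW alignment path between segments x and y
    (each frames x dims) as a list of (i, j) frame-index pairs."""
    nx, ny = len(x), len(y)
    dist = np.linalg.norm(x[:, None, :] - y[None, :, :], axis=-1)
    acc = np.full((nx + 1, ny + 1), np.inf)
    acc[0, 0] = 0.0
    for i in range(1, nx + 1):
        for j in range(1, ny + 1):
            acc[i, j] = dist[i - 1, j - 1] + min(
                acc[i - 1, j], acc[i, j - 1], acc[i - 1, j - 1])
    # Backtrack from the end to recover the alignment path.
    path, i, j = [], nx, ny
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin(
            [acc[i - 1, j - 1], acc[i - 1, j], acc[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# Each (i, j) on the path yields a training pair: input frame x[i],
# target frame y[j] -- the weak top-down supervision for the cAE.
```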

  12. Autoencoder (AE): input speech frame, reconstruct the input

  13. Correspondence autoencoder (cAE)
      Input: a frame from one word of a discovered pair
      Output target: the aligned frame from the other word in the pair
      The hidden layers form an unsupervised feature extractor f_a(·) that combines top-down and bottom-up information.
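
A sketch of the cAE training step, assuming PyTorch and frame pairs from `dtw_align` above; it differs from the plain autoencoder only in its target, which is the aligned frame from the other word rather than the input itself:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N_MFCC, N_HIDDEN = 39, 100  # illustrative sizes

cae = nn.Sequential(
    nn.Linear(N_MFCC, N_HIDDEN), nn.Tanh(),  # encoder: read f_a(.) here
    nn.Linear(N_HIDDEN, N_MFCC),             # decoder
)
cae_opt = torch.optim.Adam(cae.parameters(), lr=1e-3)

def cae_step(frames_a, frames_b):  # DTW-aligned frame batches
    """One training step: input a frame from one word, reconstruct
    the aligned frame from the other word in the pair."""
    cae_opt.zero_grad()
    loss = F.mse_loss(cae(frames_a), frames_b)  # target: the *other* frame
    loss.backward()
    cae_opt.step()
    return loss.item()
```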

  14. Correspondence autoencoder (cAE): training scheme [Kamper et al., ICASSP'15]
      (1) Train a stacked autoencoder on the speech corpus (pretraining)
      (2) Run unsupervised term discovery and align the frames of each word pair
      (3) Use the stacked autoencoder to initialize the cAE weights
      (4) Train the correspondence autoencoder; its hidden layers give the unsupervised feature extractor
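
Hypothetical glue for the four steps, reusing the `autoencoder`, `cae`, `ae_step` and `cae_step` sketches above (the actual system pretrains the stack layer by layer and takes features from an intermediate layer):

```python
def train_cae_pipeline(corpus_batches, aligned_pair_batches, n_epochs=5):
    """Sketch of the four-step training scheme."""
    # (1) Pretrain the (stacked) autoencoder on the whole speech corpus.
    for _ in range(n_epochs):
        for frames in corpus_batches:
            ae_step(frames)
    # (2) Unsupervised term discovery plus dtw_align (earlier sketches)
    #     is assumed to have produced `aligned_pair_batches` of
    #     (frames_a, frames_b) tensors.
    # (3) Initialize the cAE from the pretrained autoencoder weights
    #     (identical architectures in this sketch, so the keys match).
    cae.load_state_dict(autoencoder.state_dict())
    # (4) Train the correspondence autoencoder on aligned frame pairs.
    for _ in range(n_epochs):
        for frames_a, frames_b in aligned_pair_batches:
            cae_step(frames_a, frames_b)
```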

  15. Evaluation: Query-by-example search
      Spoken query: [audio example]
      [Jansen and Van Durme, IS'12; Saeb et al., IS'17; Settle et al., IS'17]

  16. Evaluation: Isolated word query-by-example
      [Bar chart: average precision (y-axis, 0.0 to 0.5) for Autoencoder, UBM-GMM, TopUBM and cAE features]
      Extended: [Renshaw et al., IS'15] and [Yuan et al., IS'16]

  17. Summary and conclusion
      • Introduced the correspondence autoencoder (cAE) for unsupervised frame-level representation learning
      • Uses top-down information from an unsupervised term discovery system
      • Uses bottom-up initialization on a large speech corpus
      • An unsupervised neural network model that combines top-down and bottom-up information yields large intrinsic improvements
      • Links with language acquisition research
      • Future: more analysis; different domains; practical search systems

  18. http://www.kamperh.com/ https://github.com/kamperh

  19. Evaluation of features: same-different task
      Given spoken word instances ("apple", "pie", "grape", "apple", "apple", "like"): treat each instance in turn as a query ("apple") and the remaining instances as terms to search; good features should rank the other instances of the same word above the different words.
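
A sketch of how such an evaluation can be scored, assuming numpy-style segment features, scikit-learn, and any segment distance such as `dtw_cost` above; the actual same-different protocol differs in some details (e.g. how same-speaker pairs are treated):

```python
from sklearn.metrics import average_precision_score

def same_different_ap(segments, labels, dist_fn):
    """Average precision for a same-different style evaluation:
    score every segment pair by negative distance and measure how
    well that ranking separates same-word from different-word pairs."""
    scores, targets = [], []
    for i in range(len(segments)):
        for j in range(i + 1, len(segments)):
            scores.append(-dist_fn(segments[i], segments[j]))
            targets.append(int(labels[i] == labels[j]))
    return average_precision_score(targets, scores)
```

With better features (e.g. cAE representations instead of raw MFCCs feeding the distance), same-word pairs get smaller distances and the average precision rises.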
