Unsupervised neural and Bayesian models for zero-resource speech processing
Herman Kamper, University of Edinburgh; TTI at Chicago
MIT CSAIL, 15 Nov. 2016
http://www.kamperh.com
Speech recognition success [Xiong et al., arXiv'16]
• Google Voice: English, Spanish, German, . . . , Zulu (∼50 languages)
• Data: 2000 hours of labelled speech audio; ∼350M words of text
• But: can we do this for all 7000 languages spoken in the world?
Unsupervised speech processing
Developing unsupervised methods that can learn structure directly from raw speech audio, i.e. zero-resource technology
Criticism: there is always some data available, so this is really a semi-supervised problem
Reasons for studying the purely unsupervised case:
• Modelling infant language acquisition [Räsänen, SpecCom'12]
• Language acquisition in robotics [Renkens and Van hamme, IS'15]
• Analysis of audio for unwritten languages [Besacier et al., SpecCom'14]
• New insights and models for speech processing [Jansen et al., ICASSP'13]
Unsupervised speech processing: Two problems
1. Unsupervised frame-level representation learning: learn a feature extractor f_a(·) from unlabelled speech
2. Unsupervised segmentation and clustering: how do we discover meaningful units in unlabelled speech?
Unsupervised term discovery (UTD) [Park and Glass, TASLP'08]
Full-coverage segmentation and clustering
Unsupervised speech processing: Two problems
1. Unsupervised frame-level representation learning: f_a(·)
2. Unsupervised segmentation and clustering: we focus on full-coverage segmentation and clustering
Our claim: Unsupervised speech processing benefits from both top-down and bottom-up modelling
Top-down and bottom-up modelling [Feldman et al., CCSS'09]
Top-down: use knowledge of higher-level units to learn about lower-level parts
Bottom-up: piece together lower-level parts to get more complex higher-level structures
Unsupervised frame-level representation learning: The Correspondence Autoencoder
Micha Elsner · Daniel Renshaw · Aren Jansen · Sharon Goldwater
Supervised representation learning using a DNN
Input: speech frame(s), e.g. MFCCs, filterbanks
Output: predict phone states (ay, ey, k, v)
The phone classifier and the feature extractor f_a(·) are learned jointly from data
Unsupervised modelling: no phone class targets to train the network on
Autoencoder (AE) neural network [Badino et al., ICASSP'14]
Input speech frame → reconstruct the input
• Completely unsupervised
• But purely bottom-up
• Can we use top-down information?
• Idea: unsupervised term discovery
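To make the purely bottom-up setup concrete, here is a minimal numpy sketch of an autoencoder trained to reconstruct its input frames. The toy data, single hidden layer and hyperparameters are illustrative assumptions, not the actual system:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for speech frames: 200 frames, 39-d (e.g. MFCCs + deltas).
X = rng.standard_normal((200, 39))

# One hidden layer of 10 units: the bottleneck code plays the role of f_a(x).
d_in, d_hid = X.shape[1], 10
W1 = 0.1 * rng.standard_normal((d_in, d_hid)); b1 = np.zeros(d_hid)
W2 = 0.1 * rng.standard_normal((d_hid, d_in)); b2 = np.zeros(d_in)

losses, lr = [], 0.01
for _ in range(200):
    H = np.tanh(X @ W1 + b1)          # hidden code
    X_hat = H @ W2 + b2               # linear reconstruction of the input
    err = X_hat - X                   # gradient of squared error w.r.t. X_hat
    losses.append(float(np.mean(err ** 2)))
    gW2 = H.T @ err / len(X); gb2 = err.mean(axis=0)
    dH = (err @ W2.T) * (1 - H ** 2)  # backprop through tanh
    gW1 = X.T @ dH / len(X); gb1 = dH.mean(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2
```

The reconstruction loss falls as the bottleneck captures the main directions of variation; the hidden activations H are the unsupervised features.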
Unsupervised term discovery (UTD)
Can we use these discovered word pairs to give weak top-down supervision?
Weak top-down supervision: Align frames [Jansen et al., ICASSP'13]
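The frame-level alignment between the two words of a discovered pair can be computed with dynamic time warping. A minimal sketch in plain numpy, assuming Euclidean frame distances and the standard symmetric step pattern:

```python
import numpy as np

def dtw_align(A, B):
    """Dynamic time warping between frame sequences A (n x d) and B (m x d).
    Returns the alignment path as (i, j) frame-index pairs."""
    n, m = len(A), len(B)
    dist = np.linalg.norm(A[:, None, :] - B[None, :, :], axis=-1)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i, j] = dist[i - 1, j - 1] + min(
                cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1])
    # Backtrace from the end to recover which frames are matched.
    path, i, j = [(n - 1, m - 1)], n, m
    while (i, j) != (1, 1):
        _, i, j = min((cost[i - 1, j - 1], i - 1, j - 1),
                      (cost[i - 1, j], i - 1, j),
                      (cost[i, j - 1], i, j - 1))
        path.append((i - 1, j - 1))
    return path[::-1]

# Two renditions of a "word" with different timing align frame-by-frame.
path = dtw_align(np.array([[0.0], [1.0], [2.0]]),
                 np.array([[0.0], [1.0], [1.0], [2.0]]))
```

Each (i, j) pair in the path gives one aligned frame pair; these are exactly the pairs used as weak supervision below.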
Autoencoder (AE)
Input speech frame → reconstruct input
Correspondence autoencoder (cAE)
Input: frame from one word → output: frame from the other word in the pair
Unsupervised feature extractor f_a(·) combines top-down and bottom-up information
Correspondence autoencoder (cAE) training [Kamper et al., ICASSP'15]:
(1) Train stacked autoencoder on the speech corpus (pretraining)
(2) Unsupervised term discovery; align word-pair frames
(3) Initialize weights from the stacked autoencoder
(4) Train correspondence autoencoder → unsupervised feature extractor
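The key difference from a plain autoencoder fits in a few lines: the network takes a frame from one rendition as input but is trained to output the aligned frame from the other rendition. The synthetic "pair", single hidden layer and hyperparameters below are illustrative assumptions; the actual model is a deep network initialized from stacked-autoencoder pretraining:

```python
import numpy as np

rng = np.random.default_rng(1)

# Pretend UTD found a word pair: two renditions of the same word type,
# already frame-aligned (e.g. by DTW), so rows correspond one-to-one.
frames_a = rng.standard_normal((60, 39))                    # rendition 1
frames_b = frames_a + 0.3 * rng.standard_normal((60, 39))   # rendition 2

# cAE: input a frame from one rendition, train the network to reconstruct
# the *aligned frame of the other rendition*, not the input itself.
d, h = 39, 20
W1 = 0.1 * rng.standard_normal((d, h)); b1 = np.zeros(h)
W2 = 0.1 * rng.standard_normal((h, d)); b2 = np.zeros(d)

losses, lr = [], 0.05
for _ in range(300):
    H = np.tanh(frames_a @ W1 + b1)
    err = (H @ W2 + b2) - frames_b     # target is the paired frame
    losses.append(float(np.mean(err ** 2)))
    gW2 = H.T @ err / len(H); gb2 = err.mean(axis=0)
    dH = (err @ W2.T) * (1 - H ** 2)
    gW1 = frames_a.T @ dH / len(H); gb1 = dH.mean(axis=0)
    W1 -= lr * gW1; b1 -= lr * gb1; W2 -= lr * gW2; b2 -= lr * gb2
```

Because the target differs from the input only by nuisance variation, the hidden code is pushed to keep what the two renditions share and discard the rest.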
Intrinsic evaluation: Isolated word query task
[Bar chart: average precision for autoencoder, UBM-GMM, TopUBM and cAE features]
Extended: [Renshaw et al., IS'15] and [Yuan et al., IS'16]
Unsupervised segmentation and clustering: The Segmental Bayesian Model
Aren Jansen · Sharon Goldwater
Full-coverage segmentation and clustering
Segmental modelling for full-coverage segmentation
Previous models perform explicit subword discovery directly on speech features, e.g. [Lee et al., 2015]
Our approach uses whole-word segmental representations, i.e. acoustic word embeddings [Kamper et al., IS'15; Kamper et al., TASLP'16]
Acoustic word embeddings
Embed each segment in d-dimensional space: x_i = f_e(Y_i) ∈ R^d, e.g. f_e(Y_1), f_e(Y_2)
Dynamic programming alignment has quadratic complexity, while embedding comparison is linear time. Can use standard clustering.
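One simple way to get a fixed-dimensional embedding f_e is to downsample each segment to a fixed number of frames and flatten. This is only a baseline sketch with arbitrary dimensionalities, but it shows why comparing two segments becomes a single linear-time distance instead of a quadratic dynamic programming alignment:

```python
import numpy as np

rng = np.random.default_rng(0)

def downsample_embed(frames, n=10):
    """Embed a variable-length frame sequence (T x d) as a fixed-length
    vector by keeping n frames uniformly spaced in time and flattening."""
    idx = np.linspace(0, len(frames) - 1, n).round().astype(int)
    return frames[idx].reshape(-1)

# Two "word segments" of different durations map to same-sized vectors,
# so comparison is one O(n*d) distance computation, and standard
# clustering methods apply directly to the embedding vectors.
a = downsample_embed(rng.standard_normal((53, 13)))
b = downsample_embed(rng.standard_normal((78, 13)))
dist = float(np.linalg.norm(a - b))
```

Any embedding function with this fixed-output property can be swapped in; learned embeddings simply replace the downsampling.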
Unsupervised segmental Bayesian model
Speech waveform → acoustic frames y_1:M via f_a(·)
Embeddings: x_i = f_e(y_{t1:t2})
Acoustic modelling: Bayesian Gaussian mixture model, p(x_i | h^-)
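A toy version of the acoustic model illustrates p(x_i | h^-): with a conjugate zero-mean Gaussian prior on component means and a known spherical noise variance, the component mean can be marginalised out and embeddings assigned by collapsed Gibbs sampling. Everything below (2-D data, the priors, the finite K-component approximation) is an illustrative assumption, not the model's actual configuration:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy acoustic word embeddings: two well-separated "word types" in 2-D.
X = np.vstack([rng.normal(-3, 0.5, (20, 2)), rng.normal(3, 0.5, (20, 2))])
N, D = X.shape
K, alpha = 4, 1.0               # finite mixture, Dirichlet concentration
var, var0 = 0.5 ** 2, 5.0 ** 2  # known noise variance; prior variance on means

def log_pred(x, Xk):
    """log p(x | embeddings already in the component), with the component
    mean marginalised out under a zero-mean conjugate Gaussian prior."""
    n = len(Xk)
    s = Xk.sum(axis=0) if n else np.zeros(D)
    post_mean = var0 * s / (n * var0 + var)
    v = var0 * var / (n * var0 + var) + var   # predictive variance
    return (-0.5 * D * np.log(2 * np.pi * v)
            - 0.5 * np.sum((x - post_mean) ** 2) / v)

z = rng.integers(0, K, N)   # random initial component assignments
for _ in range(20):         # collapsed Gibbs sweeps
    for i in range(N):
        z[i] = -1           # hold x_i out: this is the h^- in p(x_i | h^-)
        logp = np.array([np.log(np.sum(z == k) + alpha / K)
                         + log_pred(X[i], X[z == k]) for k in range(K)])
        p = np.exp(logp - logp.max())
        z[i] = rng.choice(K, p=p / p.sum())
```

After a few sweeps the two well-separated clusters land in different components; in the full model the same held-out conditional also drives the segmentation decisions.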