Unsupervised neural network based feature extraction using weak top-down constraints
Herman Kamper 1,2, Micha Elsner 3, Aren Jansen 4, Sharon Goldwater 2
1 CSTR and 2 ILCC, School of Informatics, University of Edinburgh, UK
3 Department of Linguistics, The Ohio State University, USA
4 HLTCOE and CLSP, Johns Hopkins University, USA
ICASSP 2015

Introduction
◮ Huge amounts of speech audio data are becoming available online.
◮ Even for severely under-resourced and endangered languages (e.g. unwritten), data is being collected.
◮ Generally this data is unlabelled.
◮ We want to build speech technology on available unlabelled data.
◮ Need unsupervised speech processing techniques.

Example application: query-by-example search
[Figure: spoken query]
What features should we use to represent the speech for such unsupervised tasks?

Supervised neural network feature extraction
◮ Input: speech frame(s), e.g. MFCCs, filterbanks
◮ Feature extractor (learned from data)
◮ Phone classifier (learned jointly)
◮ Output: predict phone states (e.g. ay, ey, k, v)
But what if we do not have phone class targets to train our network?

Weak supervision: unsupervised term discovery
[Figure: discovered word pairs]
Can we use these discovered word pairs to provide us with weak supervision?

Weak supervision: align the discovered word pairs
Use the correspondence idea from [Jansen et al., 2013].
[Figure: aligned frames of a discovered word pair]
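
The alignment step can be sketched as follows. This is a minimal illustration, assuming the word pair is aligned with dynamic time warping (DTW) using Euclidean frame distances and the standard three-way step pattern; the slides do not state these details, and the names `dtw_align`, `word_a`, and `word_b` are hypothetical.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two feature frames (lists of floats)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def dtw_align(seq_a, seq_b):
    """Align two frame sequences with DTW.

    Returns (total_cost, path), where path is a list of (i, j) index pairs
    saying that frame i of seq_a is aligned with frame j of seq_b.
    """
    n, m = len(seq_a), len(seq_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = euclidean(seq_a[i - 1], seq_b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])

    # Backtrack from the end of both sequences to recover the warping path.
    path = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [
            (cost[i - 1][j - 1], i - 1, j - 1),  # diagonal
            (cost[i - 1][j], i - 1, j),          # advance in seq_a only
            (cost[i][j - 1], i, j - 1),          # advance in seq_b only
        ]
        _, i, j = min(steps)
    path.reverse()
    return cost[n][m], path


# Toy one-dimensional "frames": word_b repeats its first frame once.
word_a = [[0.0], [1.0], [2.0]]
word_b = [[0.0], [0.0], [1.0], [2.0]]
total, path = dtw_align(word_a, word_b)

# Each aligned index pair becomes one (input frame, target frame) training pair.
frame_pairs = [(word_a[i], word_b[j]) for i, j in path]
```

Each entry of `frame_pairs` links a frame from one word to the frame it is aligned with in the other word, which is the kind of frame-level correspondence the later slides train on.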

Autoencoder (AE) neural network
◮ Input: speech frame
◮ Output: same as input
A normal autoencoder neural network is trained to reconstruct its input. This reconstruction criterion can be used to pretrain a deep neural network.
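
One common way to write the reconstruction criterion is shown below. It assumes a squared-error loss, which the slides do not state; the encoder $f_\theta$, decoder $g_\theta$, and frame index $t$ are illustrative symbols, not notation from the talk.

```latex
\mathcal{L}_{\mathrm{AE}}(\theta)
  = \sum_{t} \left\lVert \mathbf{x}_t - g_\theta\!\left(f_\theta(\mathbf{x}_t)\right) \right\rVert^2
```

Here $\mathbf{x}_t$ is an input speech frame and $f_\theta(\mathbf{x}_t)$ is the hidden representation from which the same frame is reconstructed.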

The correspondence autoencoder (cAE)
◮ Input: frame from one word
◮ Output: frame from other word in pair
The correspondence autoencoder (cAE) takes a frame from one word, and tries to reconstruct the corresponding frame from the other word in the pair. In this way we learn an unsupervised feature extractor using the weak word-pair supervision.
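
The idea can be sketched with a very small network. This is an illustration only, assuming one tanh hidden layer, a linear output layer, squared-error loss, and plain stochastic gradient descent; none of these choices, nor the class `TinyNet` and the toy frame pairs, come from the slides.

```python
import math
import random


class TinyNet:
    """One tanh hidden layer plus a linear output layer, trained with SGD."""

    def __init__(self, n_in, n_hidden, n_out, seed=0):
        rng = random.Random(seed)
        self.W1 = [[rng.uniform(-0.5, 0.5) for _ in range(n_in)] for _ in range(n_hidden)]
        self.b1 = [0.0] * n_hidden
        self.W2 = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden)] for _ in range(n_out)]
        self.b2 = [0.0] * n_out

    def encode(self, x):
        """Hidden activations: these play the role of the extracted features."""
        return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
                for row, b in zip(self.W1, self.b1)]

    def decode(self, h):
        return [sum(w * hj for w, hj in zip(row, h)) + b
                for row, b in zip(self.W2, self.b2)]

    def loss(self, x, target):
        y = self.decode(self.encode(x))
        return 0.5 * sum((yk - tk) ** 2 for yk, tk in zip(y, target))

    def train_step(self, x, target, lr=0.05):
        """One SGD step on 0.5 * ||decode(encode(x)) - target||^2."""
        h = self.encode(x)
        y = self.decode(h)
        err = [yk - tk for yk, tk in zip(y, target)]
        # Gradient w.r.t. hidden pre-activations, computed before W2 changes.
        dh = [sum(err[k] * self.W2[k][j] for k in range(len(err))) * (1.0 - h[j] ** 2)
              for j in range(len(h))]
        for k, e in enumerate(err):
            for j, hj in enumerate(h):
                self.W2[k][j] -= lr * e * hj
            self.b2[k] -= lr * e
        for j, d in enumerate(dh):
            for i, xi in enumerate(x):
                self.W1[j][i] -= lr * d * xi
            self.b1[j] -= lr * d


def mean_loss(net, pairs):
    return sum(net.loss(x, t) for x, t in pairs) / len(pairs)


# Hypothetical aligned frame pairs: (frame from one word, frame from the other).
pairs = [([0.0, 1.0], [0.1, 0.9]),
         ([1.0, 0.0], [0.9, 0.2]),
         ([0.5, 0.5], [0.4, 0.6])]

net = TinyNet(n_in=2, n_hidden=3, n_out=2)
initial_loss = mean_loss(net, pairs)
for _ in range(300):
    for x, target in pairs:
        net.train_step(x, target)   # cAE: the target is the *other* word's frame
final_loss = mean_loss(net, pairs)

features = net.encode([0.0, 1.0])   # hidden layer output used as features
```

Calling `train_step(x, x)` instead would give the ordinary autoencoder objective, so the only change in the cAE setting is which frame is used as the target.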

Complete unsupervised cAE training algorithm
(1) Train stacked autoencoder (pretraining) on the speech corpus.
(2) Run unsupervised term discovery on the speech corpus.
(3) Align word pair frames.
(4) Train correspondence autoencoder, with weights initialized from the stacked autoencoder of step (1).
Result: unsupervised feature extractor.

Evaluation of features: the same-different task
Word tokens: “apple” “pie” “grape” “apple” “apple” “like”
◮ Treat as query: “apple”
◮ Treat as terms to search: “pie” “grape” “apple” “apple” “like”
◮ Compute the DTW distance d_i between the query and each term. If d_i < threshold, predict same; otherwise predict different.

Term      DTW distance   Prediction   Correct?
“pie”     d_1            different    ✓
“grape”   d_2            same         ×
“apple”   d_3            same         ✓
“apple”   d_4            different    ×
“like”    d_N            different    ✓
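
The thresholding step above can be sketched as follows. The distance values and the threshold are made up so that the predictions match the slide's example; the function names are hypothetical, and the precision and recall summary is an added assumption about how the decisions might be scored, not something shown on the slide.

```python
def same_different(query_label, terms, threshold):
    """Threshold DTW distances into same/different predictions.

    terms is a list of (label, dtw_distance) pairs. Returns a list of
    (label, prediction, is_correct) tuples.
    """
    results = []
    for label, dist in terms:
        prediction = "same" if dist < threshold else "different"
        truth = "same" if label == query_label else "different"
        results.append((label, prediction, prediction == truth))
    return results


def precision_recall(results, query_label):
    """Precision and recall of the 'same' predictions."""
    true_pos = sum(1 for label, pred, _ in results
                   if pred == "same" and label == query_label)
    predicted_same = sum(1 for _, pred, _ in results if pred == "same")
    actual_same = sum(1 for label, _, _ in results if label == query_label)
    precision = true_pos / predicted_same if predicted_same else 0.0
    recall = true_pos / actual_same if actual_same else 0.0
    return precision, recall


# Made-up distances standing in for d_1, d_2, d_3, d_4, d_N.
terms = [("pie", 0.9), ("grape", 0.3), ("apple", 0.2), ("apple", 0.7), ("like", 0.8)]
results = same_different("apple", terms, threshold=0.5)
precision, recall = precision_recall(results, "apple")

for label, prediction, is_correct in results:
    print(label, prediction, "correct" if is_correct else "wrong")
```

Raising the threshold labels more terms as same and lowering it labels fewer, so a single threshold gives only one operating point for a given set of features.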