Deep convolutional acoustic word embeddings using word-pair side information
Herman Kamper¹, Weiran Wang², Karen Livescu²
¹ CSTR and ILCC, School of Informatics, University of Edinburgh, UK
² Toyota Technological Institute at Chicago, USA
ICASSP 2016
Introduction
◮ Most speech processing systems rely on deep architectures to classify speech frames into subword units (HMM triphone states).
◮ This requires a pronunciation dictionary for breaking words into subwords, and in many cases still makes frame-level independence assumptions.
◮ Some studies have started to reconsider whole words as the basic modelling unit [Heigold et al., 2012; Chen et al., 2015].
Segmental automatic speech recognition
Segmental conditional random field ASR [Maas et al., 2012]; whole-word lattice rescoring [Bengio and Heigold, 2014].
[Figure: example segmental lattices with whole-word features, e.g. "ran, f₁ = 1" and "Andrew, f₁ = 0".]
Segmental query-by-example search
From [Levin et al., 2015]:
[Fig. 1: Diagram of the S-RAILS audio search system. Search audio segments → LapEig → segment embeddings → NN index; query audio → LapEig → query embedding → query result(s).]
[Chen et al., 2015]: a similar scheme for "Okay Google" using LSTMs.
In this work, we also use a query-related task for evaluation.
Acoustic word embedding problem
Map a variable-duration speech segment Y to a single embedding x = f(Y) ∈ R^d in fixed d-dimensional space.
[Figure: two segments Y₁ and Y₂ mapped to points f(Y₁) and f(Y₂) in the embedding space.]
Reference vector method [Levin et al., 2013]
◮ Take a reference set Y_ref of m speech segments.
◮ For the segment we want to embed, y_{t1:t2}, compute its distance to each reference segment, giving a distance vector [Dist_1, Dist_2, ..., Dist_m] ∈ R^m.
◮ Apply dimensionality reduction by multiplying with a projection matrix P ∈ R^{m×d}.
◮ Embedding: x_i = f(y_{t1:t2}) ∈ R^d, in fixed d-dimensional space.
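A minimal sketch of this pipeline, assuming DTW alignment cost as the segment-level distance and a simple linear projection P (the specific distance measure and dimensionality-reduction method used by [Levin et al., 2013] are not spelled out on this slide, so both are assumptions here):

```python
import numpy as np

def dtw_distance(seg_a, seg_b):
    """DTW alignment cost between two segments, each of shape (num_frames, num_features)."""
    T_a, T_b = len(seg_a), len(seg_b)
    cost = np.full((T_a + 1, T_b + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, T_a + 1):
        for j in range(1, T_b + 1):
            d = np.linalg.norm(seg_a[i - 1] - seg_b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[T_a, T_b]

def reference_vector_embedding(segment, reference_set, P):
    """Embed a variable-length segment: distances to the m reference
    segments, followed by a projection P of shape (m, d)."""
    dists = np.array([dtw_distance(segment, ref) for ref in reference_set])  # vector in R^m
    return dists @ P  # embedding x = f(y) in R^d
```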
Word classification CNN [Bengio and Heigold, 2014]
◮ Input: word segment Y_i.
◮ n_conv convolutional layers, followed by max pooling over time.
◮ n_full fully connected layers; the final fully connected layer gives the embedding x_i = f(Y_i).
◮ Softmax output layer predicts the word label w_i (one-hot target 0 0 0 ··· 1 ··· 0 0).
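A minimal sketch of such a word classification CNN in PyTorch; the layer sizes, kernel widths, and number of layers are illustrative assumptions, not the configuration of [Bengio and Heigold, 2014]:

```python
import torch
import torch.nn as nn

class WordClassifierCNN(nn.Module):
    """Convolution over time on a (features, frames) segment, max pooling over
    time, fully connected layers, and a softmax output over word types."""

    def __init__(self, num_features=39, embed_dim=1024, num_word_types=1000):
        super().__init__()
        self.conv = nn.Sequential(                        # n_conv = 2 in this sketch
            nn.Conv1d(num_features, 96, kernel_size=9), nn.ReLU(),
            nn.Conv1d(96, 96, kernel_size=8), nn.ReLU(),
        )
        self.full = nn.Sequential(                        # n_full = 2 in this sketch
            nn.Linear(96, embed_dim), nn.ReLU(),
            nn.Linear(embed_dim, embed_dim), nn.ReLU(),
        )
        self.output = nn.Linear(embed_dim, num_word_types)

    def forward(self, segment):
        # segment: (batch, num_features, num_frames)
        h = self.conv(segment)
        h = h.max(dim=2).values           # max pooling over time -> fixed-size vector
        x = self.full(h)                  # embedding x_i = f(Y_i)
        logits = self.output(x)           # scores for softmax over word labels w_i
        return x, logits
```

The network is trained with a cross-entropy loss against the one-hot word label; at test time the softmax layer is discarded and x_i is used as the acoustic word embedding.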
Supervision and side information
◮ The word classifier CNN assumes a corpus of labelled word segments.
◮ In some cases these might not be available.
◮ A weaker form of supervision we sometimes have (e.g. [Thiollière et al., 2015]) is known word pairs:
S_train = { (m, n) : (Y_m, Y_n) are of the same type }
◮ This also aligns with the query / word discrimination task: do two speech segments contain instances of the same word? (We don't care about word identity.)
Can we use this weak supervision (sometimes called side information) to train an acoustic word embedding function f?
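A minimal sketch of how such a same-word pair set might be assembled when segment labels are available (the labels stand in for whatever pair-discovery procedure actually produced the side information; all names are illustrative):

```python
def build_same_word_pairs(segment_labels):
    """Return S_train = {(m, n) : segments m and n are of the same word type},
    given a list where segment_labels[i] is the type of segment i."""
    by_type = {}
    for idx, label in enumerate(segment_labels):
        by_type.setdefault(label, []).append(idx)
    pairs = []
    for indices in by_type.values():
        for i, m in enumerate(indices):
            for n in indices[i + 1:]:
                pairs.append((m, n))
    return pairs

# build_same_word_pairs(["apple", "pie", "apple", "grape", "apple"])
# -> [(0, 2), (0, 4), (2, 4)]
```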
Word similarity Siamese CNN
Use the idea of Siamese networks [Bromley et al., 1993]: two copies of the CNN with tied weights embed a pair of segments, x_1 = f(Y_1) and x_2 = f(Y_2), and a distance l(x_1, x_2) between the two embeddings is used for training.
Loss functions

The coscos² loss [Synnaeve et al., 2014]:

l_coscos²(x_1, x_2) = (1 − cos(x_1, x_2)) / 2   if same
                      cos²(x_1, x_2)            if different

Margin-based hinge loss [Mikolov, 2013]:

l_cos hinge = max { 0, m + d_cos(x_1, x_2) − d_cos(x_1, x_3) }

where d_cos(x_1, x_2) = (1 − cos(x_1, x_2)) / 2 is the cosine distance between x_1 and x_2, and m is a margin parameter. The pair (x_1, x_2) is of the same type and (x_1, x_3) of different types.
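A minimal sketch of the two losses, assuming NumPy vectors for the embeddings (a sanity-check implementation of the formulas above, not the training code; the margin value is an assumption):

```python
import numpy as np

def cos_sim(x1, x2):
    return np.dot(x1, x2) / (np.linalg.norm(x1) * np.linalg.norm(x2))

def cos_dist(x1, x2):
    """Cosine distance d_cos(x1, x2) = (1 - cos(x1, x2)) / 2, in [0, 1]."""
    return (1.0 - cos_sim(x1, x2)) / 2.0

def loss_coscos2(x1, x2, same):
    """coscos2 loss: (1 - cos)/2 for same-word pairs, cos^2 for different-word pairs."""
    if same:
        return (1.0 - cos_sim(x1, x2)) / 2.0
    return cos_sim(x1, x2) ** 2

def loss_cos_hinge(x1, x2, x3, margin=0.15):
    """Hinge loss: push the same-word pair (x1, x2) to be closer than the
    different-word pair (x1, x3) by at least the margin m."""
    return max(0.0, margin + cos_dist(x1, x2) - cos_dist(x1, x3))
```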
Embedding evaluation: the same-different task
Proposed in [Carlin et al., 2011] and also used in [Levin et al., 2013].
◮ Take a set of labelled word segments, e.g. "apple", "pie", "grape", "apple", "apple", "like".
◮ Treat one segment (e.g. an "apple" token) as the query, and the remaining segments ("pie", "grape", "apple", "apple", "like") as the terms to search.
◮ Compute the cosine distance d_i between the query embedding and each search-term embedding.
◮ If d_i < threshold, predict that the two segments are of the same word type; otherwise predict different.
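A minimal sketch of this evaluation, assuming precomputed embeddings and word labels. Sweeping the threshold over all pairwise distances and summarising with average precision is one common way to score the task (an assumption here; the slide only shows the thresholded decision):

```python
import numpy as np

def same_different_ap(embeddings, labels):
    """Average precision over all segment pairs: a pair is a 'same' trial if the
    two segments share a word label; pairs are ranked by cosine similarity."""
    n = len(embeddings)
    scores, targets = [], []
    for i in range(n):
        for j in range(i + 1, n):
            xi, xj = embeddings[i], embeddings[j]
            cos = np.dot(xi, xj) / (np.linalg.norm(xi) * np.linalg.norm(xj))
            scores.append(cos)                       # higher = more similar
            targets.append(labels[i] == labels[j])   # True for same-word pairs
    order = np.argsort(scores)[::-1]                 # rank pairs, most similar first
    targets = np.array(targets)[order]
    hit_ranks = np.nonzero(targets)[0] + 1           # 1-indexed ranks of same-word pairs
    precision_at_hits = np.cumsum(targets)[targets] / hit_ranks
    return precision_at_hits.mean()

# e.g. same_different_ap(embs, ["apple", "pie", "grape", "apple", "apple", "like"])
```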