Unsupervised Cross-Modal Alignment of Speech and Text Embedding Spaces
Yu-An Chung, Wei-Hung Weng, Schrasing Tong, James Glass
Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, Cambridge, Massachusetts, USA
NeurIPS 2018, Montréal, Québec, Canada, December 2018
Machine Translation (MT)
Training data: (French text, English text) pairs, e.g., ("le chat est noir", "the cat is black") → an MT system that outputs a translation.

Automatic Speech Recognition (ASR)
Training data: (English audio, English text) pairs, e.g., (audio, "dogs are cute") → an ASR system that outputs a transcription.

Text-to-Speech Synthesis (TTS)
Training data: (English text, English audio) pairs, e.g., ("cats are adorable", audio) → a TTS system that outputs synthesized audio.

All three tasks need parallel corpora for training → expensive to collect!
Framework

Start from two monolingual corpora: a speech corpus in Language 1 and a text corpus in Language 2 (e.g., Wikipedia: "Wikipedia is a multilingual, web-based, free encyclopedia based on a model of openly editable and viewable content, a wiki. It is the largest and most popular …"). The two corpora do not need to be parallel!

Speech2vec [Chung & Glass, 2018] embeds the words of the speech corpus:
  X = [x_1, x_2, …, x_n] ∈ ℝ^(d×n)

Word2vec [Mikolov et al., 2013] embeds the words of the text corpus:
  Y = [y_1, y_2, …, y_n] ∈ ℝ^(d×n)

MUSE [Lample et al., 2018] then learns a linear mapping W from the speech embedding space to the text embedding space:
  W* = argmin_{W ∈ ℝ^(d×d)} ‖WX − Y‖_F
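The objective above has a closed-form solution when W is constrained to be orthogonal; this is the Procrustes refinement step MUSE uses, with the unsupervised initialization itself coming from adversarial training. A minimal NumPy sketch of that Procrustes step, where the function name and toy data are illustrative rather than from the paper:

```python
import numpy as np

def procrustes(X, Y):
    """Closed-form solution of min_W ||WX - Y||_F with W orthogonal
    (the refinement step in MUSE). X, Y are (d, n) matrices whose
    columns are putatively matched speech/text embeddings."""
    # SVD of the cross-covariance matrix Y X^T gives W* = U V^T.
    U, _, Vt = np.linalg.svd(Y @ X.T)
    return U @ Vt

# Toy check: recover a random rotation from noisy matched pairs.
rng = np.random.default_rng(0)
d, n = 50, 1000
X = rng.standard_normal((d, n))
R = np.linalg.qr(rng.standard_normal((d, d)))[0]  # ground-truth orthogonal map
Y = R @ X + 0.01 * rng.standard_normal((d, n))
W = procrustes(X, Y)
print(np.linalg.norm(W - R))  # near zero: the mapping is recovered
```

Without the orthogonality constraint the objective is ordinary least squares; constraining W to be orthogonal preserves distances within the speech space and makes the solution robust to noisy seed pairs.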
Components

• Word2vec: learns distributed representations of words from a text corpus that model word semantics, in an unsupervised manner.
• Speech2vec: a speech version of word2vec that learns semantic word representations from a speech corpus; also unsupervised.
• MUSE: an unsupervised way to learn W, under the assumption that the two embedding spaces are approximately isomorphic. A text-side training sketch follows below.
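For the text side, standard tooling exists. A minimal sketch of training skip-gram word2vec with gensim; the toy corpus and hyperparameter values are illustrative only, and note there is no comparable one-liner for speech2vec, which is an RNN encoder-decoder trained on audio segments:

```python
from gensim.models import Word2Vec

# Tiny toy corpus; the paper's text side is a large monolingual corpus.
corpus = [
    ["the", "cat", "is", "black"],
    ["dogs", "are", "cute"],
    ["cats", "are", "adorable"],
]

# Skip-gram (sg=1); vector_size is the embedding dimension d.
# Parameter names follow gensim >= 4.0 (older versions use `size`).
model = Word2Vec(sentences=corpus, vector_size=50, window=2,
                 min_count=1, sg=1, epochs=50)

print(model.wv["cat"].shape)         # (50,) -- one column of Y
print(model.wv.most_similar("cat"))  # semantic neighbors in the text space
```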
Advantages

• Relies only on monolingual corpora of speech and text that:
  - do not need to be parallel
  - can be collected independently, greatly reducing human labeling effort
• The framework is unsupervised:
  - each component uses unsupervised learning
  - applicable to low-resource language pairs that lack bilingual resources
Usage of the learned W

#1: Unsupervised spoken word recognition (when Language 1 = Language 2, e.g., both English)
• Map the speech2vec embedding of an input spoken word (e.g., "dog") into the text space with W, then run a nearest-neighbor search over the text vocabulary → "dog", "dogs", "puppy", "pet", …
• An interesting property of our approach is synonym retrieval: the list of nearest neighbors contains both synonyms and different lexical forms of the input spoken word.
• Foundation for unsupervised automatic speech recognition.

#2: Unsupervised spoken word translation (when Language 1 ≠ Language 2, e.g., English to French)
• The same nearest-neighbor search, run against a French text space, maps the input spoken word "dog" to "chien", "chiot", …
• Foundation for unsupervised speech-to-text translation.

Both usages share the retrieval step sketched below.
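Both usages reduce to the same operation: map the spoken word's embedding with W, then search the text vocabulary. A minimal NumPy sketch using plain cosine nearest neighbors (MUSE-style systems often substitute CSLS retrieval to reduce hubness); function and variable names here are illustrative:

```python
import numpy as np

def nearest_neighbors(query, W, Y, vocab, k=5):
    """Map a speech2vec embedding into the text space with W, then
    return the k most cosine-similar words in the text vocabulary.

    query : (d,)   speech2vec embedding of the input spoken word
    W     : (d, d) learned cross-modal mapping
    Y     : (d, n) text embeddings, one column per word in `vocab`
    """
    q = W @ query
    q /= np.linalg.norm(q)
    Yn = Y / np.linalg.norm(Y, axis=0, keepdims=True)
    scores = Yn.T @ q                  # cosine similarity per word
    top = np.argsort(-scores)[:k]
    return [(vocab[i], float(scores[i])) for i in top]

# With an English text space this performs spoken word recognition
# ("dog" -> dog, dogs, puppy, ...); swapping in a French text space
# turns the same call into spoken word translation ("dog" -> chien, ...).
```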
Unsupervised Cross-Modal Alignment of Speech and Text Embedding Spaces
Poster #156, Room 210 & 230 AB, 10:45 AM – 12:45 PM