(Outrageously∗) Low-Resource Speech Processing
NLP @ Deep Learning Indaba, Kenya, 2019

Herman Kamper
E&E Engineering, Stellenbosch University, South Africa
http://www.kamperh.com/

∗ Title plagiarised from Jade Abbott’s DLI talk
Supervised speech recognition

(Example waveform with its transcription: “i had to think of some example speech since speech recognition is really cool”)
Unsupervised (“zero-resource”) speech processing

My problem: What can we learn if we do not have any labels?
Example: Query-by-example speech search

Given a spoken query, find the utterances in a collection of untranscribed speech that contain the query. A useful speech system, not requiring any transcribed speech.

[Jansen and Van Durme, Interspeech’12]
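Query-by-example search is commonly done by aligning acoustic feature sequences with dynamic time warping (DTW) and ranking utterances by alignment cost. Below is a minimal illustrative sketch; the function name and the length normalisation are my own choices, not from the talk:

```python
import numpy as np

def dtw_cost(query, search):
    """Length-normalised DTW alignment cost between two feature
    sequences, each an array of shape (frames, feature_dim)."""
    n, m = len(query), len(search)
    D = np.full((n + 1, m + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Local frame distance plus the cheapest way to get here
            c = np.linalg.norm(query[i - 1] - search[j - 1])
            D[i, j] = c + min(D[i - 1, j], D[i, j - 1], D[i - 1, j - 1])
    return D[n, m] / (n + m)
```

Ranking every utterance in the collection by this cost gives a hit list for the spoken query: low cost suggests the query occurs in that utterance.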
Outrageously low-resource = unsupervised speech processing (outline)

• Why is this problem so important? Will try to convince you that this is (one of) the most fundamental machine learning problems, with real, impactful applications
• What are the key ideas needed to tackle this problem? Hopefully you will get some useful tools
• What is still missing? What are the open problems and research questions which still need to be solved (according to me)
Why is this problem so important?
1. A fundamental machine learning problem

Problems in unsupervised speech processing:

• Learning useful representations from unlabelled speech
• Segmenting, clustering and discovering longer-spanning (word- or phrase-like) patterns
• A combined problem of perception, structure, and continuous and discrete variables

“The goal of machine learning is to develop methods that can automatically detect patterns in data . . .” — Murphy
“Extract important patterns and trends, and understand ‘what the data says’ . . .” — Hastie, Tibshirani, Friedman
“The problem of searching for patterns in data is . . . fundamental . . .” — Bishop
2. Universal speech technology

“Imagine a world in which every single human being can freely share in the sum of all knowledge.”
— Mission statement stolen from Laura Martinus, who stole it from the Wikimedia Foundation
https://15.wikipedia.org/endowment.html
2. Universal speech technology

UN Pulse Lab, Kampala
https://www.kpvu.org/post/turn-tune-transcribe-un-develops-radio-listening-tool
2. Universal speech technology

(Diagram: a live radio stream is preprocessed into speech. Proposed system: a CNN-DTW keyword spotter. Existing system: ASR produces lattices which are searched for keywords; keywords, timings and probabilities go into a database for human analysts.)

[Saeb et al., 2017; Menon et al., 2018]
2. Universal speech technology

[Renkens, PhD’18]
2. Universal speech technology

Linguistic and cultural documentation and preservation: http://www.stevenbird.net/
3. Understanding human language acquisition

• Cognitive modelling: try to uncover learning mechanisms in humans
• A model of human language acquisition can be probed easily
• Example applications:
  — Identify hearing loss early
  — Predict learning difficulties
  — How much do we need to talk to infants?

https://bergelsonlab.com/seedlings/
Three ideas to tackle these problems
1. Build in the (domain) knowledge we have

• Pushing the model in a direction: inductive bias, Bayesian priors, regularisation, data augmentation
• In unsupervised learning, this is all we have
• We know a lot about languages in general
• Example: Although speech sounds are produced differently in different languages, there are aspects which are shared
1. Build in the (domain) knowledge we have

Share representations across languages: (diagram of a multilingual network with MFCC + i-vector input, shared hidden layers, a bottleneck-feature (BNF) layer, and separate output layers for German, Korean and French)

[Hermann and Goldwater, 2018; Hermann et al., 2018; https://arxiv.org/abs/1811.04791]
1. Build in the (domain) knowledge we have

(Plots: average precision (%) against hours of training data, 0–200 hours, for ES, HA, HR, SV, TR and ZH, comparing the systems BNF 1, BNF 2, EN, cAE UTD and cAE gold.)

[Hermann and Goldwater, 2018; Hermann et al., 2018; https://arxiv.org/abs/1811.04791]
2. Compression

Autoencoder: the input x is encoded to a latent h and decoded to a reconstruction x̂.

Loss for a single training example: J = ||x − x̂||²
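The autoencoder loss can be computed directly. A toy numerical sketch with untrained linear encoder/decoder weights (all names and dimensions here are illustrative, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)

def autoencoder_loss(x, W_enc, W_dec):
    """Squared reconstruction error J = ||x - x_hat||^2 for one example,
    using a linear encoder and decoder."""
    h = W_enc @ x       # bottleneck code h (lower-dimensional than x)
    x_hat = W_dec @ h   # reconstruction x_hat
    return np.sum((x - x_hat) ** 2)

x = rng.standard_normal(10)
W_enc = rng.standard_normal((3, 10))   # 10-dim input -> 3-dim code
W_dec = rng.standard_normal((10, 3))
J = autoencoder_loss(x, W_enc, W_dec)
```

Training minimises J over many examples, forcing the bottleneck h to keep only the information needed to reconstruct the input.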
2. Compression

Vector-quantised variational autoencoder (VQ-VAE): a codebook e₁, . . ., e_K; the encoder output h is discretised by selecting the closest codebook vector:

z = e_k where k = argmin_{j=1..K} ||h − e_j||²

J = α ||x − x̂||² + ||sg(h) − e_k||² + β ||h − sg(e_k)||²
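The VQ-VAE selection rule and loss can be written out directly. Note that sg(·) (stop-gradient) only changes which terms receive gradients, so in a purely numerical sketch like this one the codebook and commitment terms coincide up to the β weight; the default α and β values below are illustrative:

```python
import numpy as np

def vq_vae_loss(x, x_hat, h, codebook, alpha=1.0, beta=0.25):
    """VQ-VAE loss for one example: reconstruction + codebook +
    commitment terms. codebook has shape (K, latent_dim)."""
    # Select the closest codebook vector: k = argmin_j ||h - e_j||^2
    k = int(np.argmin(np.sum((codebook - h) ** 2, axis=1)))
    e_k = codebook[k]
    recon = alpha * np.sum((x - x_hat) ** 2)      # alpha ||x - x_hat||^2
    codebook_term = np.sum((h - e_k) ** 2)        # ||sg(h) - e_k||^2
    commitment = beta * np.sum((h - e_k) ** 2)    # beta ||h - sg(e_k)||^2
    return recon + codebook_term + commitment, k
```

In a real implementation the two quantisation terms are kept separate because sg(·) stops gradients differently in each: the codebook term moves e_k toward the encoder output, while the commitment term keeps the encoder output close to its chosen code.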
2. Compression: An example from our group

(Diagram: MFCC input x₁:T is encoded to h₁:N, discretised, and embedded together with a speaker ID to give z₁:N; a decoder predicts filterbanks ŷ₁:T, and an FFTNet vocoder produces the output waveform. Synthesised examples in English and Indonesian. With Benjamin van Niekerk and André Nortje.)

https://arxiv.org/abs/1904.07556
3. Learning from multiple modalities

(Diagram: an image is passed through a VGG network with max pooling to give y_vis; the spoken input X is passed through convolutional and feedforward layers with max pooling to give y_spch; the two embeddings are compared with a distance d(y_vis, y_spch).)

[Harwath et al., NeurIPS’16]
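Models like this are typically trained so that a matching image/speech pair ends up closer in the embedding space than a mismatched pair. A hinge (triplet) loss is one common way to do this; the sketch below is illustrative and not necessarily the exact objective of Harwath et al.:

```python
import numpy as np

def triplet_hinge_loss(y_vis, y_spch_pos, y_spch_neg, margin=0.5):
    """Encourage d(image, matching speech) to be at least `margin`
    smaller than d(image, mismatched speech)."""
    d_pos = np.linalg.norm(y_vis - y_spch_pos)  # matching pair distance
    d_neg = np.linalg.norm(y_vis - y_spch_neg)  # mismatched pair distance
    return max(0.0, margin + d_pos - d_neg)
```

The loss is zero once the mismatched pair is sufficiently far away, so training effort concentrates on pairs that are still confused.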
3. Learning from multiple modalities

One-shot multimodal learning and matching: match a query against a support set across modalities (the example shows the word “two”). With Ryan Eloff and Leanne Nortje.

[Eloff et al., ICASSP’19; https://arxiv.org/abs/1811.03875]