Unsupervised speech representation learning using WaveNet autoencoders https://arxiv.org/abs/1901.08810 Jan Chorowski University of Wrocław 06.06.2019
Deep Model = Hierarchy of Concepts Cat Dog … Moon Banana M. Zeiler, “Visualizing and Understanding Convolutional Networks”
Deep Learning history: 2006 • 2006: Stacked RBMs. Hinton, Salakhutdinov, “Reducing the Dimensionality of Data with Neural Networks”
Deep Learning history: 2012 • 2012: AlexNet, SOTA on ImageNet, fully supervised training
Deep Learning Recipe
1. Get a massive, labeled dataset D = {(x, y)}:
– Computer vision: ImageNet, 1M images
– Machine translation: Europarl data, CommonCrawl, several million sentence pairs
– Speech recognition: 1000 h (LibriSpeech), 12000 h (Google Voice Search)
– Question answering: SQuAD, 150k questions with human answers
– …
2. Train the model to maximize log p(y|x)
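A minimal sketch of step 2 (assuming PyTorch; the toy linear classifier and random batch are stand-ins, not part of the slides): maximizing log p(y|x) for a softmax classifier is the same as minimizing cross-entropy on the labels.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Flatten(), nn.Linear(128 * 128 * 3, 1000))  # toy image classifier
loss_fn = nn.CrossEntropyLoss()                  # cross-entropy = -log p(y|x) for a softmax model
opt = torch.optim.SGD(model.parameters(), lr=0.1)

x = torch.randn(8, 3, 128, 128)                  # stand-in for a labeled batch from D = {(x, y)}
y = torch.randint(0, 1000, (8,))

loss = loss_fn(model(x), y)                      # average negative log-likelihood of the labels
loss.backward()
opt.step()
```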
Value of Labeled Data
• Labeled data is crucial for deep learning
• But labels carry little information:
– Example: an ImageNet model has 30M weights, but ImageNet is about 1M images from 1000 classes
– Labels: 1M × 10 bits = 10 Mbits
– Raw data (128 × 128 images): ca. 500 Gbits!
Value of Unlabeled Data “The brain has about 10^14 synapses and we only live for about 10^9 seconds. So we have a lot more parameters than data. This motivates the idea that we must do a lot of unsupervised learning since the perceptual input (including proprioception) is the only place we can get 10^5 dimensions of constraint per second.” Geoff Hinton https://www.reddit.com/r/MachineLearning/comments/2lmo0l/ama_geoffrey_hinton/
Unsupervised learning recipe 1. Get a massive, unlabeled dataset D = {x}. Easy, unlabeled data is nearly free. 2. Train the model to…??? What is the task? What is the loss function?
Unsupervised learning by modeling the data distribution. Train the model to minimize −log p(x). E.g. in 2D: • Let D = {x : x ∈ ℝ²} • Each point is a 2-dimensional vector • We can draw a point cloud • And fit some known distribution, e.g. a Gaussian
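A minimal sketch (assuming NumPy, with a synthetic 2-D point cloud) of fitting a Gaussian by maximum likelihood, i.e. minimizing the average −log p(x):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(loc=[1.0, -2.0], scale=[0.5, 1.5], size=(1000, 2))   # 2-D point cloud D = {x}

mu = X.mean(axis=0)                        # maximum-likelihood mean
cov = np.cov(X, rowvar=False, bias=True)   # maximum-likelihood covariance

# Average negative log-likelihood -log p(x) under the fitted Gaussian
d = X.shape[1]
diff = X - mu
quad = np.einsum('ni,ij,nj->n', diff, np.linalg.inv(cov), diff)
nll = 0.5 * (quad + np.log(np.linalg.det(cov)) + d * np.log(2 * np.pi)).mean()
print(mu, nll)
```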
Learning high-dimensional distributions is hard • Assume we work with small (32×32) images • Each data point is a real vector of size 32 × 32 × 3 • Data occupies only a tiny fraction of ℝ^(32×32×3) • Difficult to learn!
Autoregressive Models Decompose the probability of data points in ℝ^n into n conditional univariate probabilities: p(x) = p(x_1, x_2, …, x_n) = p(x_1) p(x_2|x_1) … p(x_n|x_1, x_2, …, x_{n−1}) = ∏_i p(x_i | x_{<i})
Autoregressive Example: Language modeling Let x be a sequence of word ids. p(x) = p(x_1, x_2, …, x_n) = ∏_i p(x_i | x_{<i}) ≈ ∏_i p(x_i | x_{i−k}, x_{i−k+1}, …, x_{i−1}) p(It’s a nice day) = p(It) · p(’s | It) · p(a | ’s) · … • Classical n-gram models: conditional probabilities estimated using counting • Neural models: conditional probabilities estimated using neural nets
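A toy illustration (plain Python, made-up corpus) of the counting approach: estimate bigram conditional probabilities from counts and score a sentence under the factorization above. No smoothing, so unseen bigrams would get probability zero.

```python
from collections import Counter
import math

corpus = "it is a nice day . it is a sunny day .".split()
unigram_counts = Counter(corpus)
bigram_counts = Counter(zip(corpus, corpus[1:]))

def cond_prob(word, prev):
    # p(word | prev) estimated by counting
    return bigram_counts[(prev, word)] / unigram_counts[prev]

sentence = "it is a nice day".split()
log_prob = sum(math.log(cond_prob(w, prev))
               for prev, w in zip(sentence, sentence[1:]))
print(log_prob)   # log p(sentence | first word) under the bigram factorization
```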
WaveNet: Autoregressive modeling of speech Treat speech as a sequence of samples! Predict each sample based on the previous ones. https://arxiv.org/abs/1609.03499
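A minimal sketch (assuming PyTorch) of the key ingredient, a stack of causal dilated 1-D convolutions; the channel sizes and depth are illustrative, and WaveNet's gated activations, residual connections and output distribution are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConv1d(nn.Module):
    def __init__(self, channels, kernel_size=2, dilation=1):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation          # left-pad only => causal
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                                # x: (batch, channels, time)
        return self.conv(F.pad(x, (self.pad, 0)))

stack = nn.Sequential(*[CausalConv1d(64, dilation=2 ** i) for i in range(6)])
x = torch.randn(1, 64, 16000)                            # 1 s of features at 16 kHz
y = stack(x)                                             # same length; each output only sees the past
```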
PixelRNN: A “language model for images” Pixels generated left-to-right, top-to-bottom. Cond. probabilities estimated using recurrent or convolutional neural networks. van den Oord, A., et al. “Pixel Recurrent Neural Networks.” ICML (2016).
PixelCNN samples Salimans et al, “A PixelCNN Implementation with Discretized Logistic Mixture Likelihood and Other Modifications”
Autoregressive Models Summary The good: - Simple to define (pick an ordering). - Often yield SOTA log-likelihoods. The bad: - Training and generation require O(n) ops. - No compact intermediate data representation; not obvious how to use for downstream tasks.
Latent Variable Models Intuition: to generate something complicated, do: 1. Sample something simple z ~ N(0, 1) 2. Transform it, e.g. x = z/10 + z/‖z‖
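A minimal sketch (assuming NumPy) of this two-step generation; the specific transform below, x = z/10 + z/‖z‖, which pushes a Gaussian blob onto a ring, is an illustrative choice.

```python
import numpy as np

rng = np.random.default_rng(0)
z = rng.normal(size=(1000, 2))                              # 1. sample something simple
x = z / 10 + z / np.linalg.norm(z, axis=1, keepdims=True)   # 2. transform it
# x now lies near the unit circle even though z is a plain Gaussian.
```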
Variational autoencoder: A neural latent variable model
Assume a 2-stage data generation process:
z ~ N(0, 1): the prior p(z), assumed to be simple
x ~ p(x|z): a complicated transformation implemented with a neural network
How to train this model? log p(x) = log ∫ p(x|z) p(z) dz
This is often intractable!
ELBO: A lower bound on log p(x)
Let q(z|x) be any distribution. We can show that
log p(x) = KL(q(z|x) ∥ p(z|x)) + E_{z~q(z|x)}[log p(x,z)/q(z|x)]
≥ E_{z~q(z|x)}[log p(x,z)/q(z|x)]
= E_{z~q(z|x)}[log p(x|z)] − KL(q(z|x) ∥ p(z))
The bound is tight for q(z|x) = p(z|x).
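For reference, the standard derivation of this bound written out in LaTeX (the usual textbook argument, stated in the slide's notation, not a new result):

```latex
\begin{align*}
\log p(x)
  &= \mathbb{E}_{z \sim q(z|x)}\big[\log p(x)\big]
   = \mathbb{E}_{z \sim q(z|x)}\Big[\log \tfrac{p(x,z)}{p(z|x)}\Big] \\
  &= \mathrm{KL}\big(q(z|x) \,\|\, p(z|x)\big)
   + \mathbb{E}_{z \sim q(z|x)}\Big[\log \tfrac{p(x,z)}{q(z|x)}\Big] \\
  &\ge \mathbb{E}_{z \sim q(z|x)}\Big[\log \tfrac{p(x,z)}{q(z|x)}\Big]
   = \mathbb{E}_{z \sim q(z|x)}\big[\log p(x|z)\big]
   - \mathrm{KL}\big(q(z|x) \,\|\, p(z)\big)
\end{align*}
```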
ELBO interpretation
ELBO, or evidence lower bound:
log p(x) ≥ E_{z~q(z|x)}[log p(x|z)] − KL(q(z|x) ∥ p(z))
where:
E_{z~q(z|x)}[log p(x|z)] is the reconstruction quality: how many nats we need to reconstruct x when someone gives us q(z|x)
KL(q(z|x) ∥ p(z)) is the code transmission cost: how many nats we transmit about x in q(z|x) rather than p(z)
Interpretation: do well at reconstructing x, while limiting the amount of information about x encoded in z.
The Variational Autoencoder
[Diagram: prior p(z), encoder q(z|x), decoder p(x|z), with the KL term and the reconstruction term E_{z~q(z|x)}[log p(x|z)]]
An input x is put through the q network to obtain a distribution over the latent code z, q(z|x). Samples z_1, …, z_k are drawn from q(z|x). Then k reconstructions p(x|z_k) are computed using the network p.
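A minimal sketch of this computation (assuming PyTorch; the toy dimensions, single linear encoder/decoder, and Gaussian output with fixed unit scale are simplifying assumptions): encode x into q(z|x), draw a reparameterized sample, decode, and combine the two ELBO terms into a loss.

```python
import torch
import torch.nn as nn

class ToyVAE(nn.Module):
    def __init__(self, x_dim=784, z_dim=16):
        super().__init__()
        self.enc = nn.Linear(x_dim, 2 * z_dim)   # q(z|x): predicts mean and log-variance
        self.dec = nn.Linear(z_dim, x_dim)       # p(x|z): Gaussian mean with fixed unit scale

    def forward(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=-1)
        z = mu + torch.randn_like(mu) * (0.5 * logvar).exp()       # reparameterized sample
        recon = self.dec(z)
        log_px_z = -0.5 * ((x - recon) ** 2).sum(-1)               # log p(x|z) up to a constant
        kl = 0.5 * (mu ** 2 + logvar.exp() - logvar - 1).sum(-1)   # KL(q(z|x) || N(0, I))
        return -(log_px_z - kl).mean()                             # negative ELBO

loss = ToyVAE()(torch.randn(8, 784))
loss.backward()
```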
VAE is an Information Bottleneck Each sample is represented as a Gaussian This discards information (latent representation has low precision)
VQVAE – deterministic quantization Limit the precision of the encoding by quantizing (round each vector to the nearest prototype). The output can be treated: - As a sequence of discrete prototype ids (tokens) - As a distributed representation (the prototypes themselves) Train using the straight-through estimator, with auxiliary losses: ‖sg(z_e) − e‖² (codebook) + β‖z_e − sg(e)‖² (commitment)
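A minimal sketch (assuming PyTorch; codebook size and feature dimensions are illustrative) of the quantization step: nearest-prototype lookup, the straight-through gradient copy, and the codebook/commitment auxiliary losses.

```python
import torch
import torch.nn.functional as F

def vector_quantize(z_e, codebook, beta=0.25):
    # z_e: (batch, time, dim) encoder outputs; codebook: (K, dim) prototype vectors
    dists = ((z_e.unsqueeze(-2) - codebook) ** 2).sum(-1)   # (batch, time, K) squared distances
    ids = dists.argmin(-1)                                  # discrete prototype ids (tokens)
    z_q = codebook[ids]                                     # distributed representation (prototypes)
    codebook_loss = F.mse_loss(z_q, z_e.detach())           # pull prototypes toward encoder outputs
    commit_loss = F.mse_loss(z_e, z_q.detach())             # keep encoder close to its prototypes
    z_q_st = z_e + (z_q - z_e).detach()                     # straight-through: gradient flows to z_e
    return z_q_st, ids, codebook_loss + beta * commit_loss

z_q, ids, aux_loss = vector_quantize(torch.randn(2, 50, 64), torch.randn(256, 64))
```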
VAEs and sequential data To encode a long sequence, we apply the VAE to chunks, obtaining a separate latent z for each chunk. But neighboring chunks are similar! We are encoding the same information in many z's! We are wasting capacity!
WaveNet + VAE
A WaveNet reconstructs the waveform using the information from the past. Latent representations z are extracted at regular intervals.
The WaveNet uses information from:
1. The past recording
2. The latent vectors z
3. Other conditioning, e.g. about the speaker
The encoder transmits in the z's only the information that is missing from the past recording.
The whole system is a very low bitrate codec (roughly 0.7 kbit/sec; the waveform is 16 kHz × 8 bit = 128 kbit/sec).
van den Oord et al., “Neural Discrete Representation Learning”
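A quick back-of-the-envelope check of those bitrates; the latent rate and codebook size below are assumptions chosen only to reproduce the rough 0.7 kbit/s figure, not the exact configuration of the model.

```python
import math

# Raw waveform bitrate from the slide: 16 kHz sampling, 8-bit samples
waveform_bps = 16_000 * 8                      # 128_000 bit/s = 128 kbit/s

# Latent bitrate: one discrete code per latent frame, log2(K) bits each.
# Assumed numbers: 50 latent frames per second, K = 2**14 prototypes.
latent_rate_hz = 50
num_prototypes = 2 ** 14
latent_bps = latent_rate_hz * math.log2(num_prototypes)   # 700 bit/s = 0.7 kbit/s

print(waveform_bps / latent_bps)               # roughly a 180x compression factor
```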
VAE + autoregressive models: latent collapse danger
• Purely autoregressive models: SOTA log-likelihoods
• Conditioning on latents: information passed through the bottleneck lowers the reconstruction cross-entropy
• In a standard VAE, the model actively tries to:
- reduce the information in the latents
- maximally use the autoregressive information
=> Collapse: the latents are not used!
• Solution: stop optimizing the KL term (free bits), or make it a hyperparameter (VQVAE)
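A minimal sketch (assuming PyTorch; the floor value is an illustrative hyperparameter) of the free-bits idea: stop penalizing the KL once it falls below a floor, so the optimizer has no incentive to squeeze the last nats out of the latents.

```python
import torch

def free_bits_kl(kl_per_dim, floor_nats=0.5):
    # kl_per_dim: (batch, z_dim) per-dimension KL(q(z|x) || p(z)) contributions.
    # Clamping to a floor removes the gradient pressure to push the KL all the way to 0.
    return torch.clamp(kl_per_dim, min=floor_nats).sum(-1).mean()

kl = torch.rand(8, 16, requires_grad=True) * 2.0
kl_loss = free_bits_kl(kl)    # dimensions with KL below 0.5 nat contribute no gradient
kl_loss.backward()
```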
Model description
WaveNet decoder conditioned on:
- latents extracted at 24-50 Hz
- the speaker
3 bottlenecks evaluated:
- Dimensionality reduction: max 32 bits/dim
- VAE: KL(q(z|x) ∥ N(0, 1)) nats (bits)
- VQVAE with K prototypes: log2(K) bits
Input: waveforms, Mel filterbanks, or MFCCs
Hope: the speaker is separated from the content. Proof: https://arxiv.org/abs/1805.09458
Representation probing points
We have inserted probing classifiers at 4 points in the network:
p_cond: several z codes mixed together using a convolution; the WaveNet uses it for conditioning
p_bn: the latent codes z
p_proj: the low-dimensional representation input to the bottleneck layer
p_enc: the high-dimensional representation coming out of the encoder
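A minimal sketch (assuming PyTorch; feature sizes, label set and the random tensors are stand-ins) of a probing classifier: a small frame-wise linear predictor trained on frozen activations taken from one of these points.

```python
import torch
import torch.nn as nn

feat_dim, num_phones = 64, 41                    # illustrative sizes
probe = nn.Linear(feat_dim, num_phones)          # the probe itself stays tiny
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

features = torch.randn(8, 100, feat_dim)         # frozen activations: (batch, frames, dim)
labels = torch.randint(0, num_phones, (8, 100))  # frame-wise phoneme targets

logits = probe(features)                         # (batch, frames, num_phones)
loss = loss_fn(logits.flatten(0, 1), labels.flatten())
loss.backward()
opt.step()
```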
Experimental Questions • What information is captured in the latent codes/probing points? • What is the role of the bottleneck layer? • Can we regularize the latent representation? • How to promote a segmentation? • How good is the representation on downstream tasks? • What design choices affect it? Chorowski et al. Unsupervised speech representation learning using WaveNet autoencoders
VQVAE Latent representation
What information is captured in the latent codes? For each probing point, we have trained predictors for: - Framewise phoneme prediction - Speaker prediction - Gender prediction - Mel filterbank reconstruction
Results
Phonemes vs Gender tradeoff