  1. Robust model training and generalisation with Studentising flows. Simon Alexanderson and Gustav Eje Henter, {simonal,ghe}@kth.se. Division of Speech, Music and Hearing (TMH), School of Electrical Engineering and Computer Science (EECS), KTH Royal Institute of Technology, Stockholm, Sweden. 2020-07-11.

  2. One-slide summary
  • We propose replacing Gaussian base distributions Z in normalising flows with multivariate Student's t-distributions: "Studentising flows"
  • Our proposal is motivated through statistical robustness
  • Experiments show that the proposal stabilises training and leads to better generalisation
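To make the proposal concrete, here is a minimal PyTorch sketch (not the authors' released code) of the one piece that changes in flow training: the base-distribution log-likelihood. The Gaussian log-density of the latent z = f(x) is replaced by that of a standard multivariate Student's t with ν degrees of freedom.

```python
import math
import torch

def student_t_log_prob(z: torch.Tensor, nu: float) -> torch.Tensor:
    """Log-density of a standard multivariate Student's t base
    distribution (zero mean, identity scale matrix).

    z:  (batch, D) latents from the flow's forward pass.
    nu: degrees of freedom; nu -> infinity recovers the Gaussian base.
    """
    D = z.shape[-1]
    sq = (z ** 2).sum(dim=-1)  # squared Mahalanobis distance ||z||^2
    const = (math.lgamma((nu + D) / 2) - math.lgamma(nu / 2)
             - 0.5 * D * math.log(nu * math.pi))
    return const - 0.5 * (nu + D) * torch.log1p(sq / nu)

# Flow NLL per example: -log p_Z(f(x)) - log|det Jacobian of f at x|;
# only the first term differs from a Gaussian-base flow.
```

As ν → ∞ the term 0.5 (ν + D) log(1 + ||z||²/ν) tends to 0.5 ||z||², recovering the Gaussian log-density up to its normaliser, so the Gaussian baseline is the limiting special case.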

  3. Outline
  • What is robustness?
  • Robustness sits in the tails
  • Tails of flow-based models
  • Experimental findings

  4–5. Why do we need robustness? Generate some 1D standard normal data and fit a Gaussian. [Figure: fitted density p(x) versus x; legend: Gauss.]

  6–7. Why do we need robustness? The fit changes if we add an outlying datapoint (red blob). [Figure: fitted density p(x) versus x; legend: Gauss.]

  8. Why do we need robustness? A fitted Student's t-distribution (red plot) is more concentrated. [Figure: fitted densities p(x) versus x; legend: Gauss., t(ν = 1.5)]

  9–14. Why do we need robustness? As the outlier is moved away, the Gaussian fit changes a lot. [Figure: animation frames of the fitted densities p(x) versus x as the outlier moves right; legend: Gauss., t(ν = 1.5)]

  15. Why do we need robustness? In contrast, the Student's t-distribution is statistically robust. [Figure: fitted densities p(x) versus x; legend: Gauss., t(ν = 1.5)]
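The demonstration on slides 4–15 is easy to reproduce with SciPy's maximum-likelihood fitters. This is a hedged sketch: the sample size, seed, and outlier position are assumptions, with only ν = 1.5 taken from the slides.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.standard_normal(25)        # 1D standard normal data, as on slide 4
corrupted = np.append(data, 10.0)     # one outlying datapoint far to the right

# Maximum-likelihood Gaussian fit: mean and spread chase the outlier
mu, sigma = stats.norm.fit(corrupted)

# Student's t fit with the degrees of freedom fixed at nu = 1.5
# (fdf pins the df shape parameter); the heavy tails absorb the outlier
df, loc, scale = stats.t.fit(corrupted, fdf=1.5)

print(f"Gaussian:    mu = {mu:.2f}, sigma = {sigma:.2f}")
print(f"Student's t: loc = {loc:.2f}, scale = {scale:.2f}")
# Moving the outlier further out changes (mu, sigma) a lot,
# while (loc, scale) stay nearly fixed: statistical robustness.
```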

  16. Robust statistics. A robust (resistant) estimator: adversarially corrupting a fraction η of the data (η < 1/2) has only a bounded effect on the estimated model parameters θ̂.
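One way to state this formally (a hedged paraphrase of the standard resistance property, not a definition quoted from the slides): for a clean dataset D of n points,

```latex
\[
\sup_{\tilde{\mathcal{D}}:\ \tilde{\mathcal{D}} \text{ differs from } \mathcal{D} \text{ in at most } \lfloor \eta n \rfloor \text{ points}}
\bigl\lVert \hat{\theta}(\tilde{\mathcal{D}}) - \hat{\theta}(\mathcal{D}) \bigr\rVert
\;\le\; B(\eta) \;<\; \infty
\qquad \text{for every } \eta < \tfrac{1}{2},
\]
```

where the bound B(η) does not depend on where the adversary places the corrupted points.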

  17. Why is Student's t robust? The probability density functions of Gaussians and Student's t-distributions look similar. [Figure: density curves; legend: Gauss., t(ν = 4), t(ν = 15)]

  18. Why is Student's t robust? The associated loss functions (the negative log-likelihood, or NLL) exhibit differences in the tails. [Figure: NLL curves; legend: Gauss., t(ν = 4), t(ν = 15)]
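Concretely, for standardised 1D versions of the two distributions, the per-datapoint losses are (with C and C_ν collecting the normalising constants):

```latex
\[
-\ln p_{\mathcal{N}}(x) \;=\; \tfrac{1}{2}x^{2} + C,
\qquad
-\ln p_{t_{\nu}}(x) \;=\; \tfrac{\nu+1}{2}\,\ln\!\left(1 + \tfrac{x^{2}}{\nu}\right) + C_{\nu}.
\]
```

The Gaussian loss grows quadratically in x, so a single far-out point can dominate the training objective, while the t loss grows only logarithmically.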

  19. Why is Student's t robust? The influence function is the gradient of the NLL. It quantifies the effect of outliers. For the t-distribution the influence function is bounded. [Figure: influence functions; legend: Gauss., t(ν = 4), t(ν = 15)]
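Differentiating the two losses above with respect to x makes the boundedness explicit:

```latex
\[
\frac{\mathrm{d}}{\mathrm{d}x}\,\tfrac{1}{2}x^{2} \;=\; x
\quad\text{(unbounded)},
\qquad
\frac{\mathrm{d}}{\mathrm{d}x}\,\tfrac{\nu+1}{2}\ln\!\left(1+\tfrac{x^{2}}{\nu}\right)
\;=\; \frac{(\nu+1)\,x}{\nu + x^{2}}
\quad\text{(bounded)}.
\]
```

The t influence peaks at x = ±√ν with magnitude (ν + 1)/(2√ν) and then decays, so an outlier's pull on the parameters cannot grow without limit.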

  20. Why is Student's t robust? Gradient clipping can also limit the influence of outliers, but need not converge on the maximum-likelihood model. [Figure: influence functions; legend: Gauss., t(ν = 4), t(ν = 15), Clipped Gauss.]
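For comparison, gradient clipping in a typical flow-training step looks as below. This is a hypothetical sketch: model, batch, and optimizer stand in for whatever flow implementation is in use, and model.nll is an assumed method, not an API from the paper. Clipping bounds each update, but the clipped vector field is no longer the gradient of any likelihood, which is why it need not converge on the maximum-likelihood model.

```python
import torch

def training_step(model, batch, optimizer, max_norm=1.0):
    """One optimisation step with clipped gradients; `model.nll`
    is a stand-in for the flow's negative log-likelihood."""
    loss = model.nll(batch)
    optimizer.zero_grad()
    loss.backward()
    # Rescale the whole gradient whenever its norm exceeds max_norm.
    # With a Studentising flow this safeguard is largely unnecessary,
    # since the loss itself already has bounded influence.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return loss.item()
```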

  21. Related work. Our findings complement those in concurrent work by Jaini et al. (2020)¹.
  • They show that Lipschitz-continuous triangular flows fθ(Z) with Gaussian base distributions Z cannot represent fat-tailed data (for example, Glow with sigmoid-transformed scale factors), and that using multivariate tν-distributions allows modelling data with fat tails.
  • We add to this: the advantages of tν-distributions can be understood through statistical robustness, and experimentally these benefits extend to bounded data (no fat tails).
  ¹ Jaini, P., Kobyzev, I., Yu, Y., and Brubaker, M. Tails of Lipschitz triangular flows. In Proc. ICML, 2020.

  22. Stable training. Training loss of Glow models of 64 × 64 CelebA data trained using Adam. The red configuration is unstable. [Figure: training loss over 1000 steps for four configurations: t(ν = 50), lr = 1e-3; Gauss. no grad-clip, lr = 1e-4; Gauss. w. grad-clip, lr = 1e-3; Gauss. no grad-clip, lr = 5e-4]

  23. Stable training. Reducing the learning rate (yellow), clipping gradients (green), or changing the base to a multivariate tν-distribution (blue) stabilises training. [Figure: same plot as slide 22]
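The four configurations on the slide could be set up along these lines. This is an illustrative sketch only: make_glow, celeba_batches, and model.nll are hypothetical helpers, the clipping threshold is assumed, and the identification of the red (unstable) run follows by elimination from the captions on slides 22–23.

```python
import torch

# Legend of slide 22, as (base, nu, learning rate, gradient clipping):
configs = [
    ("student_t", 50,   1e-3, False),  # blue:   stabilised via the t base
    ("gaussian",  None, 1e-4, False),  # yellow: stabilised via a lower lr
    ("gaussian",  None, 1e-3, True),   # green:  stabilised via grad clipping
    ("gaussian",  None, 5e-4, False),  # red:    unstable
]

for base, nu, lr, clip in configs:
    model = make_glow(base=base, nu=nu)               # hypothetical constructor
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for batch in celeba_batches():                    # hypothetical 64x64 loader
        loss = model.nll(batch)                       # hypothetical NLL method
        opt.zero_grad()
        loss.backward()
        if clip:
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # assumed threshold
        opt.step()
```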

  24–26. Better generalisation on image data. Test-set negative log-likelihood (NLL, lower is better) on MNIST with and without outliers from greyscale CIFAR-10. ν = ∞ is the Gaussian baseline; Δ is the difference from that baseline under the same train/test conditions.

                      Test: Clean                 Test: 1% outliers
  Train         ν = ∞    20     50    1000     ∞     20     50    1000
  Clean     NLL  1.16   1.13   1.13   1.17   1.63   1.27   1.26   1.31
            Δ    0     −0.03  −0.03   0.01   0     −0.36  −0.37  −0.32
  1% outl.  NLL  1.17   1.13   1.14   1.18   1.21   1.18   1.19   1.22
            Δ    0     −0.04  −0.03   0.01   0     −0.03  −0.02   0.01
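The 1%-outlier condition can be constructed as in the torchvision sketch below. This is illustrative rather than the authors' pipeline: the 28 × 28 resizing and the choice of which CIFAR-10 images serve as outliers are assumptions.

```python
import torch
from torchvision import datasets, transforms

# Turn CIFAR-10 images into MNIST-shaped greyscale outliers
to_grey28 = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),  # CIFAR-10 is RGB
    transforms.Resize((28, 28)),                  # match MNIST resolution
    transforms.ToTensor(),
])

mnist = datasets.MNIST("data", train=True, download=True,
                       transform=transforms.ToTensor())
cifar = datasets.CIFAR10("data", train=True, download=True,
                         transform=to_grey28)

n_outliers = len(mnist) // 100                    # the 1% condition
outliers = torch.utils.data.Subset(cifar, range(n_outliers))
contaminated = torch.utils.data.ConcatDataset([mnist, outliers])
# Density modelling ignores the class labels, so the mismatched
# label semantics of the two datasets do not matter here.
```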
