  1. Robust model training and generalisation with Studentising flows. Simon Alexanderson and Gustav Eje Henter, {simonal,ghe}@kth.se. Division of Speech, Music and Hearing (TMH), School of Electrical Engineering and Computer Science (EECS), KTH Royal Institute of Technology, Stockholm, Sweden. 2020-07-11.

  2. One-slide summary
  • We propose replacing Gaussian base distributions Z in normalising flows with multivariate Student's t-distributions: "Studentising flows"
  • Our proposal is motivated through statistical robustness
  • Experiments show that the proposal stabilises training and leads to better generalisation
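To make the proposal concrete, here is a minimal PyTorch sketch (not the authors' released code) of the one piece that changes in flow training: the base-distribution log-likelihood. The Gaussian log-density of the latent z = f(x) is replaced by that of a standard multivariate Student's t with ν degrees of freedom.

```python
import math
import torch

def student_t_log_prob(z: torch.Tensor, nu: float) -> torch.Tensor:
    """Log-density of a standard multivariate Student's t base
    distribution (zero mean, identity scale matrix).

    z:  (batch, D) latents from the flow's forward pass.
    nu: degrees of freedom; nu -> infinity recovers the Gaussian base.
    """
    D = z.shape[-1]
    sq = (z ** 2).sum(dim=-1)  # squared Mahalanobis distance ||z||^2
    const = (math.lgamma((nu + D) / 2) - math.lgamma(nu / 2)
             - 0.5 * D * math.log(nu * math.pi))
    return const - 0.5 * (nu + D) * torch.log1p(sq / nu)

# Flow NLL per example: -log p_Z(f(x)) - log|det Jacobian of f at x|;
# only the first term differs from a Gaussian-base flow.
```

As ν → ∞ the term 0.5 (ν + D) log(1 + ||z||²/ν) tends to 0.5 ||z||², recovering the Gaussian log-density up to its normaliser, so the Gaussian baseline is the limiting special case.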

  3. Outline
  • What is robustness?
  • Robustness sits in the tails
  • Tails of flow-based models
  • Experimental findings

  4–5. Why do we need robustness? Generate some 1D standard normal data and fit a Gaussian. [Figure: fitted density p(x) versus x; legend: Gauss.]

  6–7. Why do we need robustness? The fit changes if we add an outlying datapoint (red blob). [Figure: fitted density p(x) versus x; legend: Gauss.]

  8. Why do we need robustness? A fitted Student's t-distribution (red plot) is more concentrated. [Figure: fitted densities p(x) versus x; legend: Gauss., t(ν = 1.5)]

  9–14. Why do we need robustness? As the outlier is moved away, the Gaussian fit changes a lot. [Figure: animation frames of the fitted densities p(x) versus x as the outlier moves right; legend: Gauss., t(ν = 1.5)]

  15. Why do we need robustness? In contrast, the Student's t-distribution is statistically robust. [Figure: fitted densities p(x) versus x; legend: Gauss., t(ν = 1.5)]
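The demonstration on slides 4–15 is easy to reproduce with SciPy's maximum-likelihood fitters. This is a hedged sketch: the sample size, seed, and outlier position are assumptions, with only ν = 1.5 taken from the slides.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
data = rng.standard_normal(25)        # 1D standard normal data, as on slide 4
corrupted = np.append(data, 10.0)     # one outlying datapoint far to the right

# Maximum-likelihood Gaussian fit: mean and spread chase the outlier
mu, sigma = stats.norm.fit(corrupted)

# Student's t fit with the degrees of freedom fixed at nu = 1.5
# (fdf pins the df shape parameter); the heavy tails absorb the outlier
df, loc, scale = stats.t.fit(corrupted, fdf=1.5)

print(f"Gaussian:    mu = {mu:.2f}, sigma = {sigma:.2f}")
print(f"Student's t: loc = {loc:.2f}, scale = {scale:.2f}")
# Moving the outlier further out changes (mu, sigma) a lot,
# while (loc, scale) stay nearly fixed: statistical robustness.
```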

  16. Robust statistics. A robust (resistant) estimator: adversarially corrupting a fraction η of the data (η < 1/2) has only a bounded effect on the estimated model parameters θ̂.
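One way to state this formally (a hedged paraphrase of the standard resistance property, not a definition quoted from the slides): for a clean dataset D of n points,

```latex
\[
\sup_{\tilde{\mathcal{D}}:\ \tilde{\mathcal{D}} \text{ differs from } \mathcal{D} \text{ in at most } \lfloor \eta n \rfloor \text{ points}}
\bigl\lVert \hat{\theta}(\tilde{\mathcal{D}}) - \hat{\theta}(\mathcal{D}) \bigr\rVert
\;\le\; B(\eta) \;<\; \infty
\qquad \text{for every } \eta < \tfrac{1}{2},
\]
```

where the bound B(η) does not depend on where the adversary places the corrupted points.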

  17. Why is Student's t robust? The probability density functions of Gaussians and Student's t-distributions look similar. [Figure: density curves; legend: Gauss., t(ν = 4), t(ν = 15)]

  18. Why is Student's t robust? The associated loss functions (the negative log-likelihood, or NLL) exhibit differences in the tails. [Figure: NLL curves; legend: Gauss., t(ν = 4), t(ν = 15)]
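Concretely, for standardised 1D versions of the two distributions, the per-datapoint losses are (with C and C_ν collecting the normalising constants):

```latex
\[
-\ln p_{\mathcal{N}}(x) \;=\; \tfrac{1}{2}x^{2} + C,
\qquad
-\ln p_{t_{\nu}}(x) \;=\; \tfrac{\nu+1}{2}\,\ln\!\left(1 + \tfrac{x^{2}}{\nu}\right) + C_{\nu}.
\]
```

The Gaussian loss grows quadratically in x, so a single far-out point can dominate the training objective, while the t loss grows only logarithmically.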

  19. Why is Student's t robust? The influence function is the gradient of the NLL. It quantifies the effect of outliers. For the t-distribution the influence function is bounded. [Figure: influence functions; legend: Gauss., t(ν = 4), t(ν = 15)]
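Differentiating the two losses above with respect to x makes the boundedness explicit:

```latex
\[
\frac{\mathrm{d}}{\mathrm{d}x}\,\tfrac{1}{2}x^{2} \;=\; x
\quad\text{(unbounded)},
\qquad
\frac{\mathrm{d}}{\mathrm{d}x}\,\tfrac{\nu+1}{2}\ln\!\left(1+\tfrac{x^{2}}{\nu}\right)
\;=\; \frac{(\nu+1)\,x}{\nu + x^{2}}
\quad\text{(bounded)}.
\]
```

The t influence peaks at x = ±√ν with magnitude (ν + 1)/(2√ν) and then decays, so an outlier's pull on the parameters cannot grow without limit.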

  20. Why is Student's t robust? Gradient clipping can also limit the influence of outliers, but need not converge on the maximum-likelihood model. [Figure: influence functions; legend: Gauss., t(ν = 4), t(ν = 15), Clipped Gauss.]
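For comparison, gradient clipping in a typical flow-training step looks as below. This is a hypothetical sketch: model, batch, and optimizer stand in for whatever flow implementation is in use, and model.nll is an assumed method, not an API from the paper. Clipping bounds each update, but the clipped vector field is no longer the gradient of any likelihood, which is why it need not converge on the maximum-likelihood model.

```python
import torch

def training_step(model, batch, optimizer, max_norm=1.0):
    """One optimisation step with clipped gradients; `model.nll`
    is a stand-in for the flow's negative log-likelihood."""
    loss = model.nll(batch)
    optimizer.zero_grad()
    loss.backward()
    # Rescale the whole gradient whenever its norm exceeds max_norm.
    # With a Studentising flow this safeguard is largely unnecessary,
    # since the loss itself already has bounded influence.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm)
    optimizer.step()
    return loss.item()
```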

  21. Related work. Our findings complement those in concurrent work by Jaini et al. (2020)¹.
  • They show that Lipschitz-continuous triangular flows fθ(Z) with Gaussian base distributions Z cannot represent fat-tailed data (for example, Glow with sigmoid-transformed scale factors), and that using multivariate tν-distributions allows modelling data with fat tails.
  • We add to this: the advantages of tν-distributions can be understood through statistical robustness, and experimentally these benefits extend to bounded data (no fat tails).
  ¹ Jaini, P., Kobyzev, I., Yu, Y., and Brubaker, M. Tails of Lipschitz triangular flows. In Proc. ICML, 2020.

  22. Stable training. Training loss of Glow models of 64 × 64 CelebA data trained using Adam. The red configuration is unstable. [Figure: training loss over 1000 steps for four configurations: t(ν = 50), lr = 1e-3; Gauss. no grad-clip, lr = 1e-4; Gauss. w. grad-clip, lr = 1e-3; Gauss. no grad-clip, lr = 5e-4]

  23. Stable training. Reducing the learning rate (yellow), clipping gradients (green), or changing the base to a multivariate tν-distribution (blue) stabilises training. [Figure: same plot as slide 22]
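The four configurations on the slide could be set up along these lines. This is an illustrative sketch only: make_glow, celeba_batches, and model.nll are hypothetical helpers, the clipping threshold is assumed, and the identification of the red (unstable) run follows by elimination from the captions on slides 22–23.

```python
import torch

# Legend of slide 22, as (base, nu, learning rate, gradient clipping):
configs = [
    ("student_t", 50,   1e-3, False),  # blue:   stabilised via the t base
    ("gaussian",  None, 1e-4, False),  # yellow: stabilised via a lower lr
    ("gaussian",  None, 1e-3, True),   # green:  stabilised via grad clipping
    ("gaussian",  None, 5e-4, False),  # red:    unstable
]

for base, nu, lr, clip in configs:
    model = make_glow(base=base, nu=nu)               # hypothetical constructor
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    for batch in celeba_batches():                    # hypothetical 64x64 loader
        loss = model.nll(batch)                       # hypothetical NLL method
        opt.zero_grad()
        loss.backward()
        if clip:
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)  # assumed threshold
        opt.step()
```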

  24–26. Better generalisation on image data. Test-set negative log-likelihood (NLL, lower is better) on MNIST with and without outliers from greyscale CIFAR-10. ν = ∞ is the Gaussian baseline; Δ is the difference from that baseline under the same train/test conditions.

                      Test: Clean                 Test: 1% outliers
  Train         ν = ∞    20     50    1000     ∞     20     50    1000
  Clean     NLL  1.16   1.13   1.13   1.17   1.63   1.27   1.26   1.31
            Δ    0     −0.03  −0.03   0.01   0     −0.36  −0.37  −0.32
  1% outl.  NLL  1.17   1.13   1.14   1.18   1.21   1.18   1.19   1.22
            Δ    0     −0.04  −0.03   0.01   0     −0.03  −0.02   0.01
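The 1%-outlier condition can be constructed as in the torchvision sketch below. This is illustrative rather than the authors' pipeline: the 28 × 28 resizing and the choice of which CIFAR-10 images serve as outliers are assumptions.

```python
import torch
from torchvision import datasets, transforms

# Turn CIFAR-10 images into MNIST-shaped greyscale outliers
to_grey28 = transforms.Compose([
    transforms.Grayscale(num_output_channels=1),  # CIFAR-10 is RGB
    transforms.Resize((28, 28)),                  # match MNIST resolution
    transforms.ToTensor(),
])

mnist = datasets.MNIST("data", train=True, download=True,
                       transform=transforms.ToTensor())
cifar = datasets.CIFAR10("data", train=True, download=True,
                         transform=to_grey28)

n_outliers = len(mnist) // 100                    # the 1% condition
outliers = torch.utils.data.Subset(cifar, range(n_outliers))
contaminated = torch.utils.data.ConcatDataset([mnist, outliers])
# Density modelling ignores the class labels, so the mismatched
# label semantics of the two datasets do not matter here.
```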
