  1. Tighter risk certificates for (probabilistic) neural networks Omar Rivasplata o.rivasplata@cs.ucl.ac.uk 01 July 2020 UCL Centre for AI Slide 1 / 40

  2. The crew • María Pérez-Ortiz (UCL) • Yours truly (UCL / DeepMind) • Csaba Szepesvári (DeepMind) • John Shawe-Taylor (UCL) O. Rivasplata Slide 2 / 40

  3. Overview of this talk ⊲ Motivation ⊲ Classic NNs: weights ⊲ Probabilistic NNs: random weights ⊲ Highlights of experiments ⊲ Conclusions O. Rivasplata Slide 3 / 40

  4. What motivated this project O. Rivasplata Slide 4 / 40

  5. Blundell et al. (2015)
  • Variational Bayes: min_θ KL(q_θ(w) ‖ p(w | D))
  • Objective (the negative ELBO): f(θ) = E_{q_θ(w)}[log(1/p(D | w))] + KL(q_θ(w) ‖ p(w))
  • Algorithm: 'Bayes by Backprop'
  O. Rivasplata Slide 5 / 40
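
To make the 'Bayes by Backprop' recipe concrete, here is a minimal PyTorch-style sketch of the objective for a single linear layer, assuming a diagonal Gaussian posterior q_θ and a standard Gaussian prior; the class name, layer shape and one-sample estimate of the expectation are illustrative choices, not details taken from the talk or the paper.

```python
# Sketch of a Bayes-by-Backprop style objective for one linear layer:
# diagonal Gaussian posterior q_theta = N(mu, sigma^2), standard normal prior.
# Layer shape, initialisation and the one-sample estimate are illustrative.
import torch
import torch.nn.functional as F

class BayesLinear(torch.nn.Module):
    def __init__(self, d_in, d_out):
        super().__init__()
        self.mu = torch.nn.Parameter(torch.zeros(d_out, d_in))
        self.rho = torch.nn.Parameter(torch.full((d_out, d_in), -3.0))  # sigma = softplus(rho)

    def forward(self, x):
        sigma = F.softplus(self.rho)
        w = self.mu + sigma * torch.randn_like(sigma)   # reparameterised draw w ~ q_theta
        return x @ w.t()

    def kl_to_prior(self):
        # closed-form KL( N(mu, sigma^2) || N(0, 1) ), summed over all weights
        sigma = F.softplus(self.rho)
        return (0.5 * (sigma ** 2 + self.mu ** 2 - 1.0) - torch.log(sigma)).sum()

def bbb_objective(layer, x, y):
    # f(theta) = E_q[ log 1/p(D|w) ] + KL(q_theta || prior),
    # with the expectation estimated from the single weight draw inside forward()
    nll = F.cross_entropy(layer(x), y, reduction="sum")  # -log p(D|w) for labels y
    return nll + layer.kl_to_prior()
```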

  6. Thiemann et al. (2017)
  • PAC-Bayes-lambda: for λ ∈ (0, 2),
      E_{q_θ(w)}[L(w)] ≤ E_{q_θ(w)}[L̂_n(w, D)] / (1 − λ/2) + (KL(q_θ(w) ‖ p(w)) + C_n) / (n λ (1 − λ/2))
  • Algorithm: f(θ, λ) = RHS, alternating optimization over θ and λ
  O. Rivasplata Slide 6 / 40
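
For fixed Q the right-hand side is a simple scalar function of λ, and its minimiser over (0, 2) has a closed form, which is what makes the alternating scheme cheap. A small Python sketch, assuming the complexity term C_n is log(2√n/δ) (the form used on later slides) and that the averaged empirical risk and the KL term are already available as numbers; the function names are mine.

```python
# The PAC-Bayes-lambda bound as a function of (empirical risk, KL, n, delta, lambda),
# plus the closed-form minimiser over lambda used in the alternating scheme:
# fix lambda and take gradient steps on Q, then refresh lambda with best_lambda.
# C_n is taken to be log(2*sqrt(n)/delta); function names are illustrative.
import math

def pb_lambda_bound(emp_risk, kl, n, delta, lam):
    assert 0.0 < lam < 2.0
    c = math.log(2.0 * math.sqrt(n) / delta)
    return emp_risk / (1.0 - lam / 2.0) + (kl + c) / (n * lam * (1.0 - lam / 2.0))

def best_lambda(emp_risk, kl, n, delta):
    # minimiser of the right-hand side over lambda in (0, 2), for fixed Q
    c = math.log(2.0 * math.sqrt(n) / delta)
    return 2.0 / (math.sqrt(2.0 * n * emp_risk / (kl + c) + 1.0) + 1.0)

# invented numbers, only to show the calling convention
lam = best_lambda(0.05, 1500.0, 60000, 0.05)
print(lam, pb_lambda_bound(0.05, 1500.0, 60000, 0.05, lam))
```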

  7. Dziugaite & Roy (2017)
  • Optimized a classic PAC-Bayes bound
  • Experiments on 'binary MNIST' ([0-4] vs. [5-9])
  • Demonstrated non-vacuous risk bound values
  O. Rivasplata Slide 7 / 40

  8. Classic Neural Nets O. Rivasplata Slide 8 / 40

  9. What to achieve from data?
  Use the available data to:
  (1) learn a weight vector ŵ
  (2) certify ŵ's performance
  • split the data, part for (1) and part for (2)?
  • the whole of the data for (1) and (2) simultaneously? ⊲ self-certified learning!
  O. Rivasplata Slide 9 / 40

  10. Learning framework
  ALG : Z^n → W,   W → H
  • W ⊂ R^p : weight space
  • Z = X × Y : X = set of inputs, Y = set of labels
  • H : function class (predictors)
  ŵ = ALG(data),   h_ŵ : X → Y
  Data set: D = (Z_1, ..., Z_n) ∈ Z^n (e.g. training set), a finite sequence of input-label examples Z_i = (X_i, Y_i).
  O. Rivasplata Slide 10 / 40

  11. A measure of performance
  Empirical risk (in-sample error): L̂_n(w, D) = (1/n) ∑_{i=1}^n ℓ(w, Z_i)
  Tied to the choice of a loss function ℓ(w, z):
  • the square loss (regression)
  • the 0-1 loss (classification)
  • the cross-entropy loss (NN classification) ⊲ surrogate loss, nice properties
  O. Rivasplata Slide 11 / 40
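
For concreteness, a short numpy sketch of the empirical risk under the two classification losses mentioned above; the array names and shapes (per-example scores and integer labels) are illustrative assumptions.

```python
# Empirical risk (1/n) * sum_i loss(w, Z_i) under the two classification losses
# mentioned above; `scores` holds the model outputs for n examples (n x k array)
# and `y` holds the n integer labels. Names and shapes are illustrative.
import numpy as np

def zero_one_risk(scores, y):
    # 0-1 loss: fraction of misclassified examples
    return float(np.mean(np.argmax(scores, axis=1) != y))

def cross_entropy_risk(scores, y):
    # cross-entropy (surrogate) loss, averaged over the sample
    z = scores - scores.max(axis=1, keepdims=True)              # numerical stability
    logp = z - np.log(np.exp(z).sum(axis=1, keepdims=True))     # log-softmax
    return float(-np.mean(logp[np.arange(len(y)), y]))
```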

  12. Empirical Risk Minimization
  Training set error: L̂_trn(w) = (1/n_trn) ∑_{Z_i ∈ D_trn} ℓ(w, Z_i)
  ERM: ŵ ∈ arg min_w L̂_trn(w)
  Penalized ERM: ŵ ∈ arg min_w L̂_trn(w) + Reg(w)
  O. Rivasplata Slide 12 / 40
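
As a toy illustration of the difference between the two problems, here is a sketch with the square loss and a quadratic penalty Reg(w) = λ‖w‖², for which the penalized problem has the familiar ridge-regression solution; the synthetic data and the value of λ are made up.

```python
# Toy contrast between ERM and penalized ERM for the square loss on a linear
# model, where Reg(w) = lam * ||w||^2 gives the ridge-regression solution.
# The synthetic data and the value of lam are made up for illustration.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=100)

w_erm = np.linalg.lstsq(X, y, rcond=None)[0]                  # arg min of training error
lam = 1.0
w_pen = np.linalg.solve(X.T @ X + lam * np.eye(5), X.T @ y)   # penalized ERM
```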

  13. Generalization
  If the learned weight ŵ does well on the train set examples... will it still do well on unseen examples?
  O. Rivasplata Slide 13 / 40

  14. PAC Learning
  Data set: D = (Z_1, ..., Z_n) ∈ Z^n, a finite sequence of input-label examples Z_i = (X_i, Y_i).
  Assumptions:
  • A data-generating distribution P ∈ M_1(Z).
  • P is unknown, only the training set is given.
  • The input-label examples are i.i.d. ∼ P.
  Population risk (out-of-sample): L(w) = E[ℓ(w, Z)] = ∫_Z ℓ(w, z) dP(z)
  O. Rivasplata Slide 14 / 40

  15. Certifying performance: test set error
  Test set error: L̂_tst(ŵ) = (1/n_tst) ∑_{Z_i ∈ D_tst} ℓ(ŵ, Z_i)
  ⊲ ŵ obtained from the training set
  ⊲ test set not used for training
  ⊲ L̂_tst(ŵ) serves as an estimate of L(ŵ)
  ⊲ Note: L(ŵ) remains unknown!
  O. Rivasplata Slide 15 / 40

  16. Certifying performance: confidence bound
  Risk upper bound: For any given δ ∈ (0, 1), with probability at least 1 − δ over random datasets of size n, simultaneously for all w:
      L(w) ≤ L̂_n(w) + ε(n, δ)
  For ŵ = ALG(train set) this gives: L(ŵ) ≤ L̂_tst(ŵ) + ε(n_tst, δ)
  Recommended practice: ⊲ report the confidence bound together with your test set error estimate
  O. Rivasplata Slide 16 / 40
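
The slide leaves ε(n, δ) generic. One common instantiation, for a loss taking values in [0, 1] and a predictor ŵ that does not depend on the held-out test set, is the Hoeffding choice ε(n_tst, δ) = √(log(1/δ) / (2 n_tst)); the sketch below assumes exactly that choice.

```python
# Test-set certificate with the Hoeffding choice of epsilon(n_tst, delta),
# valid for a loss with values in [0, 1] and a fixed predictor w_hat
# that was learned without looking at the test set.
import math

def test_set_certificate(test_error, n_tst, delta=0.05):
    eps = math.sqrt(math.log(1.0 / delta) / (2.0 * n_tst))
    return test_error + eps

# e.g. 2% test error on 10,000 held-out examples, 95% confidence
print(test_set_certificate(0.02, 10_000))
```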

  17. Self-certified learning?
  Risk upper bound: For any given δ ∈ (0, 1), with probability at least 1 − δ over random datasets of size n, simultaneously for all w:
      L(w) ≤ L̂_n(w) + ε(n, δ)
  Alternative practice: find ŵ by minimizing the risk bound
  ⊲ a form of regularized ERM
  ⊲ the learned ŵ comes with its own risk certificate
  ⊲ best if the risk bound is non-vacuous, ideally tight!
  ⊲ may avoid the need for data splitting
  ⊲ may lead to self-certified learning!
  O. Rivasplata Slide 17 / 40

  18. Probabilistic Neural Nets O. Rivasplata Slide 18 / 40

  19. Randomized weights
  Based on data D, learn a distribution over weights: Q_D ∈ M_1(W), Q_D = ALG(train set).
  Predictions:
  • draw w ∼ Q_D and predict with the chosen w
  • each prediction with a fresh random draw
  The risk measures L(w) and L̂_n(w) are extended to Q by averaging:
      Q[L] ≡ ∫_W L(w) dQ(w) = E_{w∼Q}[L(w)]
      Q[L̂_n] ≡ ∫_W L̂_n(w) dQ(w) = E_{w∼Q}[L̂_n(w)]
  O. Rivasplata Slide 19 / 40
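
In practice the averaged empirical risk Q[L̂_n] is itself estimated by sampling weights from Q. A minimal numpy sketch, assuming a diagonal Gaussian Q and treating the map w ↦ L̂_n(w) as a black-box callable; the names and the number of Monte Carlo samples are illustrative.

```python
# Monte Carlo estimate of the averaged risk Q[L_hat_n] = E_{w~Q}[L_hat_n(w)]
# for a diagonal Gaussian Q = N(mu, diag(sigma^2)); `empirical_risk` is any
# callable w -> L_hat_n(w). All names and the sample count m are illustrative.
import numpy as np

def mc_averaged_risk(empirical_risk, mu, sigma, m=100, seed=0):
    rng = np.random.default_rng(seed)
    risks = []
    for _ in range(m):
        w = mu + sigma * rng.normal(size=mu.shape)   # fresh draw w ~ Q
        risks.append(empirical_risk(w))
    return float(np.mean(risks))
```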

  20. Two usual PAC-Bayes bounds
  Fix a 'prior' distribution Q_0. For any sample size n, for any confidence parameter δ ∈ (0, 1), with probability at least 1 − δ over random samples (of size n), simultaneously for all 'posterior' distributions Q:
      (PB-classic)  Q[L] ≤ Q[L̂_n] + √( (KL(Q ‖ Q_0) + log(2√n/δ)) / (2n) )
      (PB-kl)       kl( Q[L̂_n] ‖ Q[L] ) ≤ (KL(Q ‖ Q_0) + log(2√n/δ)) / n
  O. Rivasplata Slide 20 / 40
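
Both bounds are easy to evaluate once Q[L̂_n], KL(Q ‖ Q_0), n and δ are known as numbers. A sketch below; PB-kl is turned into a risk bound by the usual inversion of the binary kl via bisection. The figures passed in the last line are invented, purely to show the calling convention.

```python
# Evaluating PB-classic and PB-kl from the scalars Q[L_hat_n], KL(Q||Q_0), n, delta.
# PB-kl is turned into a risk bound by inverting the binary kl with bisection.
import math

def complexity(kl, n, delta):
    return (kl + math.log(2.0 * math.sqrt(n) / delta)) / n

def pb_classic(emp, kl, n, delta):
    return emp + math.sqrt(complexity(kl, n, delta) / 2.0)

def binary_kl(q, p):
    # kl(q || p) between Bernoulli(q) and Bernoulli(p)
    eps = 1e-12
    q, p = min(max(q, eps), 1 - eps), min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def pb_kl(emp, kl, n, delta, tol=1e-9):
    # largest p >= emp with kl(emp || p) <= (KL + log(2*sqrt(n)/delta)) / n
    c = complexity(kl, n, delta)
    lo, hi = emp, 1.0
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if binary_kl(emp, mid) > c:
            hi = mid
        else:
            lo = mid
    return lo

# invented numbers, only to show the calling convention
print(pb_classic(0.05, 1500.0, 60000, 0.05), pb_kl(0.05, 1500.0, 60000, 0.05))
```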

  21. Two more PAC-Bayes bounds
  Fix a distribution Q_0. For any size n, for any confidence δ ∈ (0, 1), with probability at least 1 − δ over random samples (of size n):
  PB-quad: simultaneously for all distributions Q,
      Q[L] ≤ ( √( Q[L̂_n] + (KL(Q ‖ Q_0) + log(2√n/δ)) / (2n) ) + √( (KL(Q ‖ Q_0) + log(2√n/δ)) / (2n) ) )²
  PB-lambda: simultaneously for all distributions Q and λ ∈ (0, 2),
      Q[L] ≤ Q[L̂_n] / (1 − λ/2) + (KL(Q ‖ Q_0) + log(2√n/δ)) / (n λ (1 − λ/2))
  O. Rivasplata Slide 21 / 40
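
The same kind of helper works for these two bounds: PB-quad needs no extra parameter, while PB-lambda is tightened by optimizing over λ ∈ (0, 2), either with the closed-form update sketched earlier or with a crude grid search as below. The input numbers are again invented.

```python
# Evaluating PB-quad and PB-lambda from the same scalars; PB-lambda is
# tightened over lambda in (0, 2) by a simple grid search here.
import math

def pb_quad(emp, kl, n, delta):
    r = (kl + math.log(2.0 * math.sqrt(n) / delta)) / (2.0 * n)
    return (math.sqrt(emp + r) + math.sqrt(r)) ** 2

def pb_lambda(emp, kl, n, delta, lam):
    c = math.log(2.0 * math.sqrt(n) / delta)
    return emp / (1.0 - lam / 2.0) + (kl + c) / (n * lam * (1.0 - lam / 2.0))

# invented numbers, only to show the calling convention
best = min(pb_lambda(0.05, 1500.0, 60000, 0.05, l / 100.0) for l in range(1, 200))
print(pb_quad(0.05, 1500.0, 60000, 0.05), best)
```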

  22. Cornerstone: change of measure inequality
  Donsker & Varadhan (1975), Csiszár (1975):
      KL(Q ‖ Q_0) = sup_{f : W → R} { Q[f] − log Q_0[e^f] }
  Let f : Z^n × W → R. For a given Q_0:
      Q[f(D, w)] ≤ KL(Q ‖ Q_0) + log Q_0[e^{f(D, w)}]
  Apply Markov's inequality to Q_0[e^{f(D, w)}]: with probability ≥ 1 − δ over the random draw of D ∼ P^n, simultaneously for all distributions Q,
      Q[f(D, w)] ≤ KL(Q ‖ Q_0) + log P^n[Q_0[e^{f(D, w)}]] + log(1/δ)
  Use with a suitable f, and upper-bound the exponential moment P^n[Q_0[e^{f(D, w)}]].
  O. Rivasplata Slide 22 / 40
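
The variational formula is easy to sanity-check numerically on a toy discrete weight space: for any f, Q[f] − log Q_0[e^f] never exceeds KL(Q ‖ Q_0), with equality at f = log(dQ/dQ_0). A small numpy check under those assumptions, with random toy distributions standing in for Q and Q_0:

```python
# Numerical check of the change-of-measure inequality on a toy discrete
# weight space with 5 "weights": Q[f] <= KL(Q||Q_0) + log Q_0[e^f] for any f,
# with equality at f = log(dQ/dQ_0). Distributions here are random toys.
import numpy as np

rng = np.random.default_rng(0)
q0 = rng.dirichlet(np.ones(5))                 # "prior" Q_0
q = rng.dirichlet(np.ones(5))                  # "posterior" Q
kl = float(np.sum(q * np.log(q / q0)))

for _ in range(1000):
    f = rng.normal(scale=3.0, size=5)          # an arbitrary function on W
    assert q @ f <= kl + np.log(q0 @ np.exp(f)) + 1e-9

f_star = np.log(q / q0)                        # the maximising f
print(float(q @ f_star), kl + float(np.log(q0 @ np.exp(f_star))))   # both equal KL
```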

  23. Using a PAC-Bayes bound
  Use your favourite ALG to find Q_D = ALG(train set), and plug Q_D into the PAC-Bayes bound to certify its risk:
      Q_D[L] ≤ Q_D[L̂_n] + √( (KL(Q_D ‖ Q_0) + log(2√n/δ)) / (2n) )
  Or use the PAC-Bayes bound itself as a training objective:
      Q_D ∈ arg min_Q { Q[L̂_n] + √( (KL(Q ‖ Q_0) + log(2√n/δ)) / (2n) ) }
  Note: both uses illustrated here with PB-classic, but the same can be done with PB-quad or PB-lambda (or any other).
  O. Rivasplata Slide 23 / 40

  24. Training objectives (Q[L̂_n^ce] denotes the averaged empirical cross-entropy risk)
      f_classic(Q) = Q[L̂_n^ce] + √( (KL(Q ‖ Q_0) + log(2√n/δ)) / (2n) )
      f_quad(Q) = ( √( Q[L̂_n^ce] + (KL(Q ‖ Q_0) + log(2√n/δ)) / (2n) ) + √( (KL(Q ‖ Q_0) + log(2√n/δ)) / (2n) ) )²
      f_lambda(Q, λ) = Q[L̂_n^ce] / (1 − λ/2) + (KL(Q ‖ Q_0) + log(2√n/δ)) / (n λ (1 − λ/2))
  O. Rivasplata Slide 24 / 40
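
Written as code, the three objectives differ only in how they combine the same two quantities. A PyTorch sketch, assuming the empirical cross-entropy term is a reparameterised (differentiable) estimate of Q[L̂_n^ce] and the KL term is available in closed form, both as tensors; how those two tensors are produced from the network is not shown here.

```python
# The three training objectives as differentiable torch expressions.
# `emp_ce` is assumed to be a (reparameterised, differentiable) estimate of
# Q[L_hat_n^ce] and `kl` the closed-form KL(Q||Q_0); both are torch tensors.
import math
import torch

def f_classic(emp_ce, kl, n, delta):
    c = kl + math.log(2.0 * math.sqrt(n) / delta)
    return emp_ce + torch.sqrt(c / (2.0 * n))

def f_quad(emp_ce, kl, n, delta):
    r = (kl + math.log(2.0 * math.sqrt(n) / delta)) / (2.0 * n)
    return (torch.sqrt(emp_ce + r) + torch.sqrt(r)) ** 2

def f_lambda(emp_ce, kl, n, delta, lam):
    c = kl + math.log(2.0 * math.sqrt(n) / delta)
    return emp_ce / (1.0 - lam / 2.0) + c / (n * lam * (1.0 - lam / 2.0))

# typical usage in a training step (model and optimizer code omitted):
#   loss = f_quad(emp_ce, kl, n, delta); loss.backward(); optimizer.step()
```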

  25. Experiments O. Rivasplata Slide 25 / 40

  26. PAC-Bayes with Backprop O. Rivasplata Slide 26 / 40
