
Understanding Black-box Predictions - PowerPoint PPT Presentation



  1. Understanding Black-box Predictions via Influence Functions. Pang Wei Koh & Percy Liang. Presented by Theo, Aditya, Patrick.

  2. Roadmap: 1. Influence functions: definitions and theory 2. Efficiently calculating influence functions 3. Validations 4. Use cases

  3. Approach. Reviving an "old technique" from robust statistics: the influence function. Cook & Weisberg (1980): regression models "can be strongly influenced by a few cases and reflect unusual features of those cases rather than the overall relationships between the variables"; find those influential points. What is the influence of a training point on a model? "Explain the model through the lens of its training data."

  4. Approach. A bit of formalism:
1. Training points $z_1, \dots, z_n$, with $z_i = (x_i, y_i) \in \mathcal{X} \times \mathcal{Y}$ (e.g. $\mathbb{R}^d \times \mathbb{R}$), $i \le n$.
2. Risk function $R: \Theta \to \mathbb{R}$, $\theta \mapsto \frac{1}{n}\sum_i L(z_i, \theta)$.
3. Empirical risk minimizer $\hat{\theta} = \arg\min_{\theta} R(\theta)$.
4. Hessian function (applied at $\hat{\theta}$): $H_{\hat{\theta}} = \nabla^2_\theta R(\hat{\theta})$.
What do we actually need? Let's assume smoothness and regularity for $F: \mathbb{R}^p \to \mathbb{R}$:
1. Taylor expansion: $F(x + h) = F(x) + \nabla F(x) \cdot h + o(\|h\|)$.
2. Chain rule: $\nabla (F \circ g)(x) = \nabla F(g(x)) \cdot \nabla g(x)$.
Landau notation: $o(\|h\|)$ denotes a term that goes to 0 faster than $\|h\|$ as $h \to 0$.
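To make these objects concrete, here is a minimal numpy sketch (our own illustration, not code from the paper), assuming a binary logistic regression with labels $y \in \{-1, +1\}$; the helper names loss, grad, empirical_risk and hessian are ours and are reused in the sketches below.

```python
import numpy as np

# Minimal sketch: unregularized logistic regression with y in {-1, +1}.
# X: (n, d) feature matrix, y: (n,) labels, theta: (d,) parameters.

def loss(theta, x, y):
    # L(z, theta) = log(1 + exp(-y * theta^T x)), computed stably.
    return np.logaddexp(0.0, -y * (x @ theta))

def grad(theta, x, y):
    # nabla_theta L(z, theta) = -sigma(-y * theta^T x) * y * x
    s = 1.0 / (1.0 + np.exp(y * (x @ theta)))  # sigma(-y theta^T x)
    return -s * y * x

def empirical_risk(theta, X, y):
    # R(theta) = (1/n) sum_i L(z_i, theta)
    return np.mean([loss(theta, xi, yi) for xi, yi in zip(X, y)])

def hessian(theta, X, y):
    # H_theta = (1/n) sum_i p_i (1 - p_i) x_i x_i^T, with p_i = sigma(y_i theta^T x_i)
    n, d = X.shape
    H = np.zeros((d, d))
    for xi, yi in zip(X, y):
        p = 1.0 / (1.0 + np.exp(-yi * (xi @ theta)))
        H += p * (1.0 - p) * np.outer(xi, xi)
    return H / n
```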

  5. How would the model's prediction change if we did not have this training point? Formally, introduce a perturbation in terms of the loss:
$\hat{\theta}_{\epsilon,z} = \arg\min_{\theta} \frac{1}{n}\sum_i L(z_i, \theta) + \epsilon L(z, \theta)$.
We are interested in the parameter change when removing $z$:
$\hat{\theta}_{-z} = \arg\min_{\theta} \frac{1}{n}\sum_{z_i \neq z} L(z_i, \theta)$, which corresponds to $\hat{\theta}_{\epsilon,z}$ with $\epsilon = -\frac{1}{n}$.
Change the entire parameters and retrain? Costly. Much easier to have a simple approximation:
$\hat{\theta}_{\epsilon,z} = \hat{\theta} + \frac{d\hat{\theta}_{\epsilon,z}}{d\epsilon}\Big|_{\epsilon=0} \cdot \epsilon + o(\epsilon)$.

  6. How would the model's prediction change if we did not have this training point? We need access to
$\mathcal{I}_{\text{up,params}}(z) \coloneqq \frac{d\hat{\theta}_{\epsilon,z}}{d\epsilon}\Big|_{\epsilon=0}$.
MAGIC: $\frac{d\hat{\theta}_{\epsilon,z}}{d\epsilon}\Big|_{\epsilon=0} = \dots = -H_{\hat{\theta}}^{-1}\, \nabla_\theta L(z, \hat{\theta})$.
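Building on the logistic-regression sketch above, a hedged sketch of $\mathcal{I}_{\text{up,params}}$; the damping argument is our own addition for numerical stability, not part of the slide's formula.

```python
def influence_up_params(theta_hat, X, y, z_x, z_y, damping=0.0):
    # I_up,params(z) = -H^{-1} * grad_theta L(z, theta_hat)
    d = X.shape[1]
    H = hessian(theta_hat, X, y) + damping * np.eye(d)
    return -np.linalg.solve(H, grad(theta_hat, z_x, z_y))

# Removing z shifts the parameters by roughly -(1/n) * I_up,params(z):
# theta_minus_z ≈ theta_hat - (1/n) * influence_up_params(theta_hat, X, y, z_x, z_y)
```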

  7. Influence function derivation. Under the hood: the risk function is TWICE-DIFFERENTIABLE and CONVEX, so the Hessian is a (diagonalizable) positive-definite matrix. Derivation: first-order optimality condition for $\hat{\theta}_{\epsilon,z}$:
$\nabla_\theta R(\hat{\theta}_{\epsilon,z}) + \epsilon \, \nabla_\theta L(z, \hat{\theta}_{\epsilon,z}) = 0$.
Write $\hat{\theta}_{\epsilon,z} = \hat{\theta} + \Delta_\epsilon$ and Taylor-expand each term ($F(x+h) \approx F(x) + F'(x) \cdot h$):
$\nabla_\theta R(\hat{\theta}) + \nabla^2_\theta R(\hat{\theta})\, \Delta_\epsilon + \epsilon \{\nabla_\theta L(z, \hat{\theta}) + \nabla^2_\theta L(z, \hat{\theta})\, \Delta_\epsilon\} + o(\|\Delta_\epsilon\|) = 0$.
Since $\nabla_\theta R(\hat{\theta}) = 0$, this is a linear system in $\Delta_\epsilon$; dropping higher-order terms and letting $\epsilon \to 0$:
$\Delta_\epsilon \approx -\left(\nabla^2_\theta R(\hat{\theta})\right)^{-1} \nabla_\theta L(z, \hat{\theta})\, \epsilon$, i.e. $\frac{d\hat{\theta}_{\epsilon,z}}{d\epsilon}\Big|_{\epsilon=0} = -H_{\hat{\theta}}^{-1} \nabla_\theta L(z, \hat{\theta})$.

  8. Effect on a test point. Why again $\mathcal{I}_{\text{up,params}}(z) \coloneqq \frac{d\hat{\theta}_{\epsilon,z}}{d\epsilon}\Big|_{\epsilon=0}$? Explicit formula: $\hat{\theta}_{-z} - \hat{\theta} \approx -\frac{1}{n}\,\mathcal{I}_{\text{up,params}}(z)$. How to compare this change of weights across training points? Through the loss at a test point:
$\mathcal{I}_{\text{up,loss}}(z, z_{\text{test}}) \coloneqq \frac{dL(z_{\text{test}}, \hat{\theta}_{\epsilon,z})}{d\epsilon}\Big|_{\epsilon=0}$
MAGIC + CHAIN RULE: $= \dots = -\nabla_\theta L(z_{\text{test}}, \hat{\theta})^\top \, H_{\hat{\theta}}^{-1} \, \nabla_\theta L(z, \hat{\theta})$.
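A direct, hedged sketch of $\mathcal{I}_{\text{up,loss}}$ using the helpers above; it solves the full $d \times d$ system, so it only illustrates the formula (the scalable estimators come later in the deck).

```python
def influence_up_loss(theta_hat, X, y, z, z_test, damping=0.0):
    # I_up,loss(z, z_test) = -grad L(z_test)^T H^{-1} grad L(z)
    (zx, zy), (tx, ty) = z, z_test
    d = X.shape[1]
    H = hessian(theta_hat, X, y) + damping * np.eye(d)
    h_inv_grad = np.linalg.solve(H, grad(theta_hat, zx, zy))
    return -grad(theta_hat, tx, ty) @ h_inv_grad
```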

  9. Interpreting the upweight. Debugging, understanding a model? What does it mean? $\mathcal{I}_{\text{up,loss}}(z, z_{\text{test}}) > 0$ means upweighting $z$ increases the test loss; removing $z$ (i.e. $\epsilon = -\frac{1}{n}$) would therefore lower it. Geometric interpretation:
$-\nabla_\theta L(z_{\text{test}}, \hat{\theta})^\top \, H_{\hat{\theta}}^{-1} \, \nabla_\theta L(z, \hat{\theta})$
is (minus) an inner product between the loss gradients $\nabla_\theta L(z_{\text{test}}, \hat{\theta})$ and $\nabla_\theta L(z, \hat{\theta})$, under a new geometry defined by $H_{\hat{\theta}}^{-1}$, i.e. based on our model's weights.

  10. What do we mean by influence? How does it relate to the 'influence' of a point? For a logistic regression model, $p(y \mid x) = \sigma(y\,\theta^\top x)$ with $y \in \{-1, 1\}$, the formula becomes explicit:
$\mathcal{I}_{\text{up,loss}}(z, z_{\text{test}}) = -y_{\text{test}}\, y \cdot \sigma(-y_{\text{test}}\,\theta^\top x_{\text{test}}) \cdot \sigma(-y\,\theta^\top x) \cdot x_{\text{test}}^\top H_{\hat{\theta}}^{-1} x$.
• Influence grows with the training loss of $z$ (the $\sigma(-y\,\theta^\top x)$ factor).
• The inverse Hessian acts as a "resistance" matrix between $x$ and $x_{\text{test}}$.
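A sketch of this closed form for logistic regression (helper name ours); H_inv is assumed to be the precomputed (possibly damped) inverse of the empirical risk Hessian.

```python
def influence_up_loss_logistic(theta_hat, H_inv, z, z_test):
    # -y_test * y * sigma(-y_test theta^T x_test) * sigma(-y theta^T x) * x_test^T H^{-1} x
    (x, y), (x_t, y_t) = z, z_test
    sig = lambda t: 1.0 / (1.0 + np.exp(-t))
    return (-y_t * y
            * sig(-y_t * (x_t @ theta_hat))
            * sig(-y * (x @ theta_hat))
            * (x_t @ H_inv @ x))
```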

  11. Efficiently calculating the influence. Ideally we could determine the influence of a training point $z_i$ by leaving out $z_i$, retraining, and assessing the resulting model on $z_{\text{test}}$ (EXPENSIVE). We instead use the cheaper approximation
$\mathcal{I}_{\text{up,loss}}(z, z_{\text{test}}) = -\nabla_\theta L(z_{\text{test}}, \hat{\theta})^\top \, H_{\hat{\theta}}^{-1} \, \nabla_\theta L(z, \hat{\theta})$.
Yay?

  12. Calculating the inverse Hessian.
$\mathcal{I}_{\text{up,loss}}(z, z_{\text{test}}) = -\nabla_\theta L(z_{\text{test}}, \hat{\theta})^\top \, H_{\hat{\theta}}^{-1} \, \nabla_\theta L(z, \hat{\theta})$
The inverse Hessian $H_{\hat{\theta}}^{-1}$ is still quite expensive to compute: with $n$ training samples and a model with $\theta \in \mathbb{R}^p$, forming and inverting $H_{\hat{\theta}}$ is $O(np^2 + p^3)$. Remember, we want $\mathcal{I}_{\text{up,loss}}(z, z_{\text{test}})$ (and thus $H_{\hat{\theta}}^{-1} \nabla_\theta L$) for each training point.

  13. Inverse Hessian estimation. The key quantity is $s_{\text{test}} \coloneqq H_{\hat{\theta}}^{-1} \nabla_\theta L(z_{\text{test}}, \hat{\theta})$, after which $\mathcal{I}_{\text{up,loss}}(z, z_{\text{test}}) = -s_{\text{test}}^\top \nabla_\theta L(z, \hat{\theta})$ for any training point $z$. Three tools: Conjugate Gradients, the Pearlmutter "trick" (Hessian-vector products), and Stochastic Estimation.

  14. Pearlmutter 'trick'. For the Hessian $H_\theta$ and an arbitrary vector $v \in \mathbb{R}^d$, we can calculate $H_\theta v$ without explicitly forming $H_\theta$; the operation costs on the order of a gradient evaluation, $O(d)$. If $r$ is very small and $H_\theta$ is the Hessian of the function $L$, we can use a central-difference approximation to formulate $H_\theta v$:
$H_\theta v \approx \frac{\nabla L(\theta + r v) - \nabla L(\theta - r v)}{2r}$.
Pearlmutter's trick computes the same product exactly via a second differentiation pass, and so is more robust to the numerical errors caused by a small $r$.
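A hedged sketch of the central-difference Hessian-vector product described on this slide, reusing the grad helper above; in practice the exact Pearlmutter product is obtained with automatic differentiation (e.g. a double backward pass), which is not shown here.

```python
def risk_grad(theta, X, y):
    # Gradient of the empirical risk: (1/n) sum_i grad L(z_i, theta)
    return np.mean([grad(theta, xi, yi) for xi, yi in zip(X, y)], axis=0)

def hvp_central_difference(grad_fn, theta, v, r=1e-4):
    # H(theta) v ≈ (grad(theta + r v) - grad(theta - r v)) / (2 r)
    return (grad_fn(theta + r * v) - grad_fn(theta - r * v)) / (2.0 * r)

# Usage: hvp = hvp_central_difference(lambda t: risk_grad(t, X, y), theta_hat, v)
```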

  15. Conjugate Gradient. Now that we can compute $H_{\hat{\theta}} v$, we want to efficiently construct $H_{\hat{\theta}}^{-1} v$. If we minimize $f(t) = \frac{1}{2} t^\top H_{\hat{\theta}}\, t - v^\top t$, we'll find $H_{\hat{\theta}}^{-1} v$: at $t_{\min}$, $0 = \nabla f = H_{\hat{\theta}}\, t_{\min} - v$, meaning $t_{\min} = H_{\hat{\theta}}^{-1} v$.

  16. Conjugate Gradient algorithm. Start with $x_0 \in \mathbb{R}^p$ and set $r_0 = d_0 = -\nabla f(x_0) = v - H x_0$. Then:
• $\alpha_k = \frac{r_k^\top r_k}{d_k^\top H d_k}$
• $x_{k+1} = x_k + \alpha_k d_k$, with $r_{k+1} = -\nabla f(x_{k+1}) = r_k - \alpha_k H d_k$
• $\beta_{k+1} = \frac{r_{k+1}^\top r_{k+1}}{r_k^\top r_k}$
• $d_{k+1} = r_{k+1} + \beta_{k+1} d_k$
Repeat (at most $p$ times)!
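A minimal sketch of this CG loop for solving $Hx = v$ using only Hessian-vector products (helper names ours; scipy.sparse.linalg.cg with a LinearOperator would work just as well).

```python
def conjugate_gradient(hvp_fn, v, max_iter=100, tol=1e-8):
    # Solve H x = v given only hvp_fn(u) = H u.
    x = np.zeros_like(v)
    r = v - hvp_fn(x)      # r_0 = d_0 = -grad f(x_0)
    d = r.copy()
    rs_old = r @ r
    for _ in range(max_iter):
        Hd = hvp_fn(d)
        alpha = rs_old / (d @ Hd)
        x += alpha * d
        r -= alpha * Hd
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        d = r + (rs_new / rs_old) * d
        rs_old = rs_new
    return x

# Usage sketch:
# s_test = conjugate_gradient(
#     lambda u: hvp_central_difference(lambda t: risk_grad(t, X, y), theta_hat, u),
#     grad(theta_hat, x_test, y_test))
```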

  17. Problems with CG. There are problems with CG, in particular for models with many parameters. Each iteration requires a Hessian-vector product over the training set, and in principle we do up to $p$ iterations. As a result, the authors suggest another approximation algorithm for inverting the Hessian: stochastic estimation.

  18. Stochastic estimation. Using a Taylor (Neumann) expansion $H^{-1} \equiv \sum_{i=0}^{\infty} (I - H)^i$ and recasting it recursively, we have
$H_j^{-1} = I + (I - H)\, H_{j-1}^{-1}$.
This suggests a sampling algorithm to estimate the inverse Hessian based on expectations:
• Sample a training point $z_{s_j}$ uniformly.
• Use the sample's Hessian as an unbiased estimate of $H$ in the recursion to update $H_j^{-1} v$.
Repeat until the estimate stabilizes (and average over several runs)!
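A hedged, LiSSA-style sketch of this recursion (helper name, scale and damping values are ours, not from the slide), reusing the logistic per-example Hessian from the earlier helpers.

```python
def stochastic_inverse_hvp(theta_hat, X, y, v, num_repeats=4,
                           recursion_depth=1000, scale=10.0, damping=0.0):
    # Recursion: h_0 = v;  h_j = v + (I - (H_sample + damping*I)/scale) h_{j-1},
    # which tends (in expectation) to scale * H^{-1} v when the spectral
    # radius of I - H/scale is below 1.
    n = X.shape[0]
    estimates = []
    for _ in range(num_repeats):
        h = v.copy()
        for _ in range(recursion_depth):
            i = np.random.randint(n)
            xi, yi = X[i], y[i]
            # Single-example Hessian-vector product for the logistic loss:
            p = 1.0 / (1.0 + np.exp(-yi * (xi @ theta_hat)))
            Hv = p * (1.0 - p) * xi * (xi @ h) + damping * h
            h = v + h - Hv / scale
        estimates.append(h / scale)
    return np.mean(estimates, axis=0)
```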

  19. Experimental results (MNIST, zoomed in). [Figure: for a given test image, the most influential training images according to the influence function, compared to... Euclidean distance / the raw inner product $x_{\text{test}}^\top x$?]

  20. Experimental results. [Figure: three panels, each annotated with the logistic-regression influence
$-y_{\text{test}}\, y \cdot \sigma(-y_{\text{test}}\,\theta^\top x_{\text{test}}) \cdot \sigma(-y\,\theta^\top x) \cdot x_{\text{test}}^\top H_{\hat{\theta}}^{-1} x$.]

  21. Comparison with leave-one-out (logistic). Trained basic logistic regression on MNIST. For a given misclassified $z_{\text{test}}$: computed $\mathcal{I}_{\text{up,loss}}(z, z_{\text{test}})$ for every $z \in z_{\text{train}}$. For the top 500, compared $-\frac{1}{n}\,\mathcal{I}_{\text{up,loss}}(z, z_{\text{test}})$ against the change in test loss with $z$ removed and the model retrained. Tested with both conjugate gradient (left) and stochastic estimation (right).
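A hedged sketch of the leave-one-out check described here; train_fn is a hypothetical fitting routine that returns parameters, and the actual change it measures should track $-\frac{1}{n}\,\mathcal{I}_{\text{up,loss}}$.

```python
def loo_loss_change(train_fn, X, y, i, z_test):
    # Actual leave-one-out effect: retrain without example i and measure the
    # change in test loss. train_fn(X, y) is assumed to return fitted parameters.
    theta_full = train_fn(X, y)
    theta_minus = train_fn(np.delete(X, i, axis=0), np.delete(y, i))
    tx, ty = z_test
    return loss(theta_minus, tx, ty) - loss(theta_full, tx, ty)
```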

  22. Comparison with leave-one-out, non-convexity (CNN). For a non-convex example trained with SGD: take the output $\tilde{\theta}$ of SGD (a local, not global, optimum) and replace the loss with a second-order convex approximation around it:
$\tilde{L}(z, \theta) = L(z, \tilde{\theta}) + \nabla_\theta L(z, \tilde{\theta})^\top (\theta - \tilde{\theta}) + \frac{1}{2} (\theta - \tilde{\theta})^\top (H_{\tilde{\theta}} + \lambda I)(\theta - \tilde{\theta})$,
where $\tilde{\theta}$ is the resulting parameters from SGD and $\lambda$ is a damping term added if $H_{\tilde{\theta}}$ is not positive definite (convexifying it). Claim: if $\tilde{\theta}$ is close to the true optimum, then this approximation is close to a Newton step. It heavily relies on $\tilde{\theta}$ being close to the true optimum (no clarification on how close).
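A small sketch of this convex quadratic approximation, reusing the earlier helpers; the damping value is illustrative, and $H_{\tilde{\theta}}$ here is taken from the hessian helper above.

```python
def quadratic_loss_approx(theta, theta_tilde, X, y, z, damping=0.01):
    # ~L(z, theta) = L(z, theta~) + grad L(z, theta~)^T (theta - theta~)
    #              + 1/2 (theta - theta~)^T (H_theta~ + lambda*I)(theta - theta~)
    zx, zy = z
    delta = theta - theta_tilde
    H = hessian(theta_tilde, X, y) + damping * np.eye(theta.shape[0])
    return (loss(theta_tilde, zx, zy)
            + grad(theta_tilde, zx, zy) @ delta
            + 0.5 * delta @ H @ delta)
```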

  23. Comparison with leave-one-out, non-convexity (CNN). Compute $\mathcal{I}_{\text{up,loss}}(z, z_{\text{test}})$ with the new (convexified) loss and then see how well it compares to leave-one-out. Tested on a CNN: influence-function predictions compared against actual leave-one-out retraining (right). Pearson correlation = 0.86, a respectably high correlation.

  24. Non-differentiable losses.
$\text{Hinge}(s) = \max(0,\, 1 - s)$
$\text{SmoothHinge}(s, t) = t \log\!\left(1 + \exp\!\left(\frac{1 - s}{t}\right)\right)$
• Key idea: train the initial model on your non-differentiable loss, then use a smooth approximation of it to compute the influence.
• Scalable?
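A short sketch of the two losses for reference (helper names ours); np.logaddexp keeps the smooth hinge numerically stable for small $t$.

```python
def hinge(s):
    # Hinge(s) = max(0, 1 - s)
    return np.maximum(0.0, 1.0 - s)

def smooth_hinge(s, t=1e-3):
    # SmoothHinge(s, t) = t * log(1 + exp((1 - s) / t)); approaches the hinge as t -> 0.
    return t * np.logaddexp(0.0, (1.0 - s) / t)
```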

  25. Understanding model behaviour. Task: classifying Dog vs. Fish. Two models:
• a logistic regression model on top of Inception v1 features;
• an RBF-kernel SVM.

  26. Understanding Model Behaviour

  27. How would the model's predictions change if a training input were modified? Formally, introduce the perturbation $z_\delta = (x + \delta,\, y)$; no need for $\delta$ to be infinitesimal. New optimal parameters, new risk function:
$\hat{\theta}_{\epsilon, z_\delta, -z} = \arg\min_{\theta} \left\{ \frac{1}{n}\sum_i L(z_i, \theta) + \epsilon L(z_\delta, \theta) - \epsilon L(z, \theta) \right\}$.
Under the hood? Exactly the same derivation. Explicit formula:
$\hat{\theta}_{z_\delta,-z} - \hat{\theta} \approx -\frac{1}{n}\left( \mathcal{I}_{\text{up,params}}(z_\delta) - \mathcal{I}_{\text{up,params}}(z) \right)$.
One final approximation if we make $\delta$ infinitesimal:
$\mathcal{I}_{\text{pert,loss}}(z, z_{\text{test}})^\top \coloneqq \nabla_\delta L\bigl(z_{\text{test}}, \hat{\theta}_{z_\delta,-z}\bigr)^\top \Big|_{\delta = 0}$
MAGIC: $= \dots = -\nabla_\theta L(z_{\text{test}}, \hat{\theta})^\top H_{\hat{\theta}}^{-1}\, \nabla_x \nabla_\theta L(z, \hat{\theta})$.
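A hedged sketch of $\mathcal{I}_{\text{pert,loss}}$ that sidesteps the mixed derivative $\nabla_x \nabla_\theta L$ by differencing the influence_up_loss helper in the input $x$ (our own shortcut for illustration; it costs $d$ Hessian solves and is not how one would implement this at scale).

```python
def influence_pert_loss(theta_hat, X, y, z, z_test, eps=1e-5, damping=0.0):
    # d/dx of I_up,loss(z, z_test) = -grad L(z_test)^T H^{-1} grad_x grad_theta L(z),
    # approximated coordinate-wise by central differences in x.
    x, yl = z
    out = np.zeros_like(x)
    for j in range(x.shape[0]):
        e = np.zeros_like(x)
        e[j] = eps
        plus = influence_up_loss(theta_hat, X, y, (x + e, yl), z_test, damping)
        minus = influence_up_loss(theta_hat, X, y, (x - e, yl), z_test, damping)
        out[j] = (plus - minus) / (2.0 * eps)
    return out
```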
