Interaction Effects: Helpful or Harmful?
Ben Lengerich
CMU AI Seminar, Feb 18, 2020
Today
1. What is an Interaction Effect?
2. Interaction Effects in Neural Networks

Based on:
• Purifying Interaction Effects with the Functional ANOVA. AISTATS 2020.
  ◦ Lengerich, Tan, Chang, Hooker, Caruana
• On Dropout, Overfitting, and Interaction Effects in Deep Neural Networks. Under Review, 2020.
  ◦ Lengerich, Xing, Caruana
Why do we care about interaction effects?
• Interpreting models
• Identifiability
• Understanding how big machine learning models work
What is an Interaction Effect?
Intuitively: “Effect of one variable changes based on the value of another variable.”
But this definition is incomplete: 3 stories.
Is “AND” an Interaction Effect?
Suppose we have data Y = AND(X_1, X_2) with Boolean X_1, X_2. Let’s fit an additive model (no interactions):
Y = f_0 + f_1(X_1) + f_2(X_2)
How well can we fit the data? Perfectly*!
[Figure: Y plotted over Boolean X_1, X_2 and the additive fit.]
Is Multiplication an Interaction?
Common model: Y = a + bX_1 + cX_2 + dX_1X_2
But this is equivalent to:
Y = (a − dαβ) + (b + dβ)X_1 + (c + dα)X_2 + d(X_1 − α)(X_2 − β)
We can pick any offsets α, β without changing the function output. Picking different values of α, β drastically changes the interpretation.
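As a quick sanity check, here is a minimal numeric sketch (the coefficients a, b, c, d and offsets α, β below are arbitrary illustrative values, not from the slides) showing that the two parameterizations compute identical outputs while attributing very different amounts of variance to the “interaction” term:

```python
import numpy as np

rng = np.random.default_rng(0)
a, b, c, d = 1.0, 2.0, -0.5, 3.0     # arbitrary illustrative coefficients
alpha, beta = 0.7, -1.3              # arbitrary offsets
X1, X2 = rng.normal(size=10_000), rng.normal(size=10_000)

y_original = a + b * X1 + c * X2 + d * X1 * X2
y_offset = ((a - d * alpha * beta)
            + (b + d * beta) * X1
            + (c + d * alpha) * X2
            + d * (X1 - alpha) * (X2 - beta))

# The two forms are the same function...
assert np.allclose(y_original, y_offset)

# ...but the variance credited to the "interaction" term depends on the offsets.
print(np.var(d * X1 * X2))                      # interaction term with offsets (0, 0)
print(np.var(d * (X1 - alpha) * (X2 - beta)))   # interaction term with offsets (alpha, beta)
```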
Is Multiplication an Interaction?
Y = (a − dαβ) + (b + dβ)X_1 + (c + dα)X_2 + d(X_1 − α)(X_2 − β)
Picking different values of α, β drastically changes the interpretation.
[Figure: the same function rendered as a 100% interaction effect vs. a 20% interaction effect for different choices of α, β.]
Is Multiplication an Interaction? Mean-Center?
• Does mean-centering solve this problem?
• No: if the correlation ρ(X_1, X_2) is not zero, then we can’t simultaneously center X_1, X_2, and X_1X_2.
• Choosing which term to center changes the interpretation!
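A small sketch of this point, assuming jointly Gaussian inputs with an arbitrary illustrative correlation: after mean-centering X_1 and X_2, the product X_1X_2 still has a nonzero mean, so some choice about where that mass goes is unavoidable.

```python
import numpy as np

rng = np.random.default_rng(0)
rho = 0.8                                     # illustrative correlation
cov = [[1.0, rho], [rho, 1.0]]
X = rng.multivariate_normal([0.0, 0.0], cov, size=100_000)
X1 = X[:, 0] - X[:, 0].mean()                 # mean-center the main-effect inputs
X2 = X[:, 1] - X[:, 1].mean()

print(X1.mean(), X2.mean())                   # ~0, ~0
print((X1 * X2).mean())                       # ~rho, not 0: the product term is still uncentered
# Centering X1*X2 instead just moves that mass into the intercept and main effects,
# so the choice of which term to center changes the reported effect sizes.
```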
Is Multiplication an Interaction? One more wrinkle
If we say that Y = X_1X_2 is an interaction effect, then is log(Y) = log(X_1X_2) = log(X_1) + log(X_2) an interaction effect?
Are “AND”, “OR”, “XOR” the same or different?
Suppose we have: Y = f_0 + f_1(X_1) + f_2(X_2) + f_3(X_1, X_2)
Equivalent realizations can look like “AND”, “OR”, or “XOR”.
Pure Interaction Effects
To make things identifiable, let’s define a Pure Interaction Effect of k variables as variance in the outcome which cannot be explained by any function of fewer than k variables.
This gives us an optimization criterion: maximize the variance of the lower-order terms.
Functional ANOVA
Statistical framework designed to decompose a function into orthogonal functions on sets of input variables.
Deep roots: [Hoeffding 1948, Huang 1998, Cuevas 2004, Hooker 2004, Hooker 2007]
Functional ANOVA
Given F(X) where X = (X_1, …, X_d), the weighted fANOVA decomposition [Hooker 2004, 2007] of F(X) is:

$$\{f_u(X_u) \mid u \in [d]\} = \operatorname*{argmin}_{\{g_u \in \mathcal{F}_u\}_{u \in [d]}} \int \Big( \sum_{u \in [d]} g_u(X_u) - F(X) \Big)^{2} w(X)\, dX,$$

where [d] indicates the power set of the d features, such that

$$\int f_u(X_u)\, g_v(X_v)\, w(X)\, dX = 0 \quad \forall\, v \subsetneq u,\ \forall\, g_v.$$
Functional ANOVA
Key property 1 (Orthogonality) [Hooker 2004]:

$$\int f_u(X_u)\, g_v(X_v)\, w(X)\, dX = 0 \quad \forall\, v \subsetneq u$$

Every function f_u is orthogonal to any function f_v which operates on a strict subset of the variables in u.
When w(X) = P(X), this means that the functions in the decomposition are all mean-centered and uncorrelated with functions on fewer variables.
Functional ANOVA
Key property 2 (Existence and Uniqueness) [Hooker 2004]:
Under reasonable assumptions on the joint distribution P(X, Y) (e.g. no duplicated variables), the functional ANOVA decomposition exists and is unique.
Functional ANOVA Example
Y = X_1X_2
[Figure: the decomposed components f_1(X_1), f_2(X_2), f_3(X_1, X_2) under low correlation (ρ_{1,2} = 0.01) and under high correlation (ρ_{1,2} = 0.99).]
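To make the decomposition concrete, below is a hedged sketch of a simple “mass-moving” purification on a discretized two-variable lookup table: repeatedly move the weighted conditional means of the pairwise table into the main effects until the remaining table is (approximately) a pure interaction under the weight distribution w. This is an illustrative re-implementation in the spirit of the AISTATS 2020 paper, not its reference code; the grid, weights, and function below are arbitrary choices.

```python
import numpy as np

def purify_pairwise(T, W, n_iters=50):
    """Split lookup table T[i, j] ~ f(x1_i, x2_j) into main effects + pure interaction.

    W[i, j] is a joint weight (e.g. an empirical density) over the grid; the returned
    interaction table has ~zero weighted conditional means in both directions.
    """
    T = T.copy()
    f1 = np.zeros(T.shape[0])
    f2 = np.zeros(T.shape[1])
    for _ in range(n_iters):
        # Move weighted row means (over x2) into the X1 main effect.
        row_means = (T * W).sum(axis=1) / W.sum(axis=1)
        T -= row_means[:, None]
        f1 += row_means
        # Move weighted column means (over x1) into the X2 main effect.
        col_means = (T * W).sum(axis=0) / W.sum(axis=0)
        T -= col_means[None, :]
        f2 += col_means
    # (A full implementation would also center f1 and f2 into an intercept term.)
    return f1, f2, T

# Illustrative use: Y = X1 * X2 on a uniform grid over [0, 1]^2 (near-independent inputs).
x = np.linspace(0.0, 1.0, 21)
T = np.outer(x, x)                       # f(x1, x2) = x1 * x2
W = np.full((21, 21), 1.0 / 21**2)       # uniform joint weights
f1, f2, pure = purify_pairwise(T, W)
print(np.abs((pure * W).sum(axis=1) / W.sum(axis=1)).max())  # ~0: conditional means purified away
```

Here f1 and f2 end up carrying the part of X_1X_2 that behaves like main effects under this weight distribution, which is exactly the ambiguity the fANOVA decomposition resolves.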
Interaction Effects in Neural Networks
The Challenge of Finding Interaction Effects
• Define: a k-order interaction effect f_u has |u| = k.
• Given d input variables, there are potentially:
  • O(d) interaction effects of order 1
  • O(d^2) interaction effects of order 2
  • O(d^3) interaction effects of order 3
  • …
• How do deep nets learn? How do they generalize to test sets?
Dropout
• “Input Dropout” if we drop input features.
• “Activation Dropout” if we drop hidden activations.
• Dropout rate will refer to the probability that the variable is set to 0.
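For concreteness, a minimal NumPy sketch of the two variants (layer sizes, rates, and the inverted-dropout rescaling convention are illustrative choices, not details from the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(a, rate):
    """Zero each entry with probability `rate`; rescale survivors (inverted dropout)."""
    mask = rng.random(a.shape) >= rate
    return a * mask / (1.0 - rate)

W1 = rng.normal(size=(10, 32))           # toy 2-layer network
W2 = rng.normal(size=(32, 1))
x = rng.normal(size=(4, 10))             # batch of 4 examples, 10 input features

h = np.maximum(dropout(x, rate=0.2) @ W1, 0.0)   # Input Dropout: drop input features
h = dropout(h, rate=0.5)                         # Activation Dropout: drop hidden units
y = h @ W2
```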
Dropout Regularizes Interaction Effects
• With fANOVA, we can decompose the function estimated by each network into orthogonal functions of k variables.
• As we increase the Dropout rate, the estimated function is increasingly made up of low-order effects.
Dropout Preferentially Targets High-Order Effects
Intuition: let’s consider Input Dropout.
For a pure interaction effect of k variables, all k variables must be retained for the interaction effect to survive.
The probability that k variables all survive Input Dropout decays exponentially with k.
This balances out the exponential growth in k of the size of the hypothesis space.
Dropout Preferentially Targets High-Order Effects
Let 𝔼[Y | X] = F(X) + ε with the fANOVA decomposition F(X) = Σ_{u ∈ [d]} f_u(X_u), with 𝔼[Y] = 0. Let X̃ be X perturbed by Input Dropout, and define v = {j ∈ u : X̃_j = 0}. Then

$$\mathbb{E}_{X_u}\!\left[f_u(X_u) \mid \tilde{X}_u\right] = \begin{cases} f_u(\tilde{X}_u) & |v| = 0 \\ 0 & \text{otherwise} \end{cases}$$

If a single variable in u has been dropped, then we have no information about f_u(X_u).
Dropout Preferentially Targets High-Order Effects

$$\mathbb{E}_{X_u}\!\left[f_u(X_u) \mid \tilde{X}_u\right] = \begin{cases} f_u(\tilde{X}_u) & |v| = 0 \\ 0 & \text{otherwise} \end{cases}$$

• What is the probability that |v| = 0? (1 − p)^{|u|}
• Define: r_p(k) = (1 − p)^k, the effective learning rate of a k-order effect.
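A quick Monte-Carlo check of this probability (the dropout rate p and order k below are arbitrary illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)
p, k, n_trials = 0.5, 3, 200_000

retained = rng.random((n_trials, k)) >= p   # True where a variable survives Input Dropout
survives = retained.all(axis=1)             # the k-order effect needs all k variables retained
print(survives.mean(), (1 - p) ** k)        # empirical rate vs. (1 - p)^k
```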
A Symmetry (d = 25)
• Define: r_p(k) = (1 − p)^k, the effective learning rate of a k-order effect.
• |ℋ_k| = C(d, k), the size of the hypothesis space of k-order effects.
• Effective learning rate decay and hypothesis space growth in k balance each other out!
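A small sketch of the two opposing exponentials for d = 25 (the dropout rate p is an arbitrary illustrative value): the effective learning rate r_p(k) decays with k while the number of candidate k-order interactions C(d, k) grows.

```python
from math import comb

d, p = 25, 0.5                               # d from the slide; p is illustrative
for k in range(1, 11):
    r = (1 - p) ** k                         # effective learning rate of a k-order effect
    H = comb(d, k)                           # size of the k-order hypothesis space
    print(f"k={k:2d}  r_p(k)={r:10.2e}  |H_k|={H:10d}")
```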
[Figure: A Symmetry, d = 25.]
[Figures: decomposition results under Activation Dropout, Input Dropout, and Activation + Input Dropout.]
Early Stopping
Neural networks tend to start near simple functions and train toward complex functions [Weigand 1994, De Palma 2019, Nakkiran 2019].
Dropout slows down the training of high-order interactions, making early stopping even more effective.
Implications
• When should we use higher Dropout rates?
  • Higher in later layers
  • Lower in ConvNets
• Explicitly modeling interaction effects
• Dropout for explanations / saliency?
Conclusions
• Interaction effects are tricky: not everything that looks like an interaction is purely an interaction.
• Defining pure interaction effects according to the Functional ANOVA gives us an identifiable form.
• The number of potential interaction effects explodes exponentially with order, so searching for high-order interaction effects from data is impossible in practice.
• Dropout is an effective regularizer against interaction effects. It penalizes higher-order effects more than lower-order effects.
Thank You
Collaborators:
• Eric Xing
• Rich Caruana (MSR)
• Chun-Hao Chang (Toronto)
• Sarah Tan (Facebook)
• Giles Hooker (Cornell)

• Purifying Interaction Effects with the Functional ANOVA. AISTATS 2020.
  ◦ Lengerich, Tan, Chang, Hooker, Caruana
• On Dropout, Overfitting, and Interaction Effects in Deep Neural Networks. Under Review, 2020.
  ◦ Lengerich, Xing, Caruana
Dropout Preferentially Targets High-Order Effects
Let 𝔼[Y | X] = F(X) + ε with the fANOVA decomposition F(X) = Σ_{u ∈ [d]} f_u(X_u), with 𝔼[Y] = 0. Let X̃ be X perturbed by Input Dropout, and define v = {j ∈ u : X̃_j = 0}. Then

$$\begin{aligned}
\mathbb{E}_{X_u}\!\left[f_u(X_u) \mid \tilde{X}_u\right]
&= \int f_u(X_u)\, P(X_u \mid \tilde{X})\, dX_u \\
&= \int f_u(X_u)\, I(X_{u \setminus v} = \tilde{X}_{u \setminus v})\, P(X_v \mid \tilde{X})\, dX_u \\
&= \int f_u(X_v, \tilde{X}_{u \setminus v})\, P(X_v \mid \tilde{X})\, dX_v \\
&= \begin{cases} f_u(\tilde{X}_u) & |v| = 0 \\ 0 & \text{otherwise} \end{cases}
\end{aligned}$$

Advantage of using fANOVA to define f_u: when |v| > 0, these integrals are zero!