

1. Interaction Effects: Helpful or Harmful? Ben Lengerich, CMU AI Seminar, Feb 18, 2020

2. Today
1. What is an Interaction Effect?
2. Interaction Effects in Neural Networks
Based on:
• Purifying Interaction Effects with the Functional ANOVA. AISTATS 2020.
◦ Lengerich, Tan, Chang, Hooker, Caruana
• On Dropout, Overfitting, and Interaction Effects in Deep Neural Networks. Under Review, 2020.
◦ Lengerich, Xing, Caruana

3. Why do we care about interaction effects?
• Interpreting models
• Identifiability
• Understanding how big machine learning models work

4. What is an Interaction Effect?
Intuitively: “Effect of one variable changes based on the value of another variable.”
But this definition is incomplete: 3 stories.

5. Is “AND” an Interaction Effect?
Suppose we have data Y = AND(X_1, X_2) with Boolean X_1, X_2. Let's fit an additive model (no interactions):
Y = f_0 + f_1(X_1) + f_2(X_2)
How well can we fit the data? Perfectly*!
[Slide figure: truth tables over X_1, X_2.]
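A small numeric check of this slide's claim follows. It is a sketch, not from the talk: it assumes the asterisk means "perfect once the predictions are thresholded at 0.5", and it fits the additive model by ordinary least squares.

```python
import numpy as np

# All four Boolean inputs and the AND labels.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 0, 0, 1], dtype=float)

# With Boolean inputs, Y = f_0 + f_1(X_1) + f_2(X_2) reduces to
# linear regression on [1, X_1, X_2].
A = np.column_stack([np.ones(4), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
pred = A @ coef

print("coefficients:", np.round(coef, 2))   # ~[-0.25, 0.5, 0.5]
print("predictions:", np.round(pred, 2))    # ~[-0.25, 0.25, 0.25, 0.75]
print("perfect as a classifier:", np.array_equal(pred > 0.5, y == 1))
```

The squared error is not zero, but only the (1, 1) cell lands above the threshold, which is one plausible reading of the asterisk.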

6. Is Multiplication an Interaction?
Common model: Y = a + b X_1 + c X_2 + d X_1 X_2
But this is equivalent to: Y = (a − d αβ) + (b + d β) X_1 + (c + d α) X_2 + d (X_1 − α)(X_2 − β)
We can pick any offsets α, β without changing the function output. Picking different values of α, β drastically changes the interpretation.
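A quick numeric check of the reparameterization (a sketch; the coefficient values and the α, β choices below are arbitrary examples, not from the slide): both forms produce identical outputs, but the variance carried by the "interaction" term depends on α, β.

```python
import numpy as np

rng = np.random.default_rng(0)
X1, X2 = rng.normal(size=1000), rng.normal(size=1000)
a, b, c, d = 1.0, 2.0, -1.0, 3.0          # arbitrary example coefficients

def original(x1, x2):
    return a + b * x1 + c * x2 + d * x1 * x2

def reparameterized(x1, x2, alpha, beta):
    # Same function, different split between main effects and "interaction".
    return ((a - d * alpha * beta) + (b + d * beta) * x1
            + (c + d * alpha) * x2 + d * (x1 - alpha) * (x2 - beta))

for alpha, beta in [(0.0, 0.0), (1.5, -2.0)]:
    assert np.allclose(original(X1, X2), reparameterized(X1, X2, alpha, beta))
    interaction = d * (X1 - alpha) * (X2 - beta)
    print(f"alpha={alpha}, beta={beta}: interaction-term variance = {interaction.var():.1f}")
```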

7. Is Multiplication an Interaction?
Y = (a − d αβ) + (b + d β) X_1 + (c + d α) X_2 + d (X_1 − α)(X_2 − β)
Picking different values of α, β drastically changes the interpretation.
[Slide figure: the same function attributed as 100% interaction effect vs. 20% interaction effect.]

8. Is Multiplication an Interaction? Mean-Center?
• Does mean-centering solve this problem?
• No: if the correlation ρ(X_1, X_2) is not zero, then we can't simultaneously center X_1, X_2, and X_1 X_2.
• Choosing which term to center changes the interpretation!
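A numeric illustration of the centering problem (a sketch; the ρ = 0.7 Gaussian inputs are an assumption for the example): once X_1 and X_2 are centered, the product term's mean equals their covariance, so it cannot also be centered unless the correlation is zero.

```python
import numpy as np

rng = np.random.default_rng(0)
cov = [[1.0, 0.7], [0.7, 1.0]]            # correlated inputs, rho = 0.7
X1, X2 = rng.multivariate_normal([0.0, 0.0], cov, size=100_000).T

# X_1 and X_2 are already (approximately) mean-centered...
print("mean(X1), mean(X2):", round(X1.mean(), 3), round(X2.mean(), 3))

# ...but their product is not: E[X1 * X2] = Cov(X1, X2) = 0.7 here.
print("mean(X1 * X2):", round((X1 * X2).mean(), 3))
```

Centering the product instead shifts mass into the intercept and main effects, which is exactly the interpretation problem from the previous slide.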

9. Is Multiplication an Interaction? One more wrinkle
If we say that Y = X_1 X_2 is an interaction effect, then is log(Y) = log(X_1 X_2) = log(X_1) + log(X_2) an interaction effect?

10. Are “AND”, “OR”, “XOR” the same or different?
Suppose we have: Y = f_0 + f_1(X_1) + f_2(X_2) + f_3(X_1, X_2)
Equivalent realizations can look like “AND”, “OR”, or “XOR”.

11. Pure Interaction Effects
To make things identifiable, let's define a Pure Interaction Effect of k variables as variance in the outcome which cannot be explained by any function of fewer than k variables.
This gives us an optimization criterion: maximize the variance of the lower-order terms.

12. Functional ANOVA
A statistical framework designed to decompose a function into orthogonal functions on sets of input variables.
Deep roots: [Hoeffding 1948, Huang 1998, Cuevas 2004, Hooker 2004, Hooker 2007]

13. Functional ANOVA
Given F(X) where X = (X_1, …, X_d), the weighted fANOVA decomposition [Hooker 2004, 2007] of F(X) is:
{ f_u(X_u) | u ⊆ [d] } = argmin_{ {g_u ∈ ℱ_u : u ⊆ [d]} } ∫ ( Σ_{u ⊆ [d]} g_u(X_u) − F(X) )² w(X) dX,
where u ranges over the power set of the d features, subject to
∫ f_u(X_u) g_v(X_v) w(X) dX = 0   ∀ v ⊊ u, ∀ g_v.

14. Functional ANOVA
Key property 1 (Orthogonality) [Hooker 2004]:
∫ f_u(X_u) g_v(X_v) w(X) dX = 0   ∀ v ⊊ u
Every function f_u is orthogonal to any function f_v which operates on a proper subset of the variables in u.
When w(X) = P(X), this means that the functions in the decomposition are all mean-centered and uncorrelated with functions on fewer variables.
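A tiny worked example of these properties (a sketch, assuming independent, uniform Boolean inputs, w(X) = P(X), and Y = AND(X_1, X_2); under independence the components reduce to conditional means, which is not true in general):

```python
import numpy as np
from itertools import product

# Uniform, independent Boolean inputs; Y = AND(X_1, X_2).
points = np.array(list(product([0, 1], repeat=2)), dtype=float)
w = np.full(4, 0.25)                       # w(X) = P(X), uniform here
Y = points[:, 0] * points[:, 1]

f0 = np.sum(w * Y)                         # grand mean

def main_effect(j):
    # Under independence: f_j(x) = E[Y | X_j = x] - f_0.
    return np.array([np.sum(w[points[:, j] == x] * Y[points[:, j] == x])
                     / np.sum(w[points[:, j] == x]) - f0 for x in (0, 1)])

f1, f2 = main_effect(0), main_effect(1)
idx1, idx2 = points[:, 0].astype(int), points[:, 1].astype(int)
f12 = Y - f0 - f1[idx1] - f2[idx2]         # pure pairwise interaction

# Mean-centered and uncorrelated with the lower-order components:
print("E[f12]      =", np.sum(w * f12))
print("E[f1 * f12] =", np.sum(w * f1[idx1] * f12))
print("E[f2 * f12] =", np.sum(w * f2[idx2] * f12))
```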

15. Functional ANOVA
Key property 2 (Existence and Uniqueness) [Hooker 2004]:
Under reasonable assumptions on the joint distribution P(X, Y) (e.g. no duplicated variables), the functional ANOVA decomposition exists and is unique.

16. Functional ANOVA Example
Y = X_1 X_2
[Slide figure: estimated components f_1(X_1), f_2(X_2), f_3(X_1, X_2) for ρ_{1,2} = 0.01 vs. ρ_{1,2} = 0.99.]
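The figure itself is not reproduced here, but the phenomenon can be mimicked numerically. A rough stand-in (assumptions: standard-normal marginals, an additive surrogate that is quadratic in each variable, and a sample size chosen for convenience): when ρ is near 1, additive terms explain almost all of X_1 X_2; when ρ is near 0, they explain almost none of it.

```python
import numpy as np

rng = np.random.default_rng(0)

def additive_r2(rho, n=200_000):
    # Y = X_1 * X_2 with correlated Gaussian inputs; fit a model that is
    # additive in X_1 and X_2 (here: quadratic in each variable separately).
    X1, X2 = rng.multivariate_normal([0, 0], [[1, rho], [rho, 1]], size=n).T
    Y = X1 * X2
    A = np.column_stack([np.ones(n), X1, X1**2, X2, X2**2])
    coef, *_ = np.linalg.lstsq(A, Y, rcond=None)
    return 1 - (Y - A @ coef).var() / Y.var()

for rho in (0.01, 0.99):
    print(f"rho = {rho}: additive fit R^2 = {additive_r2(rho):.2f}")
```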

17. Interaction Effects in Neural Networks

18. The Challenge of Finding Interaction Effects
• Define: f_u is a k-order interaction effect if |u| = k.
• Given d input variables, there are potentially:
• O(d) interaction effects of order 1
• O(d^2) interaction effects of order 2
• O(d^3) interaction effects of order 3
• …
• How do deep nets learn? How do they generalize to test sets?
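A short counting sketch of this blow-up (d = 100 is just an illustrative choice, not a number from the slide):

```python
from math import comb

d = 100                                    # hypothetical number of input features
for k in range(1, 6):
    print(f"order-{k} interaction effects: {comb(d, k):,}")
```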

19. Dropout
• “Input Dropout” if we drop input features.
• “Activation Dropout” if we drop hidden activations.
• The Dropout rate will refer to the probability that a variable is set to 0.
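A minimal numpy sketch of the two variants for a single layer (the inverted-dropout rescaling, ReLU activation, and layer sizes are implementation assumptions, not specified on the slide):

```python
import numpy as np

rng = np.random.default_rng(0)
p = 0.5                                    # Dropout rate: P(a variable is set to 0)

def dropout(a, p):
    # Zero each entry with probability p; rescale survivors (inverted dropout).
    mask = rng.random(a.shape) >= p
    return a * mask / (1.0 - p)

x = rng.normal(size=(4, 8))                # mini-batch of 4 examples, 8 input features
W = rng.normal(size=(8, 16))

h_input = np.maximum(dropout(x, p) @ W, 0.0)        # "Input Dropout": drop input features
h_activation = dropout(np.maximum(x @ W, 0.0), p)   # "Activation Dropout": drop hidden units
print(h_input.shape, h_activation.shape)
```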

20. Dropout Regularizes Interaction Effects
• With fANOVA, we can decompose the function estimated by each network into orthogonal functions of k variables.
• As we increase the Dropout rate, the estimated function is increasingly made up of low-order effects.

21. Dropout Preferentially Targets High-Order Effects
Intuition: let's consider Input Dropout. For a pure interaction effect of k variables, all k variables must be retained for the interaction effect to survive.
The probability that k variables all survive Input Dropout decays exponentially with k.
This balances out the exponential growth in k of the size of the hypothesis space.

22. Dropout Preferentially Targets High-Order Effects
Let 𝔼[Y | X] = F(X) + ε with the fANOVA decomposition F(X) = Σ_{u ⊆ [d]} f_u(X_u), and 𝔼[Y] = 0. Let X̃ be X perturbed by Input Dropout, and define v = { j ∈ u : X̃_j = 0 }. Then
𝔼_{X_u}[ f_u(X_u) | X̃_u ] = f_u(X̃_u) if |v| = 0, and 0 otherwise.
If even a single variable in u has been dropped, then we have no information about f_u(X_u).
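A small simulation of the zero case (a sketch, assuming independent uniform Boolean inputs and reusing the pure AND interaction from the fANOVA example above; because f12 is mean-zero in each argument, averaging over a dropped variable kills the term):

```python
import numpy as np

def f12(x1, x2):
    # Pure pairwise interaction of AND(X_1, X_2): +0.25 if x1 == x2, else -0.25.
    return np.where(x1 == x2, 0.25, -0.25)

rng = np.random.default_rng(0)
X2_samples = rng.integers(0, 2, size=100_000)   # unknown true values of a dropped X_2

# If X_2 is dropped, its true value is unknown; the expected contribution is ~0.
for x1 in (0, 1):
    print(f"E[f12(X1={x1}, X2)] ~ {f12(x1, X2_samples).mean():.4f}")

# If no variable in u is dropped, the term passes through as f12(X_1, X_2) itself.
```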

23. Dropout Preferentially Targets High-Order Effects
𝔼_{X_u}[ f_u(X_u) | X̃_u ] = f_u(X̃_u) if |v| = 0, and 0 otherwise.
• What is the probability that |v| = 0?
• (1 − p)^{|u|}
• Define: r_p(k) = (1 − p)^k, the effective learning rate of a k-order effect.

24. A Symmetry (d = 25)
• Define: r_p(k) = (1 − p)^k, the effective learning rate of a k-order effect.
• |ℋ_k| = (d choose k), the hypothesis space size.
• Effective learning rate decay and hypothesis space growth in k balance each other out!
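The two quantities from this slide, tabulated (a sketch; how exactly they balance is the content of the slide's figure, which is not reproduced here, and the Dropout rates below are arbitrary examples):

```python
from math import comb

d = 25                                     # as in the slide's example
for p in (0.2, 0.5, 0.8):
    for k in (1, 2, 3, 5, 10):
        hk = comb(d, k)                    # |H_k|: order-k hypothesis space size
        rk = (1.0 - p) ** k                # r_p(k): effective learning rate
        print(f"p={p}  k={k:2d}  |H_k|={hk:>9,}  r_p(k)={rk:.5f}")
```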

25. A Symmetry (d = 25)
[Slide figure.]

26. [Slide figure: decompositions under Activation, Input, and Activation+Input Dropout.]

27. [Slide figure: decompositions under Activation, Input, and Activation+Input Dropout.]

28. Early Stopping
Neural networks tend to start near simple functions and train toward complex functions [Weigand 1994, De Palma 2019, Nakkiran 2019].
Dropout slows down the training of high-order interactions, making early stopping even more effective.

29. Implications
• When should we use higher Dropout rates?
• Higher in later layers
• Lower in ConvNets
• Explicitly modeling interaction effects
• Dropout for explanations / saliency?

30. Conclusions
• Interaction effects are tricky: not everything that looks like an interaction is a pure interaction.
• Defining pure interaction effects according to the Functional ANOVA gives us an identifiable form.
• The number of potential interaction effects explodes exponentially with order, so searching for high-order interaction effects from data is impossible in practice.
• Dropout is an effective regularizer against interaction effects. It penalizes higher-order effects more than lower-order effects.

31. Thank You
Collaborators:
• Eric Xing
• Rich Caruana (MSR)
• Chun-Hao Chang (Toronto)
• Sarah Tan (Facebook)
• Giles Hooker (Cornell)
• Purifying Interaction Effects with the Functional ANOVA. AISTATS 2020.
◦ Lengerich, Tan, Chang, Hooker, Caruana
• On Dropout, Overfitting, and Interaction Effects in Deep Neural Networks. Under Review, 2020.
◦ Lengerich, Xing, Caruana

32.

33. Dropout Preferentially Targets High-Order Effects
Let 𝔼[Y | X] = F(X) + ε with the fANOVA decomposition F(X) = Σ_{u ⊆ [d]} f_u(X_u), and 𝔼[Y] = 0. Let X̃ be X perturbed by Input Dropout, and define v = { j ∈ u : X̃_j = 0 }. Then
𝔼_{X_u}[ f_u(X_u) | X̃_u ] = ∫ f_u(X_u) P(X_u | X̃) dX_u
= ∫ f_u(X_u) I(X_{u∖v} = X̃_{u∖v}) P(X_v | X̃) dX_u
= ∫ f_u(X_v, X̃_{u∖v}) P(X_v | X̃) dX_v
= f_u(X̃_u) if |v| = 0, and 0 otherwise.
Advantage of using fANOVA to define f_u: in the dropped case, these integrals are zero!
