Causality – in a wide sense, Lecture IV. Peter Bühlmann, Seminar for Statistics, ETH Zürich
Recap from yesterday
data from different known, observed environments or experimental conditions or perturbations or sub-populations e ∈ E:
$(X^e, Y^e) \sim F^e, \quad e \in \mathcal{E},$
with response variables $Y^e$ and predictor variables $X^e$
consider “many possible” but mostly non-observed environments/perturbations F ⊃ E (only E is observed)
a pragmatic prediction problem: predict Y given X such that the prediction works well (is “robust”) for “many possible” environments e ∈ F, based on data from much fewer environments from E
the causal parameter optimizes a worst case risk:
$\beta_{\mathrm{causal}} \in \mathrm{argmin}_\beta \max_{e \in \mathcal{F}} E[(Y^e - (X^e)^T \beta)^2]$
if F = { arbitrarily strong perturbations not acting directly on Y }
agenda for today: consider other classes F ... and give up on causality
Anchor regression: a way to formalize the extrapolation from E to F (Rothenhäusler, Meinshausen, PB & Peters, 2018)
the environments from before, denoted as e: they are now outcomes of a variable A, the “anchor”
[graph: A → X, hidden H → X and Y, X → Y with coefficient β⁰; the effect of A is marked with “?”]
Anchor regression and causal regularization (Rothenhäusler, Meinshausen, PB & Peters, 2018)
the environments from before, denoted as e: they are now outcomes of a variable A, the “anchor”
[graph: A → X, hidden H → X and Y, X → Y with coefficient β⁰]
$Y \leftarrow X\beta^0 + \varepsilon_Y + H\delta, \qquad X \leftarrow A\alpha^0 + \varepsilon_X + H\gamma$
instrumental variables regression model (cf. Angrist, Imbens, Lemieux, Newey, Rosenbaum, Rubin, ...)
Anchor regression and causal regularization (Rothenhäusler, Meinshausen, PB & Peters, 2018)
the environments from before, denoted as e: they are now outcomes of a variable A, the “anchor” (a source node!)
[graph: A → X, hidden H → X and Y, X → Y with coefficient β⁰]
❀ Anchor regression model:
$\begin{pmatrix} X \\ Y \\ H \end{pmatrix} \leftarrow B \begin{pmatrix} X \\ Y \\ H \end{pmatrix} + \varepsilon + M A$
Anchor regression and causal regularization (Rothenhäusler, Meinshausen, PB & Peters, 2018)
the environments from before, denoted as e: they are now outcomes of a variable A, the “anchor” (a source node!), allowing also for feedback loops
[graph: A → X, hidden H → X and Y, X → Y with coefficient β⁰]
❀ Anchor regression model:
$\begin{pmatrix} X \\ Y \\ H \end{pmatrix} \leftarrow B \begin{pmatrix} X \\ Y \\ H \end{pmatrix} + \varepsilon + M A$
allow that A acts also on Y and H
❀ there is a fundamental identifiability problem: we cannot identify β⁰
this is the price for more realistic assumptions than in the IV model
... but “Causal Regularization” offers something
find a parameter vector β such that the residuals (Y − Xβ) stabilize, i.e. have the same distribution across perturbations of A (= environments/sub-populations)
we want to encourage orthogonality of the residuals with A, something like
$\tilde\beta = \mathrm{argmin}_\beta\, \|Y - X\beta\|_2^2/n + \xi\, \|A^T(Y - X\beta)/n\|_2^2$
$\tilde\beta = \mathrm{argmin}_\beta\, \|Y - X\beta\|_2^2/n + \xi\, \|A^T(Y - X\beta)/n\|_2^2$
causal regularization:
$\hat\beta = \mathrm{argmin}_\beta\, \|(I - \Pi_A)(Y - X\beta)\|_2^2/n + \gamma\, \|\Pi_A(Y - X\beta)\|_2^2/n$
$\Pi_A = A(A^T A)^{-1} A^T$ (projection onto the column space of A)
◮ for γ = 1: least squares
◮ for γ = 0: adjusting for heterogeneity due to A
◮ for 0 ≤ γ < ∞: general causal regularization
$\tilde\beta = \mathrm{argmin}_\beta\, \|Y - X\beta\|_2^2/n + \xi\, \|A^T(Y - X\beta)/n\|_2^2$
causal regularization:
$\hat\beta = \mathrm{argmin}_\beta\, \|(I - \Pi_A)(Y - X\beta)\|_2^2/n + \gamma\, \|\Pi_A(Y - X\beta)\|_2^2/n + \lambda \|\beta\|_1$
$\Pi_A = A(A^T A)^{-1} A^T$ (projection onto the column space of A)
◮ for γ = 1: least squares + ℓ1-penalty
◮ for γ = 0: adjusting for heterogeneity due to A + ℓ1-penalty
◮ for 0 ≤ γ < ∞: general causal regularization + ℓ1-penalty
convex optimization problem
It’s simply a linear transformation
consider $W_\gamma = I - (1 - \sqrt{\gamma})\,\Pi_A$, $\tilde X = W_\gamma X$, $\tilde Y = W_\gamma Y$
then: (ℓ1-regularized) anchor regression is (Lasso-penalized) least squares of $\tilde Y$ versus $\tilde X$
❀ super-easy (but one has to choose the tuning parameter γ)
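a minimal Python sketch of this transformation (illustrative only; the variable names, toy dimensions and the use of scikit-learn’s Lasso are our own choices, not from the paper):

```python
import numpy as np
from sklearn.linear_model import Lasso

def anchor_transform(X, Y, A, gamma):
    """Apply W_gamma = I - (1 - sqrt(gamma)) * Pi_A to X and Y,
    where Pi_A projects onto the column space of the anchors A."""
    Pi_A = A @ np.linalg.pinv(A)                # = A (A^T A)^{-1} A^T for full-column-rank A
    W = np.eye(len(Y)) - (1.0 - np.sqrt(gamma)) * Pi_A
    return W @ X, W @ Y

def anchor_regression(X, Y, A, gamma, lam=0.0):
    """(l1-regularized) anchor regression = (Lasso-penalized) least squares
    on the transformed data; lam plays the role of lambda, up to
    scikit-learn's scaling convention for the Lasso objective."""
    X_t, Y_t = anchor_transform(X, Y, A, gamma)
    if lam == 0.0:
        beta, *_ = np.linalg.lstsq(X_t, Y_t, rcond=None)   # plain anchor regression
        return beta
    return Lasso(alpha=lam, fit_intercept=False).fit(X_t, Y_t).coef_
```

note that for γ = 1 the transformation is the identity (ordinary least squares / Lasso), and for γ = 0 it removes the part of the data explained by A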
... there is a fundamental identifiability problem ... but causal regularization solves
$\mathrm{argmin}_\beta \max_{e \in \mathcal{F}} E[|Y^e - X^e\beta|^2]$
for a certain class of shift perturbations F
recap: the causal parameter solves $\mathrm{argmin}_\beta \max_{e \in \mathcal{F}} E[|Y^e - X^e\beta|^2]$ for F = “essentially all” perturbations
Model for F: shift perturbations
model for observed heterogeneous data (“corresponding to E”):
$\begin{pmatrix} X \\ Y \\ H \end{pmatrix} = B \begin{pmatrix} X \\ Y \\ H \end{pmatrix} + \varepsilon + M A$
model for unobserved perturbations F (in test data): shift vectors v acting on (components of) X, Y, H
$\begin{pmatrix} X^v \\ Y^v \\ H^v \end{pmatrix} = B \begin{pmatrix} X^v \\ Y^v \\ H^v \end{pmatrix} + \varepsilon + v$
$v \in C_\gamma \subset \mathrm{span}(M)$, with γ measuring the size of v, i.e.
$C_\gamma = \{ v;\ v = Mu \text{ for some } u \text{ with } E[uu^T] \preceq \gamma\, E[AA^T] \}$
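a toy simulation of this model (a sketch under simplifying assumptions: one-dimensional A, X, H, an acyclic B as in the IV-type graph above, and illustrative coefficients; the shift v acts on the X-equation and lies in span(M)):

```python
import numpy as np

rng = np.random.default_rng(0)
alpha0, gamma_h, beta0, delta = 1.0, 1.0, 1.0, 1.0   # illustrative coefficients

def simulate(n, shift=0.0):
    """Draw from the acyclic anchor model A -> X <- H -> Y <- X;
    `shift` is a deterministic shift v on the X-equation (v in span(M))."""
    A = rng.normal(size=n)
    H = rng.normal(size=n)
    X = alpha0 * A + gamma_h * H + rng.normal(size=n) + shift
    Y = beta0 * X + delta * H + rng.normal(size=n)
    return A, X, Y

A_tr, X_tr, Y_tr = simulate(2000)               # heterogeneous training data ("E")
_,    X_te, Y_te = simulate(2000, shift=5.0)    # shift-perturbed test data ("F")
```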
A fundamental duality theorem (Rothenhäusler, Meinshausen, PB & Peters, 2018)
$P_A$ the population projection onto A: $P_A\,\bullet = E[\bullet \mid A]$
For any β:
$\max_{v \in C_\gamma} E[|Y^v - X^v\beta|^2] = E\big[\{(\mathrm{Id} - P_A)(Y - X\beta)\}^2\big] + \gamma\, E\big[\{P_A(Y - X\beta)\}^2\big]$
$\approx \underbrace{\|(I - \Pi_A)(Y - X\beta)\|_2^2/n + \gamma\, \|\Pi_A(Y - X\beta)\|_2^2/n}_{\text{objective function on data}}$
worst case shift interventions ←→ regularization! (in the population case)
for any β:
$\underbrace{\max_{v \in C_\gamma} E[|Y^v - X^v\beta|^2]}_{\text{worst case test error}} = \underbrace{E\big[\{(\mathrm{Id} - P_A)(Y - X\beta)\}^2\big] + \gamma\, E\big[\{P_A(Y - X\beta)\}^2\big]}_{\text{criterion on training population sample}}$
$\mathrm{argmin}_\beta \underbrace{\max_{v \in C_\gamma} E[|Y^v - X^v\beta|^2]}_{\text{worst case test error}} = \mathrm{argmin}_\beta \underbrace{E\big[\{(\mathrm{Id} - P_A)(Y - X\beta)\}^2\big] + \gamma\, E\big[\{P_A(Y - X\beta)\}^2\big]}_{\text{criterion on training population sample}}$
and “therefore” also a finite sample guarantee:
$\hat\beta = \mathrm{argmin}_\beta\, \|(I - \Pi_A)(Y - X\beta)\|_2^2/n + \gamma\, \|\Pi_A(Y - X\beta)\|_2^2/n\ (+\ \lambda\|\beta\|_1)$
leads to predictive stability (i.e. optimizing a worst case risk)
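continuing the toy simulation (reusing `anchor_regression` and the simulated data from the sketches above), a rough numerical illustration of this guarantee: larger γ typically trades a slightly worse in-sample fit for a smaller error under the strong shift (exact numbers depend on the seed and the illustrative coefficients):

```python
# fit anchor regression for several gamma and evaluate on the shifted test data
for gamma in [0.0, 1.0, 10.0, 100.0]:
    beta = anchor_regression(X_tr.reshape(-1, 1), Y_tr, A_tr.reshape(-1, 1), gamma)
    mse_train = np.mean((Y_tr - X_tr * beta[0]) ** 2)
    mse_shift = np.mean((Y_te - X_te * beta[0]) ** 2)
    print(f"gamma={gamma:6.1f}  beta={beta[0]:.3f}  "
          f"train MSE={mse_train:.2f}  shifted-test MSE={mse_shift:.2f}")
```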
fundamental duality in the anchor regression model:
$\max_{v \in C_\gamma} E[|Y^v - X^v\beta|^2] = E\big[\{(\mathrm{Id} - P_A)(Y - X\beta)\}^2\big] + \gamma\, E\big[\{P_A(Y - X\beta)\}^2\big]$
❀ robustness ←→ causal regularization
Causality (e.g. Judea Pearl) ←→ Adversarial Robustness and Generative Networks in machine learning (e.g. Ian Goodfellow)
robustness ←→ causal regularization
the languages are rather different:
causality:
◮ causal graphs
◮ Markov properties on graphs
◮ perturbation models
◮ identifiability of systems
◮ transferability of systems
◮ ...
robustness:
◮ metric for robustness (Wasserstein, f-divergence)
◮ minimax optimality
◮ inner and outer optimization
◮ regularization
◮ ...
mathematics allows us to classify equivalences and differences
❀ can be exploited for better methods and algorithms, taking “the good” from both worlds!
indeed: causal regularization is nowadays used (still as a “side-branch”) in robust deep learning
Bottou et al. (2013), ..., Heinze-Deml & Meinshausen (2017), ...
and indeed, we can improve prediction
Stickmen classification (Heinze-Deml & Meinshausen, 2017)
classification into {child, adult} based on stickmen images; 5-layer CNN, training data (n = 20’000)

                             5-layer CNN    5-layer CNN with some causal regularization
training set                     4%              4%
test set 1                       3%              4%
test set 2 (domain shift)       41%              9%

in training and test set 1: children show stronger movement than adults
in test set 2: adults show stronger movement
❀ the spurious correlation between age and movement is reversed!
Connection to distributionally robust optimization (Ben-Tal, El Ghaoui & Nemirovski, 2009; Sinha, Namkoong & Duchi, 2017)
$\mathrm{argmin}_\beta \max_{P \in \mathcal{P}} E_P[(Y - X\beta)^2]$
perturbations are within a class of distributions $\mathcal{P} = \{P;\ d(P, P_0) \le \rho\}$, with $P_0$ the empirical distribution
the “model” is the metric d(·,·), and it is simply postulated, often as the Wasserstein distance
[figure: perturbations from distributional robustness: a ball of radius ρ around $P_0$ w.r.t. the metric d(·,·)]
our anchor regression approach: $b^\gamma = \mathrm{argmin}_\beta \max_{v \in C_\gamma} E[|Y^v - X^v\beta|^2]$
perturbations are assumed to come from a causal-type model; the class of perturbations is learned from data
[figure: anchor regression: perturbations learned from data and amplified; robust optimization: a pre-specified radius]
anchor regression: the class of perturbations is an amplification of the observed and learned heterogeneity from E
Science aims for causal understanding ... but this may be a bit ambitious ...
in absence of randomized studies, causal inference necessarily requires (often untestable) additional assumptions
in the anchor regression model: we cannot find/identify the causal (“systems”) parameter β⁰
[graph: A → X, hidden H → X and Y, X → Y with coefficient β⁰]
The parameter $b^{\to\infty}$: “diluted causality”
$b^\gamma = \mathrm{argmin}_\beta\, E\big[\{(\mathrm{Id} - P_A)(Y - X\beta)\}^2\big] + \gamma\, E\big[\{P_A(Y - X\beta)\}^2\big]$
$b^{\to\infty} = \lim_{\gamma \to \infty} b^\gamma$
by the fundamental duality: it leads to “invariance”; it is the parameter which optimizes the worst case prediction risk over shift interventions of arbitrary strength
it is generally not the causal parameter, but because of shift invariance we name it “diluted causal”
note: causal = invariance w.r.t. very many perturbations
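in the toy simulation above (reusing `anchor_regression` and the training data), $b^{\to\infty}$ can be approximated by letting γ grow; there the estimate stabilizes near β⁰ = 1 only because that toy model happens to satisfy the IV conditions (A acts on X alone); in general the limit is merely the “diluted causal” parameter:

```python
# approximate b^{-> infinity} by letting gamma grow
for gamma in [1.0, 1e2, 1e4, 1e6]:
    beta = anchor_regression(X_tr.reshape(-1, 1), Y_tr, A_tr.reshape(-1, 1), gamma)
    print(f"gamma={gamma:9.0f}  beta_gamma={beta[0]:.3f}")
```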