Learning to Pivot with Adversarial Networks
arXiv:1611.01046
Gilles Louppe, Michael Kagan, Kyle Cranmer
December 15, 2016

Systematic uncertainties: the known unknowns in science
• In science, the data generation process is often not uniquely specified or known exactly, hence the presence of systematic uncertainties.
• Data generation is therefore formulated as a family of data generation processes parametrized by nuisance parameters.
• One of the challenges of applying machine learning to scientific problems is the need to incorporate systematics.

Problem statement
• Let us assume a family of data generation processes $p(X, Y, Z)$, where $X$ are the data, $Y$ are the target labels, and $Z$ are the nuisance parameters specifying systematic uncertainties.
• We want to learn a regression function $f : \mathcal{X} \to \mathcal{S}$ with parameters $\theta_f$.
• We want inference based on $f(X; \theta_f)$ to be robust to the value $z \in \mathcal{Z}$ of the nuisance parameter, which remains unknown at test time.
We want a classifier that does not change with systematic variations, even though the data might.

Pivot
• We define robustness as requiring the distribution of $f(X; \theta_f)$ conditional on $Z$ (and possibly $Y$) to be invariant with respect to the nuisance parameter $Z$. That is, we require
$$p(f(X; \theta_f) = s \mid z) = p(f(X; \theta_f) = s \mid z')$$
for all $z, z' \in \mathcal{Z}$ and all values $s \in \mathcal{S}$ of $f(X; \theta_f)$. If $f$ satisfies this criterion, then $f$ is known as a pivotal quantity.
• Equivalently, this amounts to finding $f$ such that $f(X; \theta_f)$ and $Z$ are independent random variables.
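One rough way to probe the criterion above empirically is to compare samples of $f(X)$ obtained under two fixed nuisance values. The sketch below computes a two-sample Kolmogorov-Smirnov statistic with NumPy; the function name `ks_statistic` and the synthetic scores are illustrative assumptions, not part of the talk.

```python
import numpy as np

def ks_statistic(a, b):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    a, b = np.sort(a), np.sort(b)
    grid = np.concatenate([a, b])
    # Empirical CDF of each sample evaluated at every observed point.
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

# Synthetic stand-ins for f(X) | Z = z and f(X) | Z = z' (hypothetical data).
rng = np.random.default_rng(0)
same = ks_statistic(rng.normal(size=2000), rng.normal(size=2000))
shifted = ks_statistic(rng.normal(size=2000), rng.normal(loc=1.0, size=2000))
```

A statistic close to 0 is compatible with the two conditional distributions being equal; a large value indicates that the output still depends on the nuisance parameter.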

Adversarial Networks
• Let us consider a classifier $f$ built as usual, by minimizing the cross-entropy
$$\mathcal{L}_f(\theta_f) = \mathbb{E}_{x \sim X} \mathbb{E}_{y \sim Y|x} \left[ -\log p_{\theta_f}(y \mid x) \right].$$
• We pit $f$ against an adversary network $r$ whose output is a function $p_{\theta_r}(z \mid f(X; \theta_f) = s)$ modeling the posterior probability density of the nuisance parameter conditional on $f(X; \theta_f) = s$. We set $r$ to minimize the cross-entropy
$$\mathcal{L}_r(\theta_f, \theta_r) = \mathbb{E}_{s \sim f(X; \theta_f)} \mathbb{E}_{z \sim Z|s} \left[ -\log p_{\theta_r}(z \mid s) \right].$$
If the adversary can predict the nuisance parameter from the classifier's output, then some information about the nuisance parameter is carried through that output: the classifier depends on the systematics.

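[Figure: architecture. The classifier $f$ (parameters $\theta_f$, loss $\mathcal{L}_f(\theta_f)$) maps $X$ to $f(X; \theta_f)$. The adversary $r$ (parameters $\theta_r$, loss $\mathcal{L}_r(\theta_f, \theta_r)$) takes $f(X; \theta_f)$ as input and outputs parameters $\gamma_1(f(X; \theta_f); \theta_r), \gamma_2(f(X; \theta_f); \theta_r), \ldots$ of a distribution $P(\gamma_1, \gamma_2, \ldots)$ that models $p_{\theta_r}(Z \mid f(X; \theta_f))$.]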

Z can be either categorical or continuous
• If $Z$ is categorical, then the posterior can be modeled with a standard (probabilistic) classifier.
• If $Z$ is continuous, then the posterior can be modeled with a mixture density network.
• There is no constraint on the prior $p(Z)$.
[Figure: mixture density network]
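For the continuous case, the adversary's loss is the negative log-likelihood of $z$ under the mixture whose parameters the network outputs. The following is a minimal NumPy sketch, assuming a one-dimensional gaussian mixture parametrized by unnormalized weights, means and log standard deviations; this parametrization and the name `mdn_nll` are illustrative choices, not taken from the talk.

```python
import numpy as np

def mdn_nll(z, logits, mu, log_sigma):
    """Mean negative log-likelihood of z under per-example 1D gaussian mixtures.

    z: shape (n,). logits, mu, log_sigma: shape (n, k), one row of mixture
    parameters per example, as an adversary network might output them.
    """
    # Log mixture weights via a numerically stable log-softmax.
    log_w = logits - logits.max(axis=1, keepdims=True)
    log_w = log_w - np.log(np.exp(log_w).sum(axis=1, keepdims=True))
    # Log density of every gaussian component evaluated at z.
    log_comp = (-0.5 * ((z[:, None] - mu) / np.exp(log_sigma)) ** 2
                - log_sigma - 0.5 * np.log(2.0 * np.pi))
    # Log-sum-exp over components gives log p(z | s).
    a = log_w + log_comp
    m = a.max(axis=1, keepdims=True)
    log_p = m[:, 0] + np.log(np.exp(a - m).sum(axis=1))
    return float(-log_p.mean())
```

Averaged over a minibatch, this quantity plays the role of $\mathcal{L}_r$ when $Z$ is continuous.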

Adversarial training
What if the classifier forces the adversary to perform worse by simultaneously maximizing $\mathcal{L}_r$? The classifier should then reduce its dependence on the nuisance parameter, shouldn't it?
Formally, let us consider the value function
$$E(\theta_f, \theta_r) = \mathcal{L}_f(\theta_f) - \mathcal{L}_r(\theta_f, \theta_r)$$
that we optimize by finding the minimax solution
$$\hat{\theta}_f, \hat{\theta}_r = \arg \min_{\theta_f} \max_{\theta_r} E(\theta_f, \theta_r).$$

Theoretical motivation
Proposition. If there exists a minimax solution $(\hat{\theta}_f, \hat{\theta}_r)$ such that $E(\hat{\theta}_f, \hat{\theta}_r) = H(Y \mid X) - H(Z)$, then $f(\cdot; \hat{\theta}_f)$ is both an optimal classifier and a pivotal quantity.
Proof (sketch):
$$\begin{aligned}
\min_{\theta_f} \max_{\theta_r} \; \mathcal{L}_f(\theta_f) - \mathcal{L}_r(\theta_f, \theta_r)
&= \min_{\theta_f} \; \mathcal{L}_f(\theta_f) - \mathbb{E}_{s \sim f(X; \theta_f)} \left[ H(Z \mid f(X; \theta_f) = s) \right] \\
&= \min_{\theta_f} \; \mathcal{L}_f(\theta_f) - H(Z \mid f(X; \theta_f)) \\
&\geq H(Y \mid X) - H(Z)
\end{aligned}$$
where the equality holds when
• $f$ is an optimal classifier (in which case $\mathcal{L}_f(\theta_f) = H(Y \mid X)$);
• $f(X; \theta_f)$ and $Z$ are independent random variables (in which case $H(Z \mid f(X; \theta_f)) = H(Z)$).

Alternating stochastic gradient descent
for t = 1 to T do
    for k = 1 to K do    ▷ Update r
        Sample a minibatch $\{x_m, z_m, s_m = f(x_m; \theta_f)\}_{m=1}^{M}$ of size $M$;
        With $\theta_f$ fixed, update $r$ by ascending its stochastic gradient
            $\nabla_{\theta_r} E(\theta_f, \theta_r) := \nabla_{\theta_r} \sum_{m=1}^{M} \log p_{\theta_r}(z_m \mid s_m)$;
    end for
    Sample a minibatch $\{x_m, y_m, z_m, s_m = f(x_m; \theta_f)\}_{m=1}^{M}$ of size $M$;    ▷ Update f
    With $\theta_r$ fixed, update $f$ by descending its stochastic gradient
        $\nabla_{\theta_f} E(\theta_f, \theta_r) := \nabla_{\theta_f} \sum_{m=1}^{M} \left[ -\log p_{\theta_f}(y_m \mid x_m) + \log p_{\theta_r}(z_m \mid s_m) \right]$,
    where $p_{\theta_f}(y_m \mid x_m)$ denotes $\mathbb{1}(y_m = 0)(1 - s_m) + \mathbb{1}(y_m = 1) s_m$;
end for
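The alternating scheme above can be sketched end to end on a deliberately tiny model. Everything specific here is an assumption made for illustration: a logistic-regression classifier, a binary nuisance parameter, a logistic adversary acting on the scalar output $s$, hand-written gradients, a trade-off weight `lam` on the adversary term, and the names `value_and_grads` and `train`. It is not the architecture used in the talk.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def value_and_grads(w, v, x, y, z, lam=1.0):
    """Return E = L_f - lam * L_r, dE/dw, and dL_r/dv for a tiny model.

    w = (w1, w2, b): classifier s = sigmoid(w1*x1 + w2*x2 + b).
    v = (a, c): adversary q = p(z = 1 | s) = sigmoid(a*s + c).
    """
    u = x @ w[:2] + w[2]
    s = sigmoid(u)
    t = v[0] * s + v[1]
    q = sigmoid(t)
    # Binary cross-entropies written in a numerically stable form.
    L_f = np.mean(np.logaddexp(0.0, u) - y * u)
    L_r = np.mean(np.logaddexp(0.0, t) - z * t)
    # Gradient of L_r with respect to the adversary parameters (a, c).
    g_r = np.array([np.mean((q - z) * s), np.mean(q - z)])
    # dE/du: classifier term minus lam times the adversary term via s.
    du = (s - y) - lam * (q - z) * v[0] * s * (1.0 - s)
    g_f = np.array([np.mean(du * x[:, 0]), np.mean(du * x[:, 1]), np.mean(du)])
    return L_f - lam * L_r, g_f, g_r

def train(x, y, z, lam=1.0, T=200, K=5, M=128, lr=0.1, seed=0):
    rng = np.random.default_rng(seed)
    w, v = np.zeros(3), np.zeros(2)
    for _ in range(T):
        for _ in range(K):  # update r with the classifier fixed
            i = rng.choice(len(x), size=M, replace=False)
            g_r = value_and_grads(w, v, x[i], y[i], z[i], lam)[2]
            v = v - lr * g_r  # descending L_r is ascending E in theta_r
        i = rng.choice(len(x), size=M, replace=False)
        g_f = value_and_grads(w, v, x[i], y[i], z[i], lam)[1]
        w = w - lr * g_f  # update f with the adversary fixed: descend E
    return w, v

# Hypothetical data: the binary nuisance shifts x2 for the class Y = 1 only.
rng = np.random.default_rng(0)
n = 4000
y = rng.integers(0, 2, size=n)
z = rng.integers(0, 2, size=n)
x = rng.normal(size=(n, 2))
x[:, 0] += y
x[:, 1] += 2.0 * y * z
w, v = train(x, y, z, lam=1.0)
```

Running `train` with `lam=0.0` and with larger values, then comparing the weight `w[1]` on the nuisance-sensitive feature, is one way to observe how the adversarial term reshapes the classifier in this toy setting.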

Accuracy versus robustness trade-off
• The assumption that there exists a classifier that is both optimal and pivotal may not hold.
• However, the value function $E$ can be rewritten as
$$E_\lambda(\theta_f, \theta_r) = \mathcal{L}_f(\theta_f) - \lambda \mathcal{L}_r(\theta_f, \theta_r)$$
where $\lambda$ is a hyper-parameter controlling the trade-off between the performance of $f$ and its independence with respect to the nuisance parameter.
Setting $\lambda$ to a large value forces $f$ to be pivotal. Setting $\lambda$ close to 0 constrains $f$ to be optimal.

Toy example
• Binary classification of 2D data drawn from multivariate gaussians with equal priors, such that
$$x \sim \mathcal{N}\left( (0, 0), \begin{bmatrix} 1 & -0.5 \\ -0.5 & 1 \end{bmatrix} \right) \quad \text{when } Y = 0,$$
$$x \sim \mathcal{N}\left( (1, 1 + Z), \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix} \right) \quad \text{when } Y = 1.$$
• The continuous nuisance parameter $Z$ represents in this case our uncertainty about the exact location of the mean of the second gaussian. We assume a gaussian prior $z \sim \mathcal{N}(0, 1)$.
• We assume training data $\{x_i, y_i, z_i\}_{i=1}^{N} \sim p(X, Y, Z)$.
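The generative model above can be sampled directly. The following is a minimal NumPy sketch; the function name `sample_toy`, its signature, and the option of fixing $z$ to a single value are illustrative assumptions rather than anything specified in the talk.

```python
import numpy as np

def sample_toy(n, rng, z=None):
    """Draw n examples (x, y, z) from the toy model.

    If z is None, one nuisance value per example is drawn from N(0, 1);
    otherwise the given scalar is used for every example.
    """
    y = rng.integers(0, 2, size=n)  # equal class priors
    z = rng.normal(size=n) if z is None else np.full(n, float(z))
    cov0 = np.array([[1.0, -0.5], [-0.5, 1.0]])  # covariance when Y = 0
    x0 = rng.multivariate_normal([0.0, 0.0], cov0, size=n)
    # Y = 1: identity covariance, mean (1, 1 + z) with z applied per example.
    x1 = rng.normal(size=(n, 2)) + np.column_stack([np.ones(n), 1.0 + z])
    x = np.where(y[:, None] == 1, x1, x0)
    return x, y, z

rng = np.random.default_rng(0)
x, y, z = sample_toy(20000, rng)
```

Calling `sample_toy(n, rng, z=1.0)` or `z=-1.0` produces data at a fixed nuisance value, which is convenient for comparing the distribution of a classifier's output across values of $z$.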

Standard training without the adversary r
[Figure. Left panel: $p(f(X))$ against $f(X)$, with curves for $p(f(X) \mid Z = -\sigma)$, $p(f(X) \mid Z = 0)$ and $p(f(X) \mid Z = +\sigma)$. Right panel: decision surface, with $\mu_0$ and $\mu_1 \mid Z = z$ marked for $Z = -\sigma$, $Z = 0$ and $Z = +\sigma$.]
(Left) The conditional probability distributions of $f(X; \theta_f) \mid Z = z$ change with $z$. (Right) The decision surface strongly depends on $X_2$.

Reshaping f with adversarial training
[Figure. Left panel: $p(f(X))$ against $f(X)$, with curves for $p(f(X) \mid Z = -\sigma)$, $p(f(X) \mid Z = 0)$ and $p(f(X) \mid Z = +\sigma)$. Right panel: decision surface, with $\mu_0$ and $\mu_1 \mid Z = z$ marked for $Z = -\sigma$, $Z = 0$ and $Z = +\sigma$.]
(Left) The conditional probability distributions of $f(X; \theta_f) \mid Z = z$ are now (almost) invariant with $z$! (Right) The decision surface is now independent of $X_2$.

Dynamics of adversarial training
[Figure: three panels showing $\mathcal{L}_f$, $\mathcal{L}_r$ and $\mathcal{L}_f - \lambda \mathcal{L}_r$ as functions of the training iteration $T$, for $T$ from 0 to 200.]

High energy physics example
• Discriminate between QCD jets ($Y = 0$) and W-jets ($Y = 1$) from high-level features (data from Baldi et al., arXiv:1603.09349).
• Taking some liberty, we consider an extreme categorical nuisance parameter where $Z = 0$ corresponds to events without pileup and $Z = 1$ corresponds to events with pileup, for which there are an average of 50 independent pileup interactions overlaid.

Maximizing significance by tuning λ
• Since we do not expect to find a classifier $f$ that is both optimal and pivotal, we optimize the accuracy-independence trade-off by tuning $\lambda$ with respect to a higher-level objective.
• Cut and count analysis: a natural higher-level context is a hypothesis test of a null hypothesis with no signal events against an alternative hypothesis that is a mixture of signal and background events. Background = 1000 weighted QCD jets, Signal = 100 weighted boosted W's.
Without systematics, optimizing $\mathcal{L}_f$ indirectly optimizes the power of a classical hypothesis test. With systematics, we need to balance classification performance against robustness to the nuisance parameter. To this end, we use the Approximate Median Significance (AMS) as the higher-level objective.
Note that since we are performing a hypothesis test of the null, we only wish to impose the pivotal property on background events.
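The slides do not spell out the AMS formula. The sketch below assumes the commonly used form $\mathrm{AMS} = \sqrt{2\left((s + b)\ln(1 + s/b) - s\right)}$, where $s$ and $b$ are the weighted numbers of signal and background events passing the cut; some variants add a regularization constant to $b$, which is omitted here. The function names are illustrative.

```python
import math

def ams(s, b):
    """Approximate Median Significance for s signal and b background events."""
    if s < 0 or b <= 0:
        raise ValueError("need s >= 0 and b > 0")
    # max(0, ...) guards against tiny negative values from rounding.
    return math.sqrt(2.0 * max(0.0, (s + b) * math.log(1.0 + s / b) - s))

def ams_at_threshold(scores, labels, weights, threshold):
    """AMS of the selection f(X) > threshold, from weighted event counts."""
    s = sum(w for f, y, w in zip(scores, labels, weights) if f > threshold and y == 1)
    b = sum(w for f, y, w in zip(scores, labels, weights) if f > threshold and y == 0)
    # With no background left after the cut the formula is undefined.
    return ams(s, b) if b > 0 else float("nan")
```

Scanning `threshold` over a grid for classifiers trained at several values of $\lambda$, and keeping the setting with the largest AMS, is one way to carry out the tuning described above.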

[Figure: AMS as a function of the threshold on $f(X)$, with curves for $\lambda = 0 \mid Z = 0$, $\lambda = 0$, $\lambda = 1$, $\lambda = 10$ and $\lambda = 500$.]
$\lambda = 0 \mid Z = 0$: standard training from $p(X, Y, Z = 0)$.
$\lambda = 0$: standard training from $p(X, Y, Z)$.
$\lambda = 10$: trading accuracy for robustness with respect to pileup results in a net benefit in terms of maximum statistical significance.

Summary
• We proposed a principled approach based on adversarial networks for building a model whose output can be constrained to be independent of a chosen nuisance parameter (or any random variable).
• The method supports both categorical and continuous nuisance parameters.
• The accuracy versus robustness trade-off can be tuned in order to maximize a higher-level objective.
• We are looking for opportunities for (real) physics use cases! Come talk to us if interested!
