Classification with a control channel
Don't cheat yourself!
Gilles Louppe (@glouppe), Tim Head (@betatim)
Disclaimer

The following applies only to the learning protocol of the Flavours of Physics Kaggle challenge. See the notebook for further details.
Flavours of Physics: Finding τ → µµµ challenge

Given a learning set L of
• simulated signal events (x, s),
• real data background events (x, b),
build a classifier ϕ : X → {s, b} for distinguishing τ → µµµ signal events from background events.
Control channel test

The simulation is not perfect: discriminative patterns exist between simulated and real data events.

To avoid exploiting simulation versus real data artefacts to classify signal from background events, we evaluate whether ϕ behaves differently on simulated signal and real data signal from a control channel C.

Here, the control channel test consists in requiring the Kolmogorov–Smirnov test statistic between {ϕ(x) | x ∈ C_sim} and {ϕ(x) | x ∈ C_data} to be strictly smaller than some pre-defined threshold t.
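As a minimal sketch, assuming a trained classifier clf with a predict_proba method and control channel arrays X_control_sim, X_control_data (named as in the code later in these slides), the test can be computed with scipy.stats.ks_2samp; the 0.09 default threshold matches the cut used in the random-exploration code below:

    from scipy.stats import ks_2samp

    def control_channel_test(clf, X_control_sim, X_control_data, threshold=0.09):
        # The distributions of classifier outputs on simulated and real
        # control channel events should be indistinguishable.
        p_sim = clf.predict_proba(X_control_sim)[:, 1]
        p_data = clf.predict_proba(X_control_data)[:, 1]
        ks, _ = ks_2samp(p_sim, p_data)  # two-sample KS statistic
        return ks < threshold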
Proposition

Assuming that
• control data can be distinguished from training data with high confidence,
• simulated features are more discriminative than they are in real data,
then ϕ might, even by chance alone, exploit simulation versus real data artefacts to classify signal from background events, while still passing the control channel test.

Therefore,
• the true performance of ϕ on real data may be significantly different (typically lower) from the performance estimated on simulated signal events versus real data background events;
• passing the KS test does not tell you anything about ϕ.
Toy example

Let us consider an artificial classification problem between signal and background events, along with some closely related control channel data C_sim and C_data.

Let us assume an input space defined on three input variables X1, X2, X3, as described next (a data-generating sketch follows the three descriptions).
X1 is irrelevant for distinguishing real data signal from real data background but, because of simulation imperfections, has discriminative power between simulated events and real data events.
X2 is discriminative between signal and background events.
X3 is discriminative between events from the original problem and the control channel, but has otherwise no discriminative power between signal and background events.
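As an illustration, here is a minimal sketch of how such toy data could be generated. The Gaussian shapes and shifts are assumptions for illustration, not the distributions from the original notebook; the array names match those used in the code on the next slides:

    import numpy as np

    rng = np.random.RandomState(42)
    n = 5000

    def make_events(signal, simulated, control):
        # X1: simulation artefact -- separates simulated from real events only.
        x1 = rng.normal(1.0 if simulated else 0.0, 1.0, n)
        # X2: genuine physics -- separates signal from background.
        x2 = rng.normal(1.0 if signal else 0.0, 1.0, n)
        # X3: separates the control channel from the original problem.
        x3 = rng.normal(1.0 if control else 0.0, 1.0, n)
        return np.column_stack([x1, x2, x3])

    # Original problem: simulated signal vs. real data background ...
    X_sim_signal = make_events(signal=True, simulated=True, control=False)
    X_data_background = make_events(signal=False, simulated=False, control=False)
    # ... and real data signal, used only for the honest evaluation.
    X_data_signal = make_events(signal=True, simulated=False, control=False)
    # Control channel: the same decay, in simulation and in real data.
    X_control_sim = make_events(signal=True, simulated=True, control=True)
    X_control_data = make_events(signal=True, simulated=False, control=True)

With this construction, a classifier can inflate its apparent performance by cutting on X1, and can shield itself from the KS test by treating control-like events (identified via X3) separately.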
Random exploration

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.metrics import roc_auc_score

def find_best_tree(X_train, y_train, X_test, y_test,
                   X_data, y_data, X_control_sim, X_control_data):
    # Randomly explore small trees and keep the one with the best test AUC
    # among those passing the control channel (KS) test.
    best_auc_test, best_auc_data = 0, 0
    best_ks = 0
    best_tree = None

    for seed in range(2000):
        clf = ExtraTreesClassifier(n_estimators=1, max_features=1,
                                   max_leaf_nodes=5, random_state=seed)
        clf.fit(X_train, y_train)

        auc_test = roc_auc_score(y_test, clf.predict_proba(X_test)[:, 1])
        auc_data = roc_auc_score(y_data, clf.predict_proba(X_data)[:, 1])
        # ks_statistic: two-sample KS statistic,
        # e.g. scipy.stats.ks_2samp(a, b)[0]
        ks = ks_statistic(clf.predict_proba(X_control_sim)[:, 1],
                          clf.predict_proba(X_control_data)[:, 1])

        if auc_test > best_auc_test and ks < 0.09:
            best_auc_test = auc_test
            best_auc_data = auc_data
            best_ks = ks
            best_tree = clf

    return best_auc_test, best_auc_data, best_ks, best_tree
Random exploration

auc_test, auc_data, ks, tree = find_best_tree(
    X_train, y_train, X_test, y_test,
    X_data, y_data, X_control_sim, X_control_data)

print("ROC AUC (simulated signal vs. data background) =", auc_test)
print("ROC AUC (data signal vs. data background) =", auc_data)
print("KS statistic =", ks)

>>> ROC AUC (simulated signal vs. data background) = 0.986357983199
>>> ROC AUC (data signal vs. data background) = 0.90973817
>>> KS statistic = 0.0578
What just happened?

By chance, we have found a classifier that
• has seemingly good test performance (AUC = 0.986 on simulated signal versus real data background); and
• passes the control channel test that we have defined.

This classifier appears to be exactly the one we were seeking. Wrong. The expected ROC AUC of 0.91 on real data signal versus real data background is significantly lower than our first estimate, indicating that something is still wrong.
ϕ exploits X1, i.e., simulation versus real data artefacts, to indirectly classify signal from background events, while still passing the control channel test thanks to its use of X3!
Winning the challenge

As in the challenge, simulation versus real data patterns may be hidden in several variables, making it impossible to detect the problem by eye by inspecting variables individually. However, a learning algorithm might still be able to exploit them, either by chance or on purpose.

Recipe for winning the challenge (sketched below):
1. learn to distinguish between training and control data,
2. build a classifier on training data, with all the freedom to exploit simulation artefacts,
3. assign random predictions to samples predicted as control data, otherwise predict using the classifier found in the previous step.
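A minimal sketch of this recipe, assuming arrays named as in the earlier code; the choice of random forests and the 0.5 cut-off are illustrative assumptions, not part of the original slides:

    import numpy as np
    from sklearn.ensemble import RandomForestClassifier

    # Step 1: learn to distinguish training data from control data.
    X_control = np.vstack([X_control_sim, X_control_data])
    X_disc = np.vstack([X_train, X_control])
    y_disc = np.concatenate([np.zeros(len(X_train)), np.ones(len(X_control))])
    is_control = RandomForestClassifier(n_estimators=100).fit(X_disc, y_disc)

    # Step 2: build a signal/background classifier on training data,
    # free to exploit simulation artefacts.
    clf = RandomForestClassifier(n_estimators=100).fit(X_train, y_train)

    rng = np.random.RandomState(0)

    def predict(X):
        # Step 3: random predictions on samples that look like control data,
        # real predictions everywhere else.
        p = clf.predict_proba(X)[:, 1]
        control_like = is_control.predict_proba(X)[:, 1] > 0.5
        p[control_like] = rng.uniform(size=control_like.sum())
        return p

Random predictions on control-like samples make the output distributions on C_sim and C_data nearly identical, so the KS statistic stays small regardless of what clf exploits elsewhere.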
A better protocol

If the differences between simulated and real data events are corrected, then the problem goes away. One way to do this is to learn a transformation (e.g., a reweighting) from simulation onto real data from the control channel, and then to learn on transformed simulated signal events versus real data background events (sketched below).
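One standard way to learn such a reweighting is classifier-based density-ratio estimation on the control channel. A minimal sketch, reusing the array names from the toy example; the classifier choice and the probability clipping are assumptions, not necessarily the method used in the original study:

    import numpy as np
    from sklearn.ensemble import GradientBoostingClassifier

    # Learn to separate simulated from real control channel events.
    X_cal = np.vstack([X_control_sim, X_control_data])
    y_cal = np.concatenate([np.zeros(len(X_control_sim)),
                            np.ones(len(X_control_data))])
    cal = GradientBoostingClassifier().fit(X_cal, y_cal)

    # Density-ratio weights p(data | x) / p(sim | x) for simulated signal,
    # so that reweighted simulation matches real data in the control channel.
    p = np.clip(cal.predict_proba(X_sim_signal)[:, 1], 1e-3, 1 - 1e-3)
    weights = p / (1 - p)

    # Train the final classifier on reweighted simulated signal
    # versus real data background.
    X_final = np.vstack([X_sim_signal, X_data_background])
    y_final = np.concatenate([np.ones(len(X_sim_signal)),
                              np.zeros(len(X_data_background))])
    w_final = np.concatenate([weights, np.ones(len(X_data_background))])
    clf = GradientBoostingClassifier().fit(X_final, y_final,
                                           sample_weight=w_final)

After reweighting, simulation artefacts such as X1 carry no residual weight imbalance for the final classifier to exploit, so the performance estimate transfers to real data.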