Higgs Machine Learning Challenge experience. A HEP pattern recognition challenge ?

David Rousseau, LAL-Orsay. 10th February 2015, CTD 2015, Berkeley.


  1. Higgs Machine Learning Challenge experience. A HEP pattern recognition challenge ? David Rousseau LAL-Orsay 10th February 2015 CTD 2015, Berkeley

  2. Outline q Machine Learning, Challenges … q The Higgs Machine Learning challenge q A HEP pattern recognition challenge ? David Rousseau HiggsML and tracking challenges CTD 2015 Berkeley 2

  3. Machine Learning and HEP
     - Neural nets were used somewhat in the 1990s (e.g. at LEP)
     - BDT (AdaBoost) invented in 1997
     - MVA techniques (= Machine Learning) were used extensively at D0/CDF in the 2000s (mostly BDT, but not only)
     - ATLAS/CMS were less eager to adopt MVA at the LHC start, for some good reasons:
       - Need to understand the input variables well first
       - Still a lot to gain by improving input variables
       - Systematics more difficult to evaluate
       - Collected luminosity was increasing fast
     - But a lot of work recently with MVA techniques:
       - Competition
       - Best use of available data
     - Meanwhile neural nets reappear in their “deep” incarnation (see Peter Sadowski’s talk this afternoon)

  4. Machine Learning in HEP (2)
     - However:
       - TMVA, within ROOT, has been instrumental in popularising MVA techniques within HEP
       - Most people use TMVA, and most of them use BDT in TMVA
       - Although getting a reasonable answer from TMVA is quick and easy, it takes time to really become an expert with e.g. BDT
       - People are focussing on the choice of input variables and the evaluation of systematics (which of course are excellent things to do)
     - Not much work on studying possibly better MVA techniques, for which you need the software and the know-how

  5. Challenge ? q Challenges have become in the last 10 years a common way of working for the machine learning community q Machine learning scientists are eager to test their algorithms on real life problems è more valuable(=publisheable) than artificial problems q Company or academics want to outsource a problem to machine learning scientist, but also geeks etc. The company sets up a challenge like: o Netflix : predict movie preference from past movie selection o Gesture recognition o Separating pictures of cats from pictures of dogs o NASA/JPL mapping dark matter through (simulated) galaxy distortion o … q Some companies makes a business from organising challenges: datascience.net, kaggle q A few recent examples now… David Rousseau HiggsML and tracking challenges CTD 2015 Berkeley 5

  6. Looking at People (2012-14)
     - Actions: Wave, Point, Clap
     - Interactions: Shake Hands, Hug, Fight
     - http://chalearn.org/

  7. Neural connectomics (2015)
     - http://chalearn.org/

  8. NEW: AutoML challenge (2015)
     - Fully automatic machine learning without ANY human intervention
     - http://codalab.org/AutoML
     - December 2014 – May 2015
     - $30,000 in prizes

  9. Why challenges work ?
     - Motivation of organizing contests: extreme value (courtesy: Lakhani 2014)
     - Experts are highly skilled and trained → more focused, performed solution, low variety
     - Open Innovation (OI) is suitable for a variety of nonconventional, surprising ideas that are « far » from traditional
     - Not just ML, but a general trend: expertise → high volatility
     - Source: Open Innovation, Olga Kokshagina 2015

  10. From domain to challenge and back (diagram)
      - Domain (e.g. HEP) problem → simplify → challenge problem
      - Domain experts solve the domain problem; the crowd solves the challenge problem
      - Challenge solution → reimport → domain solution

  11. Higgs Machine Learning Challenge

  12. … in a nutshell
      - Why not put some ATLAS simulated data on the web and ask data scientists to find the best machine learning algorithm to find the Higgs ?
        - Instead of HEP people browsing machine learning papers, coding or downloading possibly interesting algorithms, trying them and seeing whether they can work for our problems
      - Challenge for us: make a full ATLAS Higgs analysis simple for non-physicists, but not too simple, so that it remains useful
      - Also try to foster long-term collaborations between HEP and ML

  13. Committees q Organization committee: { ATLAS o David Rousseau : Atlas-LAL o Claire Adam-Bourdarios : Atlas-LAL (outreach, legal matter) o Glen Cowan : Atlas-RHUL (statistics) { Learning Machine o Balazs Kegl : Appstat-LAL o Cécile Germain : TAO-LRI o Isabelle Guyon : Chalearn (challenges organisation) q Advisory committee: o Andreas Hoecker : Atlas-CERN (PC,TMVA) o Joerg Stelzer : Atlas-CERN (TMVA) o Thorsten Wengler : Atlas-CERN (ATLAS management) o Marc Schoenauer : INRIA (french computer science institute) David Rousseau HiggsML and tracking challenges CTD 2015 Berkeley 13

  14. H → tautau: ATLAS-CONF-2013-108, 4.1 σ evidence (now superseded by paper arXiv:1501.049)

  15. How did it work ?
      - First idea in September 2012
      - Challenge ran from May to September 2014
      - People register on the Kaggle web site hosting the challenge, https://www.kaggle.com/c/higgs-boson (additional info on https://higgsml.lal.in2p3.fr)
      - Open to almost anyone:
        - Data scientists
        - HEP physicists
        - Students, geeks
        - Except LAL-Orsay employees (for legal reasons)
      - … download the training dataset (with labels) with 250k events
      - … train their own algorithm to optimise the significance (à la s/sqrt(b))
      - … download the test dataset (without labels) with 550k events
      - … upload their own classification
      - The site automatically calculates the significance. Public (100k events) and private (450k events) leaderboards update instantly.
      - Competition closes mid-September 2014. People are asked to provide their code and methods. The best three on the private leaderboard win 7k €, 4k € and 2k €.
      - The most interesting one gets the “HEP meets ML award”

      Funded by: Paris Saclay Center for Data Science, Google, INRIA
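
The "à la s/sqrt(b)" figure of merit mentioned on slide 15 can be illustrated with a minimal Python sketch. It assumes each event carries a truth label ("s" or "b") and a weight, as in the training dataset, and that the participant's classifier flags some events as signal-like. The function name and the toy numbers are hypothetical, and this is the naive formula only, not the score Kaggle actually computed.

```python
import math

def simple_significance(labels, weights, selected):
    """Naive s / sqrt(b) over the events the classifier selected.

    labels   : "s" or "b" per event (truth, training set only)
    weights  : event weights
    selected : True where the classifier calls the event signal-like
    """
    s = sum(w for lab, w, sel in zip(labels, weights, selected)
            if sel and lab == "s")
    b = sum(w for lab, w, sel in zip(labels, weights, selected)
            if sel and lab == "b")
    if b <= 0.0:
        raise ValueError("no selected background: s/sqrt(b) is undefined")
    return s / math.sqrt(b)

# Toy events (made-up numbers, not taken from the challenge data).
labels = ["s", "b", "b", "s", "b"]
weights = [2.0, 4.0, 5.0, 1.0, 12.0]
selected = [True, True, True, False, False]
z = simple_significance(labels, weights, selected)
```

The weighted sums matter here: counting events instead of summing weights would ignore the normalisation carried by the weight column.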

  16. Dataset
      - ASCII csv file, with mixture of Higgs to tautau signal and corresponding background, from official GEANT4 ATLAS simulation
      - Primitive 3-vectors allowing to compute the conf note variables (mass neglected), 16 independent variables:
        - PRI_tau_pt, PRI_tau_eta, PRI_tau_phi
        - PRI_lep_pt, PRI_lep_eta, PRI_lep_phi
        - PRI_met, PRI_met_phi, PRI_met_sumet
        - PRI_jet_num (0,1,2,3, capped at 3)
        - PRI_jet_leading_pt, PRI_jet_leading_eta, PRI_jet_leading_phi
        - PRI_jet_subleading_pt, PRI_jet_subleading_eta, PRI_jet_subleading_phi
        - PRI_jet_all_pt
      - Weight and signal/background (for training dataset only):
        - weight (fully normalised)
        - label: « s » or « b »
      - Conf note variables used for categorization or BDT:
        - DER_mass_MMC, DER_mass_transverse_met_lep, DER_mass_vis
        - DER_pt_h
        - DER_deltaeta_jet_jet, DER_mass_jet_jet, DER_prodeta_jet_jet
        - DER_deltar_tau_lep
        - DER_pt_tot, DER_sum_pt, DER_pt_ratio_lep_tau
        - DER_met_phi_centrality, DER_lep_eta_centrality
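
As an illustration of how derived quantities follow from the primitive 3-vectors with mass neglected, here is a hedged Python sketch. It assumes that DER_deltar_tau_lep is the usual ΔR separation between the tau and the lepton and that DER_mass_vis is the invariant mass of the tau-lepton pair; the slide does not spell out these definitions, and the function names are hypothetical.

```python
import math

def delta_phi(phi1, phi2):
    """Azimuthal difference wrapped into [-pi, pi)."""
    return (phi1 - phi2 + math.pi) % (2.0 * math.pi) - math.pi

def delta_r(eta1, phi1, eta2, phi2):
    """Angular separation sqrt(d_eta^2 + d_phi^2)."""
    return math.hypot(eta1 - eta2, delta_phi(phi1, phi2))

def massless_pair_mass(pt1, eta1, phi1, pt2, eta2, phi2):
    """Invariant mass of two massless particles from (pt, eta, phi).

    Uses m^2 = 2 pt1 pt2 (cosh(d_eta) - cos(d_phi)).
    """
    m2 = 2.0 * pt1 * pt2 * (math.cosh(eta1 - eta2) - math.cos(phi1 - phi2))
    return math.sqrt(max(m2, 0.0))  # guard against tiny negative rounding

# Toy back-to-back pair in the transverse plane (made-up values).
m_vis = massless_pair_mass(50.0, 0.0, 0.0, 50.0, 0.0, math.pi)
dr = delta_r(0.0, 0.0, 0.0, math.pi)
```

Wrapping the azimuthal difference matters for ΔR only; the cosine in the mass formula is periodic, so it needs no wrapping.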

  17. From domain to challenge and back (diagram, with timescales)
      - Domain (e.g. HEP) problem → simplify (18 months) → challenge problem
      - Domain experts solve the domain problem; the crowd solves the challenge problem (4 months)
      - Challenge solution → reimport (>2 years ?) → domain solution

  18. Real analysis vs challenge

      | # | Real analysis | Challenge |
      |---|---------------|-----------|
      | 1 | Systematics | No systematics |
      | 2 | 2 categories x n BDT score bins | No categories, one signal region |
      | 3 | Background estimated from data (embedded, anti tau, control region) and some MC | Straight use of ATLAS G4 MC |
      | 4 | Weights include all corrections. Some negative weights (tt) | Weights only include normalisation and pythia weight. Neg. weight events rejected. |
      | 5 | Potentially use any information from all 2012 data and MC events | Only use variables and events preselected by the real analysis |
      | 6 | Few variables fed in two BDT | All BDT variables + categorisation variables + primitives 3-vector |
      | 7 | Significance from complete fit with NP etc… | Significance from “regularised Asimov” |
      | 8 | MVA with TMVA BDT | MVA “no-limit” |

      Simpler, but not too simple!
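
The “regularised Asimov” significance in row 7 can be sketched as below. The slide does not give the formula; this sketch assumes the form used as the challenge metric (the approximate median significance), sqrt(2((s + b + b_reg) ln(1 + s/(b + b_reg)) - s)), with a regularisation term b_reg = 10. Treat both the formula and the default value as assumptions relative to the slides; the function name is hypothetical.

```python
import math

def regularised_asimov(s, b, b_reg=10.0):
    """Asimov-style significance with a constant added to the background.

    s, b  : weighted signal and background sums in the selected region
    b_reg : regularisation term (assumed value; damps the score when
            the selected region holds very little background)
    """
    bb = b + b_reg
    if s < 0.0 or bb <= 0.0:
        raise ValueError("need s >= 0 and b + b_reg > 0")
    # log1p keeps precision when s is much smaller than the background
    return math.sqrt(2.0 * ((s + bb) * math.log1p(s / bb) - s))

# Toy sums (made-up numbers).
z_reg = regularised_asimov(2.0, 9.0)
```

For s much smaller than b + b_reg the expression reduces to roughly s/sqrt(b + b_reg), which is why it is described as being "à la s/sqrt(b)".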

  19. Participation q Big success ! q 1785 teams (1942 people) have participated (participation=submission of at least one solution) o (6517 people have downloaded the data) o è most popular challenge on the Kaggle platform, ever (Amazon.com employee access challenge 1687 teams, Allstate Purchase Prediction Challenge 1567 teams) q 35772 solutions uploaded q 136 forum topics with 1100 posts David Rousseau HiggsML and tracking challenges CTD 2015 Berkeley 19

  20. Final leaderboard (figure; annotations)
      - Prizes: 7000$, 4000$, 2000$
      - Best physicist
      - HEP meets ML award: XGBoost authors, free trip to CERN
      - TMVA expert, with TMVA improvements

  21. = significance
