Software development in AppStat
1. Software development in AppStat
AppStat: Applied Statistics and Machine Learning (French: Apprentissage Automatique et Statistique Appliquée)
Balázs Kégl, Linear Accelerator Laboratory, CNRS/University of Paris Sud
Service Informatique, Nov 30, 2010

2. Overview
• Introduction: me, the team, collaborations
• Scientific projects → software
  • discriminative learning → boosting → multiboost.org
  • inference, Monte Carlo integration → adaptive MCMC → integration into ROOT (saved for next time)

3. Scientific path
• Hungary: 1989–94 M.Eng. Computer Science, BUTE; 1994–95 research assistant, BUTE
• Canada: 1995–99 Ph.D. Computer Science, Concordia U; 2000 postdoc, Queen's U; 2001–06 assistant professor, U of Montreal
• France: 2006– research scientist (CR1), CNRS / U Paris Sud
• Research interests: machine learning, pattern recognition, signal processing, applied statistics
• Applications: image and music processing, bioinformatics, software engineering, grid control, experimental physics

4. The team
• B. Kégl (team leader, 2006–): boosting, MCMC, Auger
• D. Benbouzid (Ph.D. student, 2010–): boosting
• R. Busa-Fekete (postdoc, 2008–): boosting, optimization, JEM EUSO
• R. Bardenet (Ph.D. student, 2009–): MCMC, optimization, Auger, SysBio
• F-D. Collin (software engineer, from 01/12/2010): multiboost.org, MCMC in ROOT, system integration
• D. Garcia (postdoc, from 01/01/2011): generative models, Auger / JEM EUSO, tutoring

5. Collaborations
[Diagram: collaboration map linking computer-science partners (Telecom ParisTech LTCI, LRI TAO, Hungarian Academy, LAL AppStat) and experimental-science partners (Auger, JEM EUSO; future: ILC, LSST, etc.) through shared topics: boosting, MCMC, optimization, ESBG reconstruction, hypothesis testing, triggering. Existing and future links are marked.]

6. Funding
• ANR "jeune chercheur" MetaModel: 2007–2010, 150K€
• ANR "COSINUS" Siminole: 2010–2014, 1043K€ (658K€ at LAL)
• MRM Grille Paris Sud: 2010–2012, 60K€ (31K€ at LAL)

7. Siminole within ANR COSINUS
• COSINUS = Conception and Simulation
  • Theme 1: simulation and supercomputing
  • Theme 2: conception and optimization
  • Theme 3: large-scale data storage and processing
• Siminole: principal theme is Theme 2; secondary theme is Theme 1

8. Siminole within ANR COSINUS
• Simulation: the third pillar of scientific discovery
• Improving simulation
  • algorithmic development inside the simulator
  • implementation on high-end computing devices
  • our approach: control the number of calls to the simulator

9. Siminole within ANR COSINUS
• Optimization: simulate from f(x), find argmax_x f(x)
• Inference: simulate from p(x | θ), find p(θ | x)
• Discriminative learning: simulate from p(x, θ), find θ = f(x)
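The "control the number of calls to the simulator" idea, applied to the optimization task above, can be sketched as a budgeted search. This is a minimal illustration, not the project's actual method; `simulate` is a hypothetical stand-in for an expensive black-box simulator.

```python
import random

def simulate(x):
    # Hypothetical expensive simulator returning f(x); here a toy
    # quadratic with its maximum at x = 2.
    return -(x - 2.0) ** 2

def random_search(budget=100, lo=-5.0, hi=5.0, seed=0):
    """Approximate argmax_x f(x) using at most `budget` simulator calls."""
    rng = random.Random(seed)
    best_x, best_f = None, float("-inf")
    for _ in range(budget):
        x = rng.uniform(lo, hi)
        f = simulate(x)          # one simulator call per iteration
        if f > best_f:
            best_x, best_f = x, f
    return best_x, best_f

x_star, f_star = random_search()
```

Smarter strategies (adaptive MCMC, surrogate models) spend the same call budget far more efficiently, which is the point of the project.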

10. Discriminative learning → boosting → multiboost.org
• Discriminative learning (classification): infer f(x): R^d → {1, ..., K} from a database D = {(x_1, y_1), ..., (x_n, y_n)}
• Boosting, AdaBoost: one of the state-of-the-art classification algorithms
• multiboost.org: our implementation
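The boosting idea mentioned above can be sketched in a few lines; this is a generic textbook AdaBoost with decision stumps, not the multiboost.org code, and all names here are illustrative.

```python
import numpy as np

def adaboost(X, y, n_rounds=20):
    """Minimal AdaBoost with decision stumps; X: (n, d), y in {-1, +1}."""
    n, d = X.shape
    w = np.full(n, 1.0 / n)            # example weights, uniform at start
    stumps = []                        # (feature, threshold, polarity, alpha)
    for _ in range(n_rounds):
        best = None
        # exhaustive search for the lowest weighted-error stump
        for j in range(d):
            for thr in np.unique(X[:, j]):
                for pol in (+1, -1):
                    pred = pol * np.where(X[:, j] >= thr, 1, -1)
                    err = w[pred != y].sum()
                    if best is None or err < best[0]:
                        best = (err, j, thr, pol)
        err, j, thr, pol = best
        err = min(max(err, 1e-12), 1 - 1e-12)        # numerical guard
        alpha = 0.5 * np.log((1 - err) / err)        # stump coefficient
        pred = pol * np.where(X[:, j] >= thr, 1, -1)
        w *= np.exp(-alpha * y * pred)               # boost misclassified points
        w /= w.sum()
        stumps.append((j, thr, pol, alpha))
    return stumps

def predict(stumps, X):
    """Sign of the weighted vote of the learned stumps."""
    f = np.zeros(len(X))
    for j, thr, pol, alpha in stumps:
        f += alpha * pol * np.where(X[:, j] >= thr, 1, -1)
    return np.where(f >= 0, 1, -1)
```

The strong classifier is a weighted vote of weak learners; re-weighting forces each new stump to focus on the points the current vote gets wrong.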

11. Machine learning at the crossroads
• Machine learning sits at the intersection of: artificial intelligence, probability theory, statistics, optimization, cognitive science, signal processing, neuroscience, information theory

12. Machine learning
• From a statistical point of view
  • non-parametric fitting, capacity/complexity control
  • large dimensionality
  • large data sets, computational issues
  • mostly classification (categorization, discrimination)

13. Discriminative learning
• observation vector: x ∈ R^d
• class label: y ∈ {−1, 1} (binary classification) or y ∈ {1, ..., K} (multi-class classification)
• classifier: g: R^d → {−1, 1}
• discriminant function: f: R^d → [−1, 1], with g(x) = 1 if f(x) ≥ 0 and g(x) = −1 if f(x) < 0
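The classifier/discriminant-function relationship above is just a sign rule; a minimal sketch (the discriminant `f` here is an arbitrary illustrative choice):

```python
def classify(f, x):
    """Turn a discriminant function f: R^d -> [-1, 1] into the classifier
    g(x) = 1 if f(x) >= 0, and -1 otherwise."""
    return 1 if f(x) >= 0 else -1

# Hypothetical discriminant: positive above the diagonal x_1 >= x_2
f = lambda x: x[0] - x[1]
```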

14. Discriminative learning
• Inductive learning
  • training sample: D_n = {(x_1, y_1), ..., (x_n, y_n)}
  • function set: F ⊆ {f: R^d → {−1, 1}}
  • learning algorithm: ALGO: (R^d × {−1, 1})^n → F, with ALGO(D_n) → f
  • goal: small generalization error P(f(X) ≠ Y)
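The generalization error P(f(X) ≠ Y) is unobservable, but it is routinely estimated by the misclassification rate on a held-out sample. A minimal sketch with an illustrative classifier and toy test set:

```python
def empirical_error(f, data):
    """Empirical estimate of P(f(X) != Y): fraction of held-out
    pairs (x, y) that f misclassifies."""
    return sum(1 for x, y in data if f(x) != y) / len(data)

# Hypothetical 1-D threshold classifier and a small labelled test set;
# the last point is deliberately on the "wrong" side of the threshold.
g = lambda x: 1 if x >= 0.0 else -1
test_set = [(-2.0, -1), (-1.0, -1), (1.0, 1), (2.0, -1)]
```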

15. [Figure: data for a two-class classification problem, plotted in the (x_1, x_2) plane]

16. [Figure: 2-D Gaussian fit for class 1]

17. [Figure: 2-D Gaussian fit for class 2]

18. Classification
• Terminology
  • conditional densities: p(x | Y = 1), p(x | Y = −1)
  • prior probabilities: p(Y = 1), p(Y = −1)
  • posterior probabilities: p(Y = 1 | x), p(Y = −1 | x)
• Bayes' theorem: p(Y = 1 | x) = p(x | Y = 1) p(Y = 1) / p(x) ∝ p(x | Y = 1) p(Y = 1)
• Decision: g(x) = 1 if p(x | Y = 1) p(Y = 1) / (p(x | Y = −1) p(Y = −1)) > 1, and g(x) = −1 otherwise
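The Gaussian-fit experiment on the surrounding slides follows directly from this decision rule: fit one Gaussian per class, then compare prior-weighted likelihoods. A minimal sketch on synthetic data (the slides' actual dataset is not reproduced here; all names are illustrative):

```python
import numpy as np

def fit_gaussian(X):
    """Maximum-likelihood Gaussian fit: sample mean and covariance."""
    return X.mean(axis=0), np.cov(X, rowvar=False)

def log_gauss(x, mu, cov):
    """Log-density of a multivariate Gaussian at x."""
    d = len(mu)
    diff = x - mu
    return -0.5 * (diff @ np.linalg.inv(cov) @ diff
                   + np.log(np.linalg.det(cov))
                   + d * np.log(2.0 * np.pi))

def bayes_classify(x, gauss_pos, gauss_neg, prior_pos=0.5):
    """g(x) = 1 iff p(x|Y=1) p(Y=1) > p(x|Y=-1) p(Y=-1),
    computed in log space for numerical stability."""
    lp = log_gauss(x, *gauss_pos) + np.log(prior_pos)
    ln = log_gauss(x, *gauss_neg) + np.log(1.0 - prior_pos)
    return 1 if lp > ln else -1

# Synthetic two-class data: two well-separated Gaussian clusters
rng = np.random.default_rng(0)
X_pos = rng.normal(loc=[0.0, 0.0], scale=1.0, size=(100, 2))
X_neg = rng.normal(loc=[5.0, 5.0], scale=1.0, size=(100, 2))
g_pos, g_neg = fit_gaussian(X_pos), fit_gaussian(X_neg)
```

With equal priors the rule reduces to a likelihood-ratio test; the decision boundary between two Gaussian fits is a quadratic curve, which is why it handles the 'Two Moons' data poorly on the later slides.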

19. [Figure: discriminant function obtained from the Gaussian fits]

20. [Figure: 'Two Moons' data for a two-class classification problem]

21. [Figure: 2-D Gaussian fit for class 1]

22. [Figure: 2-D Gaussian fit for class 2]

23. [Figure: discriminant function with Gaussian fits]

24. [Figure: 2-D Parzen fit for class 1, h = 0.12]

25. [Figure: 2-D Parzen fit for class 2, h = 0.12]

26. [Figure: discriminant function with Parzen fits, h = 0.12]

27. [Figure: 2-D Parzen fit for class 1, h = 0.02]

28. [Figure: 2-D Parzen fit for class 2, h = 0.02]

29. [Figure: discriminant function with Parzen fits, h = 0.02]

30. [Figure: 2-D Parzen fit for class 1, h = 3]

31. [Figure: 2-D Parzen fit for class 2, h = 3]

32. [Figure: discriminant function with Parzen fits, h = 3]

33. [Figure: training and test error rates for Parzen fits with different bandwidths; error rate (0.00–0.20) on the y-axis, bandwidth h (0–0.8) on the x-axis]
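The Parzen experiments on the preceding slides can be sketched as follows: estimate each class-conditional density with a Gaussian kernel of bandwidth h, classify by comparing them, and sweep h to reproduce the under/over-fitting trade-off. This is a synthetic stand-in with Gaussian clusters (the 'Two Moons' data is not reproduced), and all names are illustrative.

```python
import numpy as np

def parzen_density(x, X, h):
    """Parzen window estimate at x: average of isotropic Gaussian
    kernels of bandwidth h centred on the training points."""
    d = X.shape[1]
    sq_dists = ((X - x) ** 2).sum(axis=1)
    kernel = np.exp(-sq_dists / (2.0 * h * h)) / ((2.0 * np.pi * h * h) ** (d / 2.0))
    return kernel.mean()

def parzen_classify(x, X_pos, X_neg, h):
    """Compare the two class-conditional Parzen estimates (equal priors)."""
    return 1 if parzen_density(x, X_pos, h) >= parzen_density(x, X_neg, h) else -1

def error_rate(X_test, y_test, X_pos, X_neg, h):
    """Test error of the Parzen classifier at bandwidth h."""
    preds = np.array([parzen_classify(x, X_pos, X_neg, h) for x in X_test])
    return float((preds != y_test).mean())

# Toy data and a small bandwidth sweep over the slides' three regimes
rng = np.random.default_rng(1)
X_pos = rng.normal([0.0, 0.0], 1.0, size=(50, 2))
X_neg = rng.normal([4.0, 4.0], 1.0, size=(50, 2))
X_test = np.vstack([rng.normal([0.0, 0.0], 1.0, size=(25, 2)),
                    rng.normal([4.0, 4.0], 1.0, size=(25, 2))])
y_test = np.array([1] * 25 + [-1] * 25)
errors = {h: error_rate(X_test, y_test, X_pos, X_neg, h) for h in (0.02, 0.12, 3.0)}
```

Tiny h memorizes the training points (low training error, high test error); huge h smears the two densities together; an intermediate h minimizes the test error, as in the error-rate figure above.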

34. Non-parametric fitting
• Capacity control, regularization
  • trade-off between approximation error and estimation error
  • complexity grows with data size
  • no need to correctly guess the function class
