

  1. Introduction to Machine Learning, Part I. Michèle Sebag, TAO (Thème Apprentissage & Optimisation), http://tao.lri.fr/tiki-index.php, Sept 4th, 2012

  2. Overview: Examples; Introduction to Supervised Machine Learning; Decision trees; Empirical validation (performance indicators, estimating an indicator)

  3. Examples ◮ Vision ◮ Control ◮ Netflix ◮ Spam ◮ Playing Go ◮ Google. http://ai.stanford.edu/~ang/courses.html

  4. Reading cheques LeCun et al. 1990

  5. MNIST: the drosophila of ML. Classification

  6. Detecting faces

  7. The 2005-2012 Visual Object Challenges A. Zisserman, C. Williams, M. Everingham, L. v.d. Gool

  8. The supervised learning setting. Input: a set of pairs (x, y) ◮ An instance x ∈ R^D, e.g. a set of pixels ◮ A label y in {1, −1}, in {1, ..., K}, or in R

  9. The supervised learning setting. Input: a set of pairs (x, y) ◮ An instance x ∈ R^D, e.g. a set of pixels ◮ A label y in {1, −1}, in {1, ..., K}, or in R. Pattern recognition ◮ Classification: does the image contain the target concept? h : {Images} → {1, −1} ◮ Detection: does the pixel belong to the image of the target concept? h : {Pixels in an image} → {1, −1} ◮ Segmentation: find the contours of all instances of the target concept in the image

  10. The 2005 Darpa Challenge (Thrun, Burgard and Fox 2005). Autonomous vehicle Stanley; terrains

  11. The Darpa challenge and the AI agenda. What remains to be done (Thrun 2005) ◮ Reasoning 10% ◮ Dialogue 60% ◮ Perception 90%

  12. Robots (Ng, Russell, Veloso, Abbeel, Peters, Schaal, ...). Reinforcement learning, classification

  13. Robots (2), Toussaint et al. 2010: Bayesian Inference for Motion Control and Planning. [Figures: (a) factor graph modelling the variable interactions; (b) behaviour of the 39-DOF humanoid reaching a goal under balance and collision constraints]

  14. Go as an AI Challenge (Gelly & Wang 07; Teytaud et al. 2008-2011). Reinforcement Learning, Monte-Carlo Tree Search

  15. Energy policy. Claim: many problems can be phrased as optimization under uncertainty; the adversarial setting corresponds to a two-player game, the uniform setting to a single-player game. Application: management of energy stocks under uncertainty

  16. States and Decisions ◮ State: amount of stock (60 nuclear, 20 hydro.) ◮ Varying: price, weather (random draw or from an archive) ◮ Decision: release water from one reservoir to another ◮ Assessment: meet the demand, otherwise buy energy. [Diagram: reservoirs 1-4, nuclear plant, demand, price, lost water]

  17. Netflix Challenge 2007-2008 Collaborative Filtering

  18. Collaborative filtering. Input ◮ A set of n_u users, ca. 500,000 ◮ A set of n_m movies, ca. 18,000 ◮ An n_m × n_u matrix of (person, movie, rating); a very sparse matrix, about 1% filled. Output ◮ Filling the matrix!

  19. Collaborative filtering. Input ◮ A set of n_u users, ca. 500,000 ◮ A set of n_m movies, ca. 18,000 ◮ An n_m × n_u matrix of (person, movie, rating); a very sparse matrix, about 1% filled. Output ◮ Filling the matrix! Criterion ◮ (relative) mean square error ◮ ranking error
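
As a concrete reading of the mean-square-error criterion, here is a minimal NumPy sketch (hypothetical toy ratings, not Netflix data) that scores predictions only on the observed entries of the sparse matrix:

```python
import numpy as np

# Hypothetical toy data: observed (user, movie, rating) triplets of a sparse matrix.
observed = [(0, 1, 4.0), (0, 3, 2.0), (1, 1, 5.0), (2, 0, 3.0)]

# Hypothetical predictions, e.g. produced by a matrix-factorization model.
predicted = {(0, 1): 3.5, (0, 3): 2.5, (1, 1): 4.5, (2, 0): 3.0}

def rmse(observed, predicted):
    """Root mean square error, computed only on the observed entries."""
    errors = [(r - predicted[(u, m)]) ** 2 for (u, m, r) in observed]
    return np.sqrt(np.mean(errors))

print(rmse(observed, predicted))  # about 0.433 on this toy example
```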

  20. Spam − Phishing − Scam Classification, Outlier detection

  21. The power of big data ◮ Now-casting flu outbreaks ◮ Public relations >> Advertising

  22. McLuhan and Google. "We shape our tools and afterwards our tools shape us" (Marshall McLuhan, 1964). The first time ever a tool has been observed to modify human cognition that fast (Sparrow et al., Science 2011)

  23. Overview: Examples; Introduction to Supervised Machine Learning; Decision trees; Empirical validation (performance indicators, estimating an indicator)

  24. Where we are. [Diagram: from the World (natural and human-related phenomena, e.g. astronomical series, the Rosetta Stone) through data and principles to mathematical modelling and common sense. "You are here."]

  25. Where we are (2). [Same diagram, with scientific data as the example: World (natural and human-related phenomena), data / principles, mathematical modelling, common sense. "You are here."]

  26. Types of Machine Learning problems. From the WORLD to DATA to the USER ◮ Unsupervised learning: observations → understand → a code ◮ Supervised learning: observations + target → predict → classification/regression ◮ Reinforcement learning: observations + rewards → decide → a policy

  27. Data. Example ◮ row: example / case ◮ column: feature / variable / attribute ◮ one distinguished attribute: class / label. Instance space X ◮ Propositional: X ≡ R^d ◮ Structured: sequential, spatio-temporal, relational (e.g. amino acids)

  28. Supervised Learning, notations. Context: the World produces an instance x_i, the Oracle returns its label y_i; examples are drawn from P(x, y). INPUT: E = {(x_i, y_i), x_i ∈ X, y_i ∈ {0, 1}, i = 1 ... n}. HYPOTHESIS SPACE: H, with h : X → {0, 1}. LOSS FUNCTION: ℓ : Y × Y → R. OUTPUT: h* = argmax {score(h), h ∈ H}
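
To make the notation concrete, a minimal Python sketch (hypothetical toy data and a deliberately tiny finite hypothesis space, not from the slides) of the four ingredients: the sample E, the hypothesis space H, the loss ℓ, and the selection of h* = argmax score(h):

```python
import numpy as np

# E = {(x_i, y_i)}: hypothetical toy sample, x_i in R^2, y_i in {0, 1}
X = np.array([[0.2, 1.0], [0.8, 0.3], [0.5, 0.9], [0.9, 0.1]])
y = np.array([1, 0, 1, 0])

# H: a tiny finite hypothesis space, thresholds on one coordinate
H = [lambda x, d=d, t=t: int(x[d] > t)
     for d in (0, 1) for t in (0.3, 0.5, 0.7)]

# Loss function: 0/1 loss
def loss(y_true, y_pred):
    return int(y_true != y_pred)

# score(h): here, minus the empirical error; h* = argmax {score(h), h in H}
def score(h):
    return -np.mean([loss(yi, h(xi)) for xi, yi in zip(X, y)])

h_star = max(H, key=score)
print("best score over H:", score(h_star))
```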

  29. Classification and criteria. Generalization error: Err(h) = E[ℓ(y, h(x))] = ∫ ℓ(y, h(x)) dP(x, y). Empirical error: Err_e(h) = (1/n) Σ_{i=1}^{n} ℓ(y_i, h(x_i)). Structural risk bound: Err(h) < Err_e(h) + F(n, d(H)), where d(H) is the Vapnik-Chervonenkis dimension of H (see later)
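
A minimal sketch (hypothetical data and a fixed hypothesis h, 0/1 loss) contrasting the empirical error Err_e(h) computed on a sample with its estimate on held-out data, the usual proxy for the generalization error Err(h):

```python
import numpy as np

def empirical_error(h, X, y):
    """Err_e(h) = (1/n) * sum_i loss(y_i, h(x_i)), with 0/1 loss."""
    predictions = np.array([h(x) for x in X])
    return np.mean(predictions != y)

# Hypothetical data and hypothesis
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] + 0.3 * rng.normal(size=200) > 0).astype(int)
h = lambda x: int(x[0] > 0)

# Empirical error on one part of the sample; held-out part estimates Err(h)
X_train, y_train, X_test, y_test = X[:150], y[:150], X[150:], y[150:]
print("empirical error  :", empirical_error(h, X_train, y_train))
print("held-out estimate:", empirical_error(h, X_test, y_test))
```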

  30. The Bias-Variance Trade-off. Bias(H): error of the best hypothesis h* of H. Variance: variance of h_n as a function of the sample E. [Diagram: function space, hypothesis space H, target concept; bias = distance from the target concept to h*, variance = spread of the learned h around h*]

  31. The Bias-Variance Trade-off. Bias(H): error of the best hypothesis h* of H. Variance: variance of h_n as a function of the sample E. [Diagram: function space, hypothesis space H, target concept; bias and variance as above] Overfitting: [Plot: training error keeps decreasing while test error goes back up as the complexity of H increases]
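
A short sketch of the overfitting picture, assuming scikit-learn and hypothetical noisy data: as the complexity of H grows (here, the depth of a decision tree), the training error keeps shrinking while the test error eventually stops improving:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

# Hypothetical noisy binary classification data (20% label noise)
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 5))
y = ((X[:, 0] + X[:, 1] > 0) ^ (rng.random(400) < 0.2)).astype(int)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

# Increasing depth = increasing complexity of H
for depth in (1, 2, 4, 8, 16):
    h = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    train_err = 1 - h.score(X_tr, y_tr)
    test_err = 1 - h.score(X_te, y_te)
    print(f"depth={depth:2d}  train error={train_err:.2f}  test error={test_err:.2f}")
```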

  32. Key notions ◮ The main issue in supervised learning is overfitting. ◮ How to tackle overfitting: ◮ before learning, use a sound criterion (regularization) ◮ after learning, use cross-validation (case studies). Summary ◮ Learning is a search problem ◮ What is the search space? What are the navigation operators?

  33. Hypothesis Spaces: Logical Spaces. Concept ← ⟨Literal, Condition⟩ ◮ Conditions, e.g. [color = blue], [age < 18] ◮ A condition is a function f : X → {True, False} ◮ Find: a disjunction of conjunctions of conditions ◮ Example: (unions of) rectangles of the 2D plane X

  34. Hypothesis Spaces: Numerical Spaces. Concept ≡ (h(x) > 0) ◮ h(x): polynomial, neural network, ... ◮ h : X → R ◮ Find: the (structure and) parameters of h

  35. Hypothesis Space H. Logical space ◮ h covers an example x iff h(x) = True ◮ H is structured by a partial order relation: h ≺ h′ iff ∀x, h(x) → h′(x). Numerical space ◮ h(x) is a real value (more or less far from 0) ◮ we can define ℓ(h(x), y) ◮ H is structured by a partial order relation: h ≺ h′ iff E[ℓ(h(x), y)] < E[ℓ(h′(x), y)]

  36. Hypothesis Space H / Navigation: for each family, the space H and its navigation operators ◮ Version Space: logical space, specialization / generalization ◮ Decision Trees: logical space, specialization ◮ Neural Networks: numerical space, gradient ◮ Support Vector Machines: numerical space, quadratic optimization ◮ Ensemble Methods: adaptation of the sample E. This course ◮ Decision Trees ◮ Support Vector Machines ◮ Ensemble methods

  37. Overview: Examples; Introduction to Supervised Machine Learning; Decision trees; Empirical validation (performance indicators, estimating an indicator)

  38. Decision Trees: C4.5 (Quinlan 86) ◮ Among the most widely used algorithms ◮ Easy to understand, to implement, to use ◮ and cheap in CPU time ◮ Implementations: J48 (Weka), scikit-learn. [Example tree: root test Age < 55 vs >= 55; further tests on Smoker, Diabetes, Sport, Tension; leaves labelled RISK / PATH. / NORMAL]
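
The slide lists scikit-learn among the available implementations; here is a minimal usage sketch (hypothetical toy data loosely inspired by the Age/Smoker example in the figure, not the slide's actual data) that fits a tree with the entropy criterion and prints its tests:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Hypothetical toy data: columns = age, smoker (0/1), does_sport (0/1); label 1 = at risk
X = np.array([[62, 1, 0], [45, 0, 1], [58, 0, 1], [70, 1, 0],
              [30, 1, 0], [52, 1, 1], [66, 0, 0], [41, 0, 0]])
y = np.array([1, 0, 0, 1, 0, 1, 1, 0])

# Fit a shallow tree and display the learned tests, leaf by leaf
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3).fit(X, y)
print(export_text(tree, feature_names=["age", "smoker", "does_sport"]))
```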

  39. Decision Trees

  40. Decision Trees (2). Procedure DecisionTree(E)
      1. Assume E = {(x_i, y_i), i = 1 ... n}, x_i ∈ R^D, y_i ∈ {0, 1}
         • If E is single-class (i.e. ∀ i, j ∈ [1, n], y_i = y_j), return
         • If n is too small (i.e. < threshold), return
         • Else, find the most informative attribute att
      2. For all values val of att
         • Set E_val = E ∩ [att = val]
         • Call DecisionTree(E_val)
      Criterion (quantity of information), with p = Pr(Class = 1 | att = val):
         I([att = val]) = −p log p − (1 − p) log(1 − p)
         I(att) = Σ_i Pr(att = val_i) · I([att = val_i])
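
A runnable Python sketch of the procedure above, assuming categorical attributes stored as dicts and labels in {0, 1}; the attribute choice minimizes I(att) as defined on the slide (equivalently, maximizes the information gain):

```python
import math
from collections import Counter

def entropy(labels):
    """I = -p log p - (1-p) log(1-p), with p = Pr(class = 1)."""
    p = sum(labels) / len(labels)
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def best_attribute(E, attributes):
    """Attribute minimizing I(att) = sum_val Pr(att = val) * I([att = val])."""
    def I(att):
        n = len(E)
        groups = Counter(x[att] for x, _ in E)
        return sum(c / n * entropy([y for x, y in E if x[att] == val])
                   for val, c in groups.items())
    return min(attributes, key=I)

def decision_tree(E, attributes, min_size=2):
    labels = [y for _, y in E]
    # Stop if single-class, too few examples, or no attribute left: majority leaf
    if len(set(labels)) == 1 or len(E) < min_size or not attributes:
        return Counter(labels).most_common(1)[0][0]
    att = best_attribute(E, attributes)
    values = set(x[att] for x, _ in E)
    # One subtree per value of the chosen attribute (E_val = E ∩ [att = val])
    return (att, {val: decision_tree([(x, y) for x, y in E if x[att] == val],
                                     [a for a in attributes if a != att], min_size)
                  for val in values})

# Hypothetical toy sample: x is a dict of categorical attributes, y in {0, 1}
E = [({"smoker": "yes", "sport": "no"}, 1), ({"smoker": "no", "sport": "yes"}, 0),
     ({"smoker": "yes", "sport": "yes"}, 1), ({"smoker": "no", "sport": "no"}, 0)]
print(decision_tree(E, ["smoker", "sport"]))  # ('smoker', {'yes': 1, 'no': 0})
```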

  41. Decision Trees (3). Contingency table and Quantity of Information (QI). [Plot: QI as a function of p]
      Computation:
         value      p(value)   p(poor | value)   QI(value)   p(value) × QI(value)
         [0,10[     0.051      0.999             0.00924     0.000474
         [10,20[    0.25       0.938             0.232       0.0570323
         [20,30[    0.26       0.732             0.581       0.153715

  42. Decision Trees (4) Limitations ◮ XOR-like attributes ◮ Attributes with many values ◮ Numerical attributes ◮ Overfitting

  43. Limitations. Numerical attributes ◮ Order the values val_1 < ... < val_t ◮ Compute QI([att < val_i]) ◮ QI(att) = max_i QI([att < val_i]). The XOR case: bias the distribution of the examples
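
A minimal sketch (hypothetical values and labels; QI taken here as the information gain of the binary split) of the threshold search: order the values, evaluate QI([att < val_i]) at each candidate cut, keep the maximum:

```python
import math

def entropy(labels):
    p = sum(labels) / len(labels)
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def split_quality(values, labels, threshold):
    """QI([att < threshold]): information gain of the binary split."""
    left = [y for v, y in zip(values, labels) if v < threshold]
    right = [y for v, y in zip(values, labels) if v >= threshold]
    if not left or not right:
        return 0.0
    n = len(labels)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - weighted

def best_threshold(values, labels):
    """Order the values and keep the cut maximizing QI([att < val_i])."""
    candidates = sorted(set(values))
    return max(candidates, key=lambda t: split_quality(values, labels, t))

# Hypothetical numerical attribute (e.g. age) and 0/1 labels
ages   = [23, 31, 45, 52, 58, 63, 70]
labels = [0,  0,  0,  1,  1,  1,  1]
print(best_threshold(ages, labels))  # 52: this cut separates the two classes exactly
```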

  44. Complexity ◮ Quantity of information of an attribute: n ln n ◮ Adding a node: D × n ln n

  45. Tackling Overfitting ◮ Penalize the selection of an already used variable: limits the tree depth ◮ Do not split subsets below a given minimal size: limits the tree depth ◮ Pruning: each leaf corresponds to one conjunction; generalize by pruning literals; greedy optimization with the QI criterion
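
A hedged scikit-learn sketch (hypothetical noisy data) of two of these remedies: refusing to split small subsets (min_samples_split) and pruning the grown tree, here via scikit-learn's cost-complexity pruning (ccp_alpha) rather than the literal-pruning scheme described above:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

# Hypothetical noisy data (25% label noise)
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = ((X[:, 0] > 0) ^ (rng.random(300) < 0.25)).astype(int)

# Compare an unconstrained tree with two overfitting controls
settings = {
    "unconstrained": DecisionTreeClassifier(random_state=0),
    "min split size": DecisionTreeClassifier(min_samples_split=20, random_state=0),
    "pruned (ccp)":   DecisionTreeClassifier(ccp_alpha=0.01, random_state=0),
}
for name, tree in settings.items():
    acc = cross_val_score(tree, X, y, cv=5).mean()
    print(f"{name:15s} cross-validated accuracy = {acc:.2f}")
```

Cross-validation is used here precisely as the "after learning" check announced on slide 32: the constrained trees typically generalize at least as well as the unconstrained one on noisy data.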

  46. Decision Trees, Summary. Still around after all these years ◮ Robust against noise and irrelevant attributes ◮ Good results, both in quality and complexity ◮ Random Forests (Breiman 00)
