
Introduction to Machine Learning
Eric Medvet, 16/3/2017

Outline:
◮ Machine Learning: what and why?
◮ Motivating example
◮ Tree-based methods
  ◮ Regression trees
  ◮ Trees aggregation

Teachers: Eric Medvet, Dipartimento di Ingegneria e Architettura


Iris: visual interpretation
Simplification: forget the petal variables and I. virginica → 2 variables, 2 species (binary classification problem).
◮ Problem: given any new observation, we want to automatically assign the species.
◮ Sketch of a possible solution:
  1. learn a model (classifier)
  2. “use” the model on new observations
[Figure: scatter plot of sepal width vs. sepal length for I. setosa and I. versicolor]
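
A minimal sketch of this simplified setting using scikit-learn (the choice of a tree classifier and of the train/test split is arbitrary; the library calls are real):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data[:, :2], iris.target   # keep only sepal length and sepal width
X, y = X[y != 2], y[y != 2]            # drop I. virginica: binary problem

# 1. learn a model (classifier) on some observations...
X_train, X_new, y_train, y_new = train_test_split(X, y, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)
# 2. ...and "use" the model on new observations
print(model.predict(X_new[:3]))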

“A” model?
There could be many possible models:
◮ how to choose?
◮ how to compare?

Choosing the model
The choice of the model/tool/algorithm to be used is determined by many factors:
◮ problem size (n and p)
◮ availability of an output variable (y)
◮ computational effort (when learning or “using”)
◮ explicability of the model
◮ ...
We will see many options.

Comparing many models
Experimentally: does the model work well on (new) data? We must define “works well”:
◮ a single performance index?
◮ how to measure it?
◮ repeatability/reproducibility...
We will see/discuss many options.

It does not work well... Why?
◮ the data is not informative
◮ the data is not representative
◮ the data has changed
◮ the data is too noisy
We will see/discuss these issues.

ML is not magic
Problem: find birth town from height/weight.
[Figure: scatter plot of height [cm] vs. weight [kg], with points labeled Trieste or Udine]
Q: which is the data issue here?

Implementation
When “solving” a problem, we usually need to:
◮ explore/visualize the data
◮ apply one or more learning algorithms
◮ assess the learned models
“By hand?” No, with software!

ML/DM software
Many options:
◮ libraries for general-purpose languages:
  ◮ Java: e.g., http://haifengl.github.io/smile/
  ◮ Python: e.g., http://scikit-learn.org/stable/
  ◮ ...
◮ specialized software environments:
  ◮ Octave: https://en.wikipedia.org/wiki/GNU_Octave
  ◮ R: https://en.wikipedia.org/wiki/R_(programming_language)
◮ from scratch

ML/DM software: which one?
◮ production/prototype
◮ platform constraints
◮ degree of (data) customization
◮ documentation availability/community size
◮ ...
◮ previous knowledge/skills

Section 2
Tree-based methods

The carousel robot attendant
Problem: replace the carousel attendant with a robot which automatically decides who can ride the carousel.

Carousel: data
We observed the human attendant’s decisions. How can the robot take the decision?
◮ if younger than 10 → can’t!
◮ otherwise:
  ◮ if shorter than 120 → can’t!
  ◮ otherwise → can!
A decision tree!
[Figure: scatter plot of height h [cm] vs. age a [year], points labeled “Cannot ride”/“Can ride”, next to the tree: a < 10 (T/F), then h < 120 (T/F)]
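
This tree is small enough to transcribe directly; a minimal Python sketch (the thresholds are the ones on the slide, the function name is mine):

def can_ride(age, height):
    # the carousel decision tree: first a < 10, then h < 120
    if age < 10:            # younger than 10 -> can't
        return False
    return height >= 120    # shorter than 120 -> can't; otherwise -> can

print(can_ride(age=12, height=150))  # True
print(can_ride(age=8, height=150))   # False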

How to build a decision tree
Divide and conquer (divide et impera), recursively:
◮ find a cut variable and a cut value
◮ for the left branch, divide and conquer
◮ for the right branch, divide and conquer

How to build a decision tree: detail
Recursive binary splitting:

function BuildDecisionTree(X, y)
  if ShouldStop(y) then
    ŷ ← most common class in y
    return new terminal node with ŷ
  else
    (i, t) ← BestBranch(X, y)
    n ← new branch node with (i, t)
    append child BuildDecisionTree(X|x_i<t, y|x_i<t) to n
    append child BuildDecisionTree(X|x_i≥t, y|x_i≥t) to n
    return n
  end if
end function

◮ Recursive binary splitting
◮ Top-down (start from the “big” problem)

Best branch

function BestBranch(X, y)
  (i⋆, t⋆) ← argmin_{i,t} E(y|x_i≥t) + E(y|x_i<t)
  return (i⋆, t⋆)
end function

Classification error on a subset:
E(y) = |{y ∈ y : y ≠ ŷ}| / |y|, where ŷ = the most common class in y

◮ Greedy (chooses the split that minimizes the error now, not in later steps)

Best branch
(i⋆, t⋆) ← argmin_{i,t} E(y|x_i≥t) + E(y|x_i<t)
The formula says what is done, not how it is done!
Q: can different “hows” differ? How?

Stopping criterion

function ShouldStop(y)
  if y contains only one class then
    return true
  else if |y| < k_min then
    return true
  else
    return false
  end if
end function

Another possible criterion:
◮ tree depth larger than d_max
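
Putting the three pseudocode functions together, a minimal runnable Python sketch (the tuple-based node representation, the default k_min, and the function names are my choices, not part of the slides):

import numpy as np
from collections import Counter

def error(y):
    # classification error E(y): fraction of elements differing from the most common class
    return 1.0 - Counter(y).most_common(1)[0][1] / len(y)

def should_stop(y, k_min):
    return len(set(y)) == 1 or len(y) < k_min

def best_branch(X, y):
    # greedy exhaustive search for the cut (i*, t*) minimizing the total error
    best, best_err = None, float("inf")
    for i in range(X.shape[1]):
        for t in np.unique(X[:, i])[1:]:  # skip the minimum so both sides are non-empty
            err = error(y[X[:, i] < t]) + error(y[X[:, i] >= t])
            if err < best_err:
                best, best_err = (i, t), err
    return best

def build_decision_tree(X, y, k_min=5):
    branch = None if should_stop(y, k_min) else best_branch(X, y)
    if branch is None:                            # terminal node: most common class
        return Counter(y).most_common(1)[0][0]
    i, t = branch
    lo = X[:, i] < t
    return (i, t,
            build_decision_tree(X[lo], y[lo], k_min),    # left:  x_i < t
            build_decision_tree(X[~lo], y[~lo], k_min))  # right: x_i >= t

def predict_one(node, x):
    # walk from the root down to a terminal node
    while isinstance(node, tuple):
        i, t, left, right = node
        node = left if x[i] < t else right
    return node

# usage: tree = build_decision_tree(X, y); predict_one(tree, x_new)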

Categorical independent variables
◮ Trees can work with categorical variables
◮ The branch node becomes x_i = c or x_i ∈ C′ ⊂ C (c is one of the variable’s categories)
◮ Categorical and numeric variables can be mixed
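
For instance, a categorical branch node tests set membership instead of a threshold; a hypothetical sketch (the function and the encoding of the cut are mine):

def branch_goes_left(x, i, cut):
    # route an observation left at a branch node (i, cut);
    # a frozenset cut encodes the categorical test x_i ∈ C' ⊂ C
    if isinstance(cut, frozenset):
        return x[i] in cut
    return x[i] < cut  # numeric test x_i < t

print(branch_goes_left(["red", 4.2], 0, frozenset({"red", "blue"})))  # True
print(branch_goes_left(["red", 4.2], 1, 5.0))                         # True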

Stopping criterion: the role of k_min
Suppose k_min = 1 (never stop because of the size of y).
[Figure: the carousel data with the many splits of the resulting tree: h < 120, a < 10, a < 9.0, a < 9.1, a < 9.4, a < 9.6]
Q: what’s wrong?

Tree complexity
When the tree is “too complex”:
◮ it is less readable/understandable/explicable
◮ maybe there was noise in the data
Q: what’s the noise in the carousel data?
The tree complexity issue is not related (only) to k_min.

Tree complexity: another interpretation
◮ maybe there was noise in the data
The tree fits the learning data too much:
◮ it overfits (overfitting)
◮ it does not generalize (high variance: the model varies if the learning data varies)

High variance
“The model varies if the learning data varies”: what? why does the data vary?
◮ the learning data is about the system/phenomenon/nature S
  ◮ a collection of observations of S
  ◮ a point of view on S
◮ learning is about understanding/knowing/explaining S
◮ if I change the point of view on S, my knowledge about S should remain the same!

Fighting overfitting
◮ large k_min (large w.r.t. what?)
◮ when building, limit the depth
◮ when building, don’t split if the overall impurity decrease is low
◮ after building, prune
(bias and variance will be detailed later)
The corresponding scikit-learn knobs are sketched below.
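
A hedged sketch of those knobs as scikit-learn parameters (the values are arbitrary examples, not recommendations):

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    min_samples_split=10,        # roughly the role of k_min
    max_depth=5,                 # when building, limit the depth
    min_impurity_decrease=0.01,  # don't split on a low overall impurity decrease
    ccp_alpha=0.005,             # after building, cost-complexity pruning
)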

Evaluation: k-fold cross-validation
How can we estimate the predictor’s performance on new (unavailable) data?
1. split the learning data (X and y) into k equal slices (each of n/k observations)
2. for each slice (i.e., for each i ∈ {1, ..., k}):
  2.1 learn on all but the i-th slice
  2.2 compute the classification error on the unseen i-th slice
3. average the k classification errors
In essence:
◮ can the learner generalize on the available data?
◮ how will the learned artifact behave on unseen data?

Evaluation: k-fold cross-validation
[Diagram: k = 5 foldings, each producing one of accuracy_1, ..., accuracy_5]
accuracy = (1/k) · Σ_{i=1}^{k} accuracy_i
Or with the classification error rate, or any other meaningful (effectiveness) measure.
Q: how should the data be split?
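
A minimal k-fold cross-validation sketch with scikit-learn (k = 5 here; shuffling before splitting is one common answer to the question above):

from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=cv)
print(scores, scores.mean())  # the k accuracies and their average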

Subsection 1
Regression trees

Regression with trees
Trees can be used for regression, instead of classification: decision tree vs. regression tree.

Tree building: decision → regression
Recall the BuildDecisionTree(X, y) pseudocode above.
Q: what should we change?

Tree building: decision → regression

function BuildDecisionTree(X, y)
  if ShouldStop(y) then
    ŷ ← ȳ    ⊲ the mean of y, instead of the most common class
    return new terminal node with ŷ
  else
    (i, t) ← BestBranch(X, y)
    n ← new branch node with (i, t)
    append child BuildDecisionTree(X|x_i<t, y|x_i<t) to n
    append child BuildDecisionTree(X|x_i≥t, y|x_i≥t) to n
    return n
  end if
end function

A: the terminal node predicts the mean ȳ (and, accordingly, BestBranch minimizes a regression error, e.g., the sum of squared errors, instead of the classification error).
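
In scikit-learn the switch is just a different class: leaves predict the mean of their y values and splits minimize the squared error. A minimal sketch (the toy data is made up for illustration):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.linspace(0, 30, 200).reshape(-1, 1)             # 1-D input
y = (X.ravel() > 15) * 3.0 + rng.normal(0, 0.3, 200)   # noisy step

model = DecisionTreeRegressor(max_depth=2).fit(X, y)
print(model.predict([[5.0], [25.0]]))  # roughly [0, 3]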

Interpretation
[Figure: the piecewise-constant prediction of a regression tree over a 1-D input in [0, 30]]

Regression and overfitting
[Image from F. Daolio]

Trees in summary
Pros:
✓ easily interpretable/explicable
✓ learning and regression/classification easily understandable
✓ can handle both numeric and categorical values
Cons:
✗ not so accurate (Q: always?)

Tree accuracy?
[Image from An Introduction to Statistical Learning]

Subsection 2
Trees aggregation

Weakness of the tree
[Figure: noisy 1-D data whose right part follows a curve]
Small tree:
◮ low complexity
◮ will hardly fit the “curve” part
◮ high bias, low variance
Big tree:
◮ high complexity
◮ may overfit the noise on the right part
◮ low bias, high variance

The trees view
Small tree: “a car is something that moves”.
Big tree: “a car is a made-in-Germany blue object with 4 wheels, 2 doors, chromed fenders, and a curved rear enclosing the engine”.

Big tree view
A big tree:
◮ has a detailed view of the learning data (high complexity)
◮ “trusts too much” the learning data (high variance)
What if we “combine” different big tree views and ignore the details on which they disagree?

Wisdom of the crowds
What if we “combine” different big tree views and ignore the details on which they disagree?
◮ many views
◮ independent views
◮ aggregation of views
≈ the wisdom of the crowds: a collective opinion may be better than a single expert’s opinion.

Wisdom of the trees
◮ many views → just use many trees
◮ independent views → ??? (learning is deterministic: same data ⇒ same tree ⇒ same view)
◮ aggregation of views → just average the predictions (regression) or take the most common prediction (classification)

Independent views
Independent views ≡ different points of view ≡ different learning data.
But we have only one learning data set!

Independent views: idea!
As in cross-validation, consider only a part of the data, but:
◮ instead of a subset,
◮ a sample with repetitions.

X   = (x_1^T x_2^T x_3^T x_4^T x_5^T)   original learning data
X_1 = (x_1^T x_5^T x_3^T x_2^T x_5^T)   sample 1
X_2 = (x_4^T x_2^T x_3^T x_1^T x_1^T)   sample 2
X_i = ...                               sample i

◮ (y omitted for brevity)
◮ the learning data size is not a limitation (differently than with a subset)
This is bagging of trees (bootstrap, more in general).
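
A bootstrap sample is one line of NumPy; a minimal sketch:

import numpy as np

rng = np.random.default_rng(0)
X = np.arange(10).reshape(5, 2)             # 5 observations, p = 2

idx = rng.integers(0, len(X), size=len(X))  # draw row indices WITH repetition
X_b = X[idx]                                # one bootstrap sample (use y[idx] likewise)
print(idx)                                  # some rows repeated, some missing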

Tree bagging
When learning:
1. repeat B times:
  1.1 take a sample of the learning data
  1.2 learn a tree (unpruned)
When predicting:
1. get a prediction from each of the B learned trees
2. predict the average (or the most common) prediction
For classification, other aggregations can be used: majority voting (most common) is the simplest.
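
With scikit-learn, bagging of unpruned trees takes a few lines; a minimal sketch (B = 100 is arbitrary):

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# B = 100 bootstrap samples, one unpruned tree each; predictions are
# aggregated by majority voting.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                        bootstrap=True, random_state=0).fit(X, y)
print(bag.predict(X[:3]))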

How many trees?
B is a parameter:
◮ when there is a parameter, there is the problem of finding a good value
◮ remember k_min and depth (Q: impact on?)
◮ it has been shown (experimentally) that:
  ◮ for “large” B, bagging is better than a single tree
  ◮ increasing B does not cause overfitting
  ◮ (for us: the default B is OK! “large” ≈ hundreds)
Q: how much better? at which cost?

Bagging
[Figure: test error (×10⁻²) vs. the number B of trees, for B from 0 to 500]

Independent views: improvement
Despite being learned on different samples, bagged trees may be correlated, hence the views are not very independent:
◮ e.g., when one variable is much more important than the others for predicting (a strong predictor)
Idea: force point-of-view differentiation by “hiding” variables.

Random forest
When learning:
1. repeat B times:
  1.1 take a sample of the learning data
  1.2 consider only m out of the p independent variables
  1.3 learn a tree (unpruned)
When predicting:
1. get a prediction from each of the B learned trees
2. predict the average (or the most common) prediction
◮ (observations and) variables are randomly chosen...
◮ ...to learn a forest of trees
Q: are the missing variables a problem?
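
A scikit-learn sketch; note that RandomForestClassifier redraws the m variables at every split rather than once per tree, but max_features plays exactly the role of m:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
# B = 500 trees; max_features="sqrt" gives m = sqrt(p), the usual choice
# for classification (for regression, m = p/3 is the common choice).
forest = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                                random_state=0).fit(X, y)
print(forest.predict(X[:3]))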

Random forest: the parameter m
How to choose the value of m?
◮ m = p → bagging
◮ it has been shown (experimentally) that:
  ◮ m does not relate to overfitting
  ◮ m = √p is good for classification
  ◮ m = p/3 is good for regression
◮ (for us, the default m is OK!)

Random forest
Experimentally shown to be one of the “best” multi-purpose supervised classification methods:
◮ Manuel Fernández-Delgado et al. “Do we need hundreds of classifiers to solve real world classification problems?”. In: J. Mach. Learn. Res. 15.1 (2014), pp. 3133–3181
but...

No free lunch!
“Any two optimization algorithms are equivalent when their performance is averaged across all possible problems.”
◮ David H. Wolpert. “The lack of a priori distinctions between learning algorithms”. In: Neural Computation 8.7 (1996), pp. 1341–1390
Why “free lunch”?
◮ many restaurants, many items on the menus, many possible prices for each item: where to go to eat?
◮ no general answer
◮ but, if you are a vegan, or like pizza, then a best choice could exist
Q: problem? algorithm?

Nature of the prediction
Consider classification:
◮ tree → the class
  ◮ “virginica” is just “virginica”
◮ forest → the class, as resulting from a voting
  ◮ “241 virginica, 170 versicolor, 89 setosa” is different than “478 virginica, 10 versicolor, 2 setosa”
Is this information useful/exploitable?
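
The voting is exposed as per-class fractions; a minimal scikit-learn sketch (strictly, scikit-learn averages the trees’ probability estimates, which plays the role of counting votes):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

# predict() returns only the winning class; predict_proba() exposes the voting
print(forest.predict(X[:1]))        # e.g., [0]
print(forest.predict_proba(X[:1]))  # e.g., [[1.0, 0.0, 0.0]]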
