Introduction to Machine Learning
Andrea De Lorenzo
A.Y. 2020 1/122

Section 1: General information 2/122

Lecturers: Andrea De Lorenzo, Dipartimento di Ingegneria e Architettura (DIA), http://delorenzo.inginf.units.it/ 3/122


  1. Notation and terminology
Different communities (e.g., statistical learning vs. machine learning vs. artificial intelligence) use different terms and notation:
◮ x_j^(i) instead of x_{i,j} (hence x^(i) instead of x_i)
◮ m instead of n and n instead of p
◮ ...
Focus on the meaning! 24/122


  6. Iris: visual interpretation
Simplification: forget the petal variables and I. virginica → 2 variables, 2 species (binary classification problem).
[Scatter plot: sepal width vs. sepal length, I. setosa vs. I. versicolor]
◮ Problem: given any new observation, we want to automatically assign the species.
◮ Sketch of a possible solution:
  1. learn a model (classifier)
  2. “use” the model on new observations
25/122

  7. “A” model?
There could be many possible models:
◮ how to choose?
◮ how to compare?
Q: a model of what? 26/122

  8. Choosing the model
The choice of the model/tool/technique to use is determined by many factors:
◮ problem size (n and p)
◮ availability of an output variable (y)
◮ computational effort (when learning or “using”)
◮ explicability of the model
◮ ...
We will see some options. 27/122


  10. Comparing many models
Experimentally: does the model work well on (new) data?
Define “works well”:
◮ a single performance index?
◮ how to measure?
◮ repeatability/reproducibility... Q: what’s the difference?
We will see/discuss some options. 28/122

  11. It does not work well... Why?
◮ the data is not informative
◮ the data is not representative
◮ the data has changed
◮ the data is too noisy
We will see/discuss these issues. 29/122

  12. ML is not magic
Problem: find the birth town from height/weight.
[Scatter plot: height [cm] vs. weight [kg], Trieste vs. Udine]
Q: which is the data issue here? 30/122

  13. Implementation
When “solving” a problem, we usually need to:
◮ explore/visualize data
◮ apply one or more ML techniques
◮ assess learned models
“By hand?” No, with software! 31/122

  14. ML/DM software
Many options:
◮ libraries for general-purpose languages:
  ◮ Java: e.g., http://haifengl.github.io/smile/
  ◮ Python: e.g., http://scikit-learn.org/stable/
  ◮ ...
◮ specialized software environments:
  ◮ Octave: https://en.wikipedia.org/wiki/GNU_Octave
  ◮ R: https://en.wikipedia.org/wiki/R_(programming_language)
◮ from scratch
32/122

  15. ML/DM software: which one?
◮ production vs. prototype
◮ platform constraints
◮ degree of (data) customization
◮ documentation availability/community size
◮ ...
◮ previous knowledge/skills
33/122

  16. ML/DM software: why?
In all cases, software allows us to be more productive and concise.
E.g., learn and use a model for classification, in Java+Smile:

double[][] instances = ...;
int[] labels = ...;
RandomForest classifier = (new RandomForest.Trainer()).train(instances, labels);
double[] newInstance = ...;
int newLabel = classifier.predict(newInstance);

In R:

d = ...
classifier = randomForest(label~., d)
newD = ...
newLabels = predict(classifier, newD)

34/122

  17. Section 3 Plotting data: an overview 35/122

  18. Advanced plotting
◮ many packages (e.g., ggplot2)
◮ many options
Which chart best supports a given thesis? 36/122

  19–22. Aim of a plot: examples
[Four slides of example charts] 37–40/122

  23. Section 4 Tree-based methods 41/122

  24. The carousel robot attendant
Problem: replace the carousel attendant with a robot which automatically decides who can ride the carousel. 42/122


  29. Carousel: data
Observed human attendant’s decisions. How can the robot take the decision?
[Scatter plot: height h [cm] vs. age a [year], “Cannot ride” vs. “Can ride”]
◮ if younger than 10 → can’t!
◮ otherwise:
  ◮ if shorter than 120 → can’t!
  ◮ otherwise → can!
Decision tree!
  a < 10
    T → can’t
    F → h < 120
      T → can’t
      F → can
43/122
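The tree above can be written directly as code. A minimal sketch in Python (the function name and argument order are illustrative, not from the slides):

```python
def can_ride(age, height):
    """Carousel decision tree: two branch nodes, three terminal nodes."""
    if age < 10:        # first split: a < 10
        return False    # too young: cannot ride
    if height < 120:    # second split: h < 120
        return False    # too short: cannot ride
    return True         # old enough and tall enough: can ride
```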

  30. How to build a decision tree
Divide-et-impera (divide and conquer), recursively:
◮ find a cut variable and a cut value
◮ for the left branch, divide-et-impera
◮ for the right branch, divide-et-impera
44/122

  31. How to build a decision tree: detail
Recursive binary splitting:

function BuildDecisionTree(X, y)
  if ShouldStop(y) then
    ŷ ← most common class in y
    return new terminal node with ŷ
  else
    (i, t) ← BestBranch(X, y)
    n ← new branch node with (i, t)
    append child BuildDecisionTree(X|x_i<t, y|x_i<t) to n
    append child BuildDecisionTree(X|x_i≥t, y|x_i≥t) to n
    return n
  end if
end function

◮ Recursive binary splitting
◮ Top-down (start from the “big” problem)
45/122
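Since the course lists Python among the software options, the pseudocode can be sketched in plain Python. The data layout (lists of rows), the dict-based node representation, and the k_min default are assumptions for illustration; BestBranch here greedily minimizes the summed classification error of the two subsets:

```python
from collections import Counter

def misclassification(ys):
    """Classification error of predicting the most common class in ys."""
    return 1 - Counter(ys).most_common(1)[0][1] / len(ys)

def best_branch(X, y):
    """Greedy search for (variable index i, threshold t) minimizing
    E(y | x_i >= t) + E(y | x_i < t)."""
    best = None
    for i in range(len(X[0])):
        for t in sorted({row[i] for row in X}):
            left = [yj for row, yj in zip(X, y) if row[i] < t]
            right = [yj for row, yj in zip(X, y) if row[i] >= t]
            if not left or not right:  # degenerate split: skip
                continue
            err = misclassification(left) + misclassification(right)
            if best is None or err < best[0]:
                best = (err, i, t)
    return (best[1], best[2]) if best is not None else None

def build_decision_tree(X, y, k_min=2):
    """Recursive binary splitting, top-down, as in the pseudocode."""
    if len(set(y)) == 1 or len(y) < k_min:  # ShouldStop(y)
        return {"label": Counter(y).most_common(1)[0][0]}
    split = best_branch(X, y)
    if split is None:  # no admissible split: make a terminal node
        return {"label": Counter(y).most_common(1)[0][0]}
    i, t = split
    return {"var": i, "thr": t,
            "left": build_decision_tree(
                [r for r in X if r[i] < t],
                [yj for r, yj in zip(X, y) if r[i] < t], k_min),
            "right": build_decision_tree(
                [r for r in X if r[i] >= t],
                [yj for r, yj in zip(X, y) if r[i] >= t], k_min)}

def predict(tree, x):
    """Walk from the root to a terminal node."""
    while "label" not in tree:
        tree = tree["left"] if x[tree["var"]] < tree["thr"] else tree["right"]
    return tree["label"]
```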

  32. Best branch

function BestBranch(X, y)
  (i*, t*) ← argmin_{i,t} E(y|x_i≥t) + E(y|x_i<t)
  return (i*, t*)
end function

Classification error on a subset:
E(y) = |{y ∈ y : y ≠ ŷ}| / |y|
with ŷ = the most common class in y.

◮ Greedy (choose the split that minimizes the error now, not in later steps)
46/122

  33. Best branch
(i*, t*) ← argmin_{i,t} E(y|x_i≥t) + E(y|x_i<t)
The formula says what is done, not how it is done!
Q: “how” can different methods differ? 47/122

  34. Stopping criterion

function ShouldStop(y)
  if y contains only one class then
    return true
  else if |y| < k_min then
    return true
  else
    return false
  end if
end function

Other possible criterion:
◮ tree depth larger than d_max
48/122

  35. Best branch criteria
Classification error E() works, but has been shown to be “not sufficiently sensitive for tree-growing”.
E(y) = |{y ∈ y : y ≠ ŷ}| / |y| = 1 − max_c |{y ∈ y : y = c}| / |y| = 1 − max_c p_{y,c}
Two other options:
◮ Gini index: G(y) = Σ_c p_{y,c} (1 − p_{y,c})
◮ Cross-entropy: D(y) = − Σ_c p_{y,c} log p_{y,c}
For all indexes, the lower the better (node impurity). 49/122
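The three impurity indexes can be computed from the class proportions p_{y,c} in a few lines; a sketch (function name is illustrative):

```python
from collections import Counter
from math import log

def impurities(ys):
    """Return (classification error E, Gini index G, cross-entropy D)
    for a list of class labels ys."""
    n = len(ys)
    ps = [count / n for count in Counter(ys).values()]  # proportions p_{y,c}
    error = 1 - max(ps)                                 # E = 1 - max_c p_{y,c}
    gini = sum(p * (1 - p) for p in ps)                 # G = sum_c p(1 - p)
    entropy = -sum(p * log(p) for p in ps)              # D = -sum_c p log p
    return error, gini, entropy
```

All three are zero on a pure node and largest on a balanced one, matching the “lower is better” reading on the slide.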

  36. Best branch criteria: binary classification
[Plot: classification error E, Gini index G, and cross-entropy D (rescaled) as functions of p_{y,c}; all vanish at p_{y,c} = 0 and p_{y,c} = 1]
Q: what happens with multiclass problems? 50/122

  37. Categorical independent variables
◮ Trees can work with categorical variables
◮ The branch node is x_i = c or x_i ∈ C′ ⊂ C (c is a class)
◮ Can mix categorical and numeric variables
51/122

  38. Stopping criterion: role of k_min
Suppose k_min = 1 (never stop because of the size of y).
[Scatter plot with the resulting splits: h < 120, a < 10, a < 9.0, a < 9.1, a < 9.4, a < 9.6]
Q: what’s wrong? (recall: “a model of what?”) 52/122

  39. Tree complexity
When the tree is “too complex”:
◮ it is less readable/understandable/explicable
◮ maybe there was noise in the data
Q: what’s the noise in the carousel data?
Tree complexity is related not only to k_min, but also to the data. 53/122

  40. Tree complexity: other interpretation
◮ maybe there was noise in the data
The tree fits the learning data too much:
◮ it overfits (overfitting)
◮ it does not generalize (high variance: the model varies if the learning data varies)
54/122


  43. High variance
“model varies if learning data varies”: what? why does the data vary?
◮ learning data is about the system/phenomenon/nature S
  ◮ a collection of observations of S
  ◮ a point of view on S
◮ learning is about understanding/knowing/explaining S
◮ if I change the point of view on S, my knowledge about S should remain the same!
55/122


  45. Spotting overfitting
[Plot: learning error and test error vs. model complexity; the learning error keeps decreasing while the test error eventually rises]
Test error: error on unseen data. 56/122

  46. k-fold cross-validation
Where can I find “unseen data”? Pretend to have it!
1. split the learning data (X and y) into k equal slices (each of n/k observations/elements)
2. for each split (i.e., each i ∈ {1, ..., k}):
  2.1 learn on all but the i-th slice
  2.2 compute the classification error on the unseen i-th slice
3. average the k classification errors
In essence:
◮ can the learner generalize beyond the available data?
◮ how will the learned artifact behave on unseen data?
57/122
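The three steps above can be sketched in Python; the `learn` and `error` callables are placeholders for any learner and any effectiveness measure, and the sequential slicing is one possible splitting choice (the slides ask how data should be split):

```python
def k_fold_cv(X, y, learn, error, k=5):
    """k-fold CV: split into k slices, learn on all but one slice,
    measure the error on the held-out slice, average the k errors."""
    n = len(y)
    slices = [list(range(i * n // k, (i + 1) * n // k)) for i in range(k)]
    errors = []
    for held_out in slices:
        train = [j for j in range(n) if j not in held_out]
        model = learn([X[j] for j in train], [y[j] for j in train])
        errors.append(error(model,
                            [X[j] for j in held_out],
                            [y[j] for j in held_out]))
    return sum(errors) / k
```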

  47. k-fold cross-validation
[Diagram: folding 1 → error 1, folding 2 → error 2, ..., folding 5 → error 5]
error = (1/k) Σ_{i=1}^{k} error_i
Or with any other meaningful (effectiveness) measure.
Q: how should data be split? 58/122

  48. Fighting overfitting with trees
◮ large k_min (large w.r.t. what?)
◮ when building, limit the depth
◮ when building, don’t split if the overall impurity decrease is low
◮ after building, prune
59/122

  49. Pruning: high-level idea
1. learn a full tree t_0
2. build from t_0 a sequence T = {t_0, t_1, ..., t_n} of trees such that:
  ◮ t_i is a root-subtree of t_{i−1} (t_i ⊂ t_{i−1})
  ◮ t_i is always less complex than t_{i−1}
3. choose the t ∈ T with minimum classification error with k-fold cross-validation
60/122

  50. k-fold cross-validation: data splitting
Q: how should data be split? Example: Android malware detection
◮ Gerardo Canfora et al. “Effectiveness of opcode ngrams for detection of multi family android malware”. In: Availability, Reliability and Security (ARES), 2015 10th International Conference on. IEEE. 2015, pp. 333–340
◮ Gerardo Canfora et al. “Detecting android malware using sequences of system calls”. In: Proceedings of the 3rd International Workshop on Software Development Lifecycle for Mobile. ACM. 2015, pp. 13–20
61/122

  51. Using cross-validation (CV) for assessment (I)
How will the learned artifact behave on unseen data?
More precisely: how will an artifact learned with this learning technique behave on unseen data? 62/122

  52. Using CV for assessment (II)
“This learning technique” = BuildDecisionTree() with k_min = 10
1. repeat k times:
  1.1 BuildDecisionTree() with k_min = 10 on all but one slice
    ◮ (k−1)/k · n observations in each X passed to BuildDecisionTree()
  1.2 compute the classification error on the left-out slice
2. average the computed classification errors
k invocations of BuildDecisionTree() 63/122

  53. Using CV for assessment (III)
“This learning technique” = BuildDecisionTree() with k_min chosen automatically with a 10-fold CV
For assessing this technique, we do two nested CVs:
1. repeat k times:
  1.1 choose k_min among m values with a 10-fold CV (repeat BuildDecisionTree() 10m times) on all but one slice
    ◮ (9/10) · ((k−1)/k) · n observations in each X passed to BuildDecisionTree()!
  1.2 compute the classification error on the left-out slice
    ◮ usually, a new tree is built on the (k−1)/k · n observations
2. average the computed classification errors
(10m + 1)k invocations of BuildDecisionTree() 64/122
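The (10m + 1)k invocation count can be checked by writing the nested loops explicitly; a sketch that only counts calls (the function name is illustrative):

```python
def nested_cv_invocations(k, m, inner_folds=10):
    """Count BuildDecisionTree() invocations in the nested-CV assessment:
    for each of the k outer folds, the inner CV tries m candidate k_min
    values on inner_folds splits, then one final tree is built with the
    chosen k_min on the outer training slice."""
    calls = 0
    for _ in range(k):                    # outer CV: assessment
        for _ in range(m):                # candidate k_min values
            for _ in range(inner_folds):  # inner CV: model selection
                calls += 1                # one BuildDecisionTree() per inner split
        calls += 1                        # final tree with the chosen k_min
    return calls
```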

  54. Using CV for assessment: “cheating”
“This learning technique” = BuildDecisionTree() with k_min chosen automatically with a 10-fold CV
Using just one CV is cheating (cherry picking)!
◮ k_min is chosen exactly to minimize the error on the full dataset
◮ conceptually, this way of “fitting” k_min is similar to the way we build the tree
65/122

  55. Subsection 1 Regression trees 66/122

  56. Regression with trees Trees can be used for regression, instead of classification. decision tree vs. regression tree 67/122

  57. Tree building: decision → regression

function BuildDecisionTree(X, y)
  if ShouldStop(y) then
    ŷ ← most common class in y
    return new terminal node with ŷ
  else
    (i, t) ← BestBranch(X, y)
    n ← new branch node with (i, t)
    append child BuildDecisionTree(X|x_i<t, y|x_i<t) to n
    append child BuildDecisionTree(X|x_i≥t, y|x_i≥t) to n
    return n
  end if
end function

Q: what should we change? 68/122

  58. Tree building: decision → regression

function BuildDecisionTree(X, y)
  if ShouldStop(y) then
    ŷ ← ȳ  ⊲ mean of y
    return new terminal node with ŷ
  else
    (i, t) ← BestBranch(X, y)
    n ← new branch node with (i, t)
    append child BuildDecisionTree(X|x_i<t, y|x_i<t) to n
    append child BuildDecisionTree(X|x_i≥t, y|x_i≥t) to n
    return n
  end if
end function

Q: what should we change? 68/122

  59. Best branch

function BestBranch(X, y)
  (i*, t*) ← argmin_{i,t} E(y|x_i≥t) + E(y|x_i<t)
  return (i*, t*)
end function

Q: what should we change? 69/122

  60. Best branch

function BestBranch(X, y)
  (i*, t*) ← argmin_{i,t} Σ_{y_j ∈ y|x_i≥t} (y_j − ȳ)² + Σ_{y_j ∈ y|x_i<t} (y_j − ȳ)²
  return (i*, t*)
end function

Q: what should we change?
Minimize the sum of the residual sums of squares (RSS) (the two ȳ are different). 69/122
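The regression version of the best-branch search can be sketched like the classification one, swapping the classification error for the RSS of each subset around its own mean (function names are illustrative):

```python
def rss(ys):
    """Residual sum of squares of ys around its mean."""
    if not ys:
        return 0.0
    mean = sum(ys) / len(ys)
    return sum((yi - mean) ** 2 for yi in ys)

def best_branch_regression(X, y):
    """Find (variable index i, threshold t) minimizing RSS(left) + RSS(right);
    each side uses its own mean, as on the slide."""
    best = None
    for i in range(len(X[0])):
        for t in sorted({row[i] for row in X}):
            left = [yj for row, yj in zip(X, y) if row[i] < t]
            right = [yj for row, yj in zip(X, y) if row[i] >= t]
            if not left or not right:  # degenerate split: skip
                continue
            total = rss(left) + rss(right)
            if best is None or total < best[0]:
                best = (total, i, t)
    return (best[1], best[2]) if best is not None else None
```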

  61. Stopping criterion

function ShouldStop(y)
  if y contains only one class then
    return true
  else if |y| < k_min then
    return true
  else
    return false
  end if
end function

Q: what should we change? 70/122

  62. Stopping criterion

function ShouldStop(y)
  if the RSS of y is 0 then
    return true
  else if |y| < k_min then
    return true
  else
    return false
  end if
end function

Q: what should we change? 70/122

  63. Interpretation
[Plot: piecewise-constant regression-tree prediction over a one-dimensional input] 71/122
