
Introduction to Machine Learning
Eric Medvet, 16/3/2017

Outline:
◮ Machine Learning: what and why?
◮ Motivating example
◮ Tree-based methods
  ◮ Regression trees
  ◮ Trees aggregation

Teachers: Eric Medvet, Dipartimento di Ingegneria e Architettura


Iris: visual interpretation
Simplification: forget the petal variables and I. virginica → 2 variables, 2 species (binary classification problem).
◮ Problem: given any new observation, we want to automatically assign the species.
◮ Sketch of a possible solution:
  1. learn a model (classifier)
  2. “use” the model on new observations
[Figure: scatter plot of sepal width vs. sepal length for I. setosa and I. versicolor]
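
A minimal sketch of this simplified setting using scikit-learn (the choice of a tree classifier and of the train/test split is arbitrary; the library calls are real):

import numpy as np
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X, y = iris.data[:, :2], iris.target   # keep only sepal length and sepal width
X, y = X[y != 2], y[y != 2]            # drop I. virginica: binary problem

# 1. learn a model (classifier) on some observations...
X_train, X_new, y_train, y_new = train_test_split(X, y, random_state=0)
model = DecisionTreeClassifier().fit(X_train, y_train)
# 2. ...and "use" the model on new observations
print(model.predict(X_new[:3]))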

“A” model?
There could be many possible models:
◮ how to choose?
◮ how to compare?

Choosing the model
The choice of the model/tool/algorithm to be used is determined by many factors:
◮ problem size (n and p)
◮ availability of an output variable (y)
◮ computational effort (when learning or “using”)
◮ explicability of the model
◮ ...
We will see many options.

Comparing many models
Experimentally: does the model work well on (new) data? We must define “works well”:
◮ a single performance index?
◮ how to measure it?
◮ repeatability/reproducibility...
We will see/discuss many options.

It does not work well... Why?
◮ the data is not informative
◮ the data is not representative
◮ the data has changed
◮ the data is too noisy
We will see/discuss these issues.

ML is not magic
Problem: find birth town from height/weight.
[Figure: scatter plot of height [cm] vs. weight [kg], with points labeled Trieste or Udine]
Q: which is the data issue here?

Implementation
When “solving” a problem, we usually need to:
◮ explore/visualize the data
◮ apply one or more learning algorithms
◮ assess the learned models
“By hand?” No, with software!

ML/DM software
Many options:
◮ libraries for general-purpose languages:
  ◮ Java: e.g., http://haifengl.github.io/smile/
  ◮ Python: e.g., http://scikit-learn.org/stable/
  ◮ ...
◮ specialized software environments:
  ◮ Octave: https://en.wikipedia.org/wiki/GNU_Octave
  ◮ R: https://en.wikipedia.org/wiki/R_(programming_language)
◮ from scratch

ML/DM software: which one?
◮ production/prototype
◮ platform constraints
◮ degree of (data) customization
◮ documentation availability/community size
◮ ...
◮ previous knowledge/skills

Section 2
Tree-based methods

The carousel robot attendant
Problem: replace the carousel attendant with a robot which automatically decides who can ride the carousel.

Carousel: data
We observed the human attendant’s decisions. How can the robot take the decision?
◮ if younger than 10 → can’t!
◮ otherwise:
  ◮ if shorter than 120 → can’t!
  ◮ otherwise → can!
A decision tree!
[Figure: scatter plot of height h [cm] vs. age a [year], points labeled “Cannot ride”/“Can ride”, next to the tree: a < 10 (T/F), then h < 120 (T/F)]
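
This tree is small enough to transcribe directly; a minimal Python sketch (the thresholds are the ones on the slide, the function name is mine):

def can_ride(age, height):
    # the carousel decision tree: first a < 10, then h < 120
    if age < 10:            # younger than 10 -> can't
        return False
    return height >= 120    # shorter than 120 -> can't; otherwise -> can

print(can_ride(age=12, height=150))  # True
print(can_ride(age=8, height=150))   # False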

How to build a decision tree
Divide and conquer (divide et impera), recursively:
◮ find a cut variable and a cut value
◮ for the left branch, divide and conquer
◮ for the right branch, divide and conquer

How to build a decision tree: detail
Recursive binary splitting:

function BuildDecisionTree(X, y)
  if ShouldStop(y) then
    ŷ ← most common class in y
    return new terminal node with ŷ
  else
    (i, t) ← BestBranch(X, y)
    n ← new branch node with (i, t)
    append child BuildDecisionTree(X|x_i<t, y|x_i<t) to n
    append child BuildDecisionTree(X|x_i≥t, y|x_i≥t) to n
    return n
  end if
end function

◮ Recursive binary splitting
◮ Top-down (start from the “big” problem)

Best branch

function BestBranch(X, y)
  (i⋆, t⋆) ← argmin_{i,t} E(y|x_i≥t) + E(y|x_i<t)
  return (i⋆, t⋆)
end function

Classification error on a subset:
E(y) = |{y ∈ y : y ≠ ŷ}| / |y|, where ŷ = the most common class in y

◮ Greedy (chooses the split that minimizes the error now, not in later steps)

Best branch
(i⋆, t⋆) ← argmin_{i,t} E(y|x_i≥t) + E(y|x_i<t)
The formula says what is done, not how it is done!
Q: can different “hows” differ? How?

Stopping criterion

function ShouldStop(y)
  if y contains only one class then
    return true
  else if |y| < k_min then
    return true
  else
    return false
  end if
end function

Another possible criterion:
◮ tree depth larger than d_max
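
Putting the three pseudocode functions together, a minimal runnable Python sketch (the tuple-based node representation, the default k_min, and the function names are my choices, not part of the slides):

import numpy as np
from collections import Counter

def error(y):
    # classification error E(y): fraction of elements differing from the most common class
    return 1.0 - Counter(y).most_common(1)[0][1] / len(y)

def should_stop(y, k_min):
    return len(set(y)) == 1 or len(y) < k_min

def best_branch(X, y):
    # greedy exhaustive search for the cut (i*, t*) minimizing the total error
    best, best_err = None, float("inf")
    for i in range(X.shape[1]):
        for t in np.unique(X[:, i])[1:]:  # skip the minimum so both sides are non-empty
            err = error(y[X[:, i] < t]) + error(y[X[:, i] >= t])
            if err < best_err:
                best, best_err = (i, t), err
    return best

def build_decision_tree(X, y, k_min=5):
    branch = None if should_stop(y, k_min) else best_branch(X, y)
    if branch is None:                            # terminal node: most common class
        return Counter(y).most_common(1)[0][0]
    i, t = branch
    lo = X[:, i] < t
    return (i, t,
            build_decision_tree(X[lo], y[lo], k_min),    # left:  x_i < t
            build_decision_tree(X[~lo], y[~lo], k_min))  # right: x_i >= t

def predict_one(node, x):
    # walk from the root down to a terminal node
    while isinstance(node, tuple):
        i, t, left, right = node
        node = left if x[i] < t else right
    return node

# usage: tree = build_decision_tree(X, y); predict_one(tree, x_new)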

Categorical independent variables
◮ Trees can work with categorical variables
◮ The branch node becomes x_i = c or x_i ∈ C′ ⊂ C (c is one of the variable’s categories)
◮ Categorical and numeric variables can be mixed
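
For instance, a categorical branch node tests set membership instead of a threshold; a hypothetical sketch (the function and the encoding of the cut are mine):

def branch_goes_left(x, i, cut):
    # route an observation left at a branch node (i, cut);
    # a frozenset cut encodes the categorical test x_i ∈ C' ⊂ C
    if isinstance(cut, frozenset):
        return x[i] in cut
    return x[i] < cut  # numeric test x_i < t

print(branch_goes_left(["red", 4.2], 0, frozenset({"red", "blue"})))  # True
print(branch_goes_left(["red", 4.2], 1, 5.0))                         # True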

Stopping criterion: the role of k_min
Suppose k_min = 1 (never stop because of the size of y).
[Figure: the carousel data with the many splits of the resulting tree: h < 120, a < 10, a < 9.0, a < 9.1, a < 9.4, a < 9.6]
Q: what’s wrong?

Tree complexity
When the tree is “too complex”:
◮ it is less readable/understandable/explicable
◮ maybe there was noise in the data
Q: what’s the noise in the carousel data?
The tree complexity issue is not related (only) to k_min.

Tree complexity: another interpretation
◮ maybe there was noise in the data
The tree fits the learning data too much:
◮ it overfits (overfitting)
◮ it does not generalize (high variance: the model varies if the learning data varies)

High variance
“The model varies if the learning data varies”: what? why does the data vary?
◮ the learning data is about the system/phenomenon/nature S
  ◮ a collection of observations of S
  ◮ a point of view on S
◮ learning is about understanding/knowing/explaining S
◮ if I change the point of view on S, my knowledge about S should remain the same!

Fighting overfitting
◮ large k_min (large w.r.t. what?)
◮ when building, limit the depth
◮ when building, don’t split if the overall impurity decrease is low
◮ after building, prune
(bias and variance will be detailed later)
The corresponding scikit-learn knobs are sketched below.
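
A hedged sketch of those knobs as scikit-learn parameters (the values are arbitrary examples, not recommendations):

from sklearn.tree import DecisionTreeClassifier

tree = DecisionTreeClassifier(
    min_samples_split=10,        # roughly the role of k_min
    max_depth=5,                 # when building, limit the depth
    min_impurity_decrease=0.01,  # don't split on a low overall impurity decrease
    ccp_alpha=0.005,             # after building, cost-complexity pruning
)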

Evaluation: k-fold cross-validation
How can we estimate the predictor’s performance on new (unavailable) data?
1. split the learning data (X and y) into k equal slices (each of n/k observations)
2. for each slice (i.e., for each i ∈ {1, ..., k}):
  2.1 learn on all but the i-th slice
  2.2 compute the classification error on the unseen i-th slice
3. average the k classification errors
In essence:
◮ can the learner generalize on the available data?
◮ how will the learned artifact behave on unseen data?

Evaluation: k-fold cross-validation
[Diagram: k = 5 foldings, each producing one of accuracy_1, ..., accuracy_5]
accuracy = (1/k) · Σ_{i=1}^{k} accuracy_i
Or with the classification error rate, or any other meaningful (effectiveness) measure.
Q: how should the data be split?
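
A minimal k-fold cross-validation sketch with scikit-learn (k = 5 here; shuffling before splitting is one common answer to the question above):

from sklearn.datasets import load_iris
from sklearn.model_selection import KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(DecisionTreeClassifier(), X, y, cv=cv)
print(scores, scores.mean())  # the k accuracies and their average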

Subsection 1
Regression trees

Regression with trees
Trees can be used for regression, instead of classification: decision tree vs. regression tree.

Tree building: decision → regression
Recall the BuildDecisionTree(X, y) pseudocode above.
Q: what should we change?

Tree building: decision → regression

function BuildDecisionTree(X, y)
  if ShouldStop(y) then
    ŷ ← ȳ    ⊲ the mean of y, instead of the most common class
    return new terminal node with ŷ
  else
    (i, t) ← BestBranch(X, y)
    n ← new branch node with (i, t)
    append child BuildDecisionTree(X|x_i<t, y|x_i<t) to n
    append child BuildDecisionTree(X|x_i≥t, y|x_i≥t) to n
    return n
  end if
end function

A: the terminal node predicts the mean ȳ (and, accordingly, BestBranch minimizes a regression error, e.g., the sum of squared errors, instead of the classification error).
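
In scikit-learn the switch is just a different class: leaves predict the mean of their y values and splits minimize the squared error. A minimal sketch (the toy data is made up for illustration):

import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.linspace(0, 30, 200).reshape(-1, 1)             # 1-D input
y = (X.ravel() > 15) * 3.0 + rng.normal(0, 0.3, 200)   # noisy step

model = DecisionTreeRegressor(max_depth=2).fit(X, y)
print(model.predict([[5.0], [25.0]]))  # roughly [0, 3]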

Interpretation
[Figure: the piecewise-constant prediction of a regression tree over a 1-D input in [0, 30]]

Regression and overfitting
[Image from F. Daolio]

Trees in summary
Pros:
✓ easily interpretable/explicable
✓ learning and regression/classification easily understandable
✓ can handle both numeric and categorical values
Cons:
✗ not so accurate (Q: always?)

Tree accuracy?
[Image from An Introduction to Statistical Learning]

Subsection 2
Trees aggregation

Weakness of the tree
[Figure: noisy 1-D data whose right part follows a curve]
Small tree:
◮ low complexity
◮ will hardly fit the “curve” part
◮ high bias, low variance
Big tree:
◮ high complexity
◮ may overfit the noise on the right part
◮ low bias, high variance

The trees view
Small tree: “a car is something that moves”.
Big tree: “a car is a made-in-Germany blue object with 4 wheels, 2 doors, chromed fenders, and a curved rear enclosing the engine”.

Big tree view
A big tree:
◮ has a detailed view of the learning data (high complexity)
◮ “trusts too much” the learning data (high variance)
What if we “combine” different big tree views and ignore the details on which they disagree?

Wisdom of the crowds
What if we “combine” different big tree views and ignore the details on which they disagree?
◮ many views
◮ independent views
◮ aggregation of views
≈ the wisdom of the crowds: a collective opinion may be better than a single expert’s opinion.

Wisdom of the trees
◮ many views → just use many trees
◮ independent views → ??? (learning is deterministic: same data ⇒ same tree ⇒ same view)
◮ aggregation of views → just average the predictions (regression) or take the most common prediction (classification)

Independent views
Independent views ≡ different points of view ≡ different learning data.
But we have only one learning data set!

Independent views: idea!
As in cross-validation, consider only a part of the data, but:
◮ instead of a subset,
◮ a sample with repetitions.

X   = (x_1^T x_2^T x_3^T x_4^T x_5^T)   original learning data
X_1 = (x_1^T x_5^T x_3^T x_2^T x_5^T)   sample 1
X_2 = (x_4^T x_2^T x_3^T x_1^T x_1^T)   sample 2
X_i = ...                               sample i

◮ (y omitted for brevity)
◮ the learning data size is not a limitation (differently than with a subset)
This is bagging of trees (bootstrap, more in general).
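
A bootstrap sample is one line of NumPy; a minimal sketch:

import numpy as np

rng = np.random.default_rng(0)
X = np.arange(10).reshape(5, 2)             # 5 observations, p = 2

idx = rng.integers(0, len(X), size=len(X))  # draw row indices WITH repetition
X_b = X[idx]                                # one bootstrap sample (use y[idx] likewise)
print(idx)                                  # some rows repeated, some missing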

Tree bagging
When learning:
1. repeat B times:
  1.1 take a sample of the learning data
  1.2 learn a tree (unpruned)
When predicting:
1. get a prediction from each of the B learned trees
2. predict the average (or the most common) prediction
For classification, other aggregations can be used: majority voting (most common) is the simplest.
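
With scikit-learn, bagging of unpruned trees takes a few lines; a minimal sketch (B = 100 is arbitrary):

from sklearn.datasets import load_iris
from sklearn.ensemble import BaggingClassifier
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
# B = 100 bootstrap samples, one unpruned tree each; predictions are
# aggregated by majority voting.
bag = BaggingClassifier(DecisionTreeClassifier(), n_estimators=100,
                        bootstrap=True, random_state=0).fit(X, y)
print(bag.predict(X[:3]))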

How many trees?
B is a parameter:
◮ when there is a parameter, there is the problem of finding a good value
◮ remember k_min and depth (Q: impact on?)
◮ it has been shown (experimentally) that:
  ◮ for “large” B, bagging is better than a single tree
  ◮ increasing B does not cause overfitting
  ◮ (for us: the default B is OK! “large” ≈ hundreds)
Q: how much better? at which cost?

Bagging
[Figure: test error (×10⁻²) vs. the number B of trees, for B from 0 to 500]

Independent views: improvement
Despite being learned on different samples, bagged trees may be correlated, hence the views are not very independent:
◮ e.g., when one variable is much more important than the others for predicting (a strong predictor)
Idea: force point-of-view differentiation by “hiding” variables.

Random forest
When learning:
1. repeat B times:
  1.1 take a sample of the learning data
  1.2 consider only m out of the p independent variables
  1.3 learn a tree (unpruned)
When predicting:
1. get a prediction from each of the B learned trees
2. predict the average (or the most common) prediction
◮ (observations and) variables are randomly chosen...
◮ ...to learn a forest of trees
Q: are the missing variables a problem?
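
A scikit-learn sketch; note that RandomForestClassifier redraws the m variables at every split rather than once per tree, but max_features plays exactly the role of m:

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
# B = 500 trees; max_features="sqrt" gives m = sqrt(p), the usual choice
# for classification (for regression, m = p/3 is the common choice).
forest = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                                random_state=0).fit(X, y)
print(forest.predict(X[:3]))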

Random forest: the parameter m
How to choose the value of m?
◮ m = p → bagging
◮ it has been shown (experimentally) that:
  ◮ m does not relate to overfitting
  ◮ m = √p is good for classification
  ◮ m = p/3 is good for regression
◮ (for us, the default m is OK!)

Random forest
Experimentally shown to be one of the “best” multi-purpose supervised classification methods:
◮ Manuel Fernández-Delgado et al. “Do we need hundreds of classifiers to solve real world classification problems?”. In: J. Mach. Learn. Res. 15.1 (2014), pp. 3133–3181
but...

No free lunch!
“Any two optimization algorithms are equivalent when their performance is averaged across all possible problems.”
◮ David H. Wolpert. “The lack of a priori distinctions between learning algorithms”. In: Neural Computation 8.7 (1996), pp. 1341–1390
Why “free lunch”?
◮ many restaurants, many items on the menus, many possible prices for each item: where to go to eat?
◮ no general answer
◮ but, if you are a vegan, or like pizza, then a best choice could exist
Q: problem? algorithm?

Nature of the prediction
Consider classification:
◮ tree → the class
  ◮ “virginica” is just “virginica”
◮ forest → the class, as resulting from a voting
  ◮ “241 virginica, 170 versicolor, 89 setosa” is different than “478 virginica, 10 versicolor, 2 setosa”
Is this information useful/exploitable?
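
The voting is exposed as per-class fractions; a minimal scikit-learn sketch (strictly, scikit-learn averages the trees’ probability estimates, which plays the role of counting votes):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
forest = RandomForestClassifier(n_estimators=500, random_state=0).fit(X, y)

# predict() returns only the winning class; predict_proba() exposes the voting
print(forest.predict(X[:1]))        # e.g., [0]
print(forest.predict_proba(X[:1]))  # e.g., [[1.0, 0.0, 0.0]]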
