Supervised Learning: Decision Trees and Linear Models


  1. Lecture 10: Supervised Learning: Decision Trees and Linear Models
  Marco Chiarandini, Department of Mathematics & Computer Science, University of Southern Denmark
  Slides by Stuart Russell and Peter Norvig

  2. Course Overview
  ✔ Introduction
    ✔ Artificial Intelligence
    ✔ Intelligent Agents
  ✔ Search
    ✔ Uninformed Search
    ✔ Heuristic Search
  ✔ Uncertain knowledge and Reasoning
    ✔ Probability and Bayesian approach
    ✔ Bayesian Networks
    ✔ Hidden Markov Chains
    ✔ Kalman Filters
  Learning
    Supervised: Decision Trees, Neural Networks, Learning Bayesian Networks
    Unsupervised: EM Algorithm
    Reinforcement Learning
  Games and Adversarial Search
    Minimax search and Alpha-beta pruning
    Multiagent search
  Knowledge representation and Reasoning
    Propositional logic
    First order logic
    Inference
    Planning

  3. Machine Learning
  What? Parameters, network structure, hidden concepts
  What from? inductive: unsupervised, reinforcement, supervised
  What for? prediction, diagnosis, summarization
  How? passive vs active, online vs offline
  Type of outputs: regression, classification
  Details: generative, discriminative

  4. Supervised Learning
  Given a training set of N example input-output pairs {(x_1, y_1), (x_2, y_2), ..., (x_N, y_N)}, where each y_i was generated by an unknown function y = f(x), find a hypothesis function h from a hypothesis space H that approximates the true function f.
  Measure the accuracy of the hypothesis on a test set made of new examples. We aim at good generalization.

  5. Supervised Learning
  Construct/adjust h to agree with f on the training set (h is consistent if it agrees with f on all examples).
  E.g., curve fitting. [Figure: curve fitting; axes x and f(x).]
  Ockham's razor: maximize a combination of consistency and simplicity

  6. If we have a probability on the hypotheses:
  h* = argmax_{h ∈ H} Pr(h | data) = argmax_{h ∈ H} Pr(data | h) · Pr(h)
  Trade-off between the expressiveness of a hypothesis space and the complexity of finding a good hypothesis within that space.
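As a small illustration of this maximization (not from the slides; the coin-bias hypotheses, priors, and observations below are made up), the MAP hypothesis can be found by scoring each candidate h with Pr(data | h) · Pr(h):

```python
# A minimal sketch: h* = argmax_h Pr(data | h) Pr(h) over a small discrete hypothesis space.
# The hypotheses here are assumed coin biases (probability of heads).

def likelihood(data, h):
    """Pr(data | h) for i.i.d. coin flips, where h is the probability of heads."""
    p = 1.0
    for flip in data:
        p *= h if flip == "H" else (1 - h)
    return p

hypotheses = {0.3: 0.25, 0.5: 0.5, 0.7: 0.25}   # prior Pr(h) over three candidate biases
data = ["H", "H", "T", "H"]                      # illustrative observations

h_star = max(hypotheses, key=lambda h: likelihood(data, h) * hypotheses[h])
print(h_star)   # prints 0.5, the MAP hypothesis for these observations
```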

  7. Outline
  1. Decision Trees
  2. k-Nearest Neighbor
  3. Linear Models

  8. Learning Decision Trees
  A decision tree for a pair (x, y) represents a function that takes the input attributes x (Boolean, discrete, or continuous) and outputs a single Boolean value y.
  E.g., situations where I will/won't wait for a table. Training set:
  Example | Alt | Bar | Fri | Hun | Pat  | Price | Rain | Res | Type    | Est   | WillWait
  X1      | T   | F   | F   | T   | Some | $$$   | F    | T   | French  | 0–10  | T
  X2      | T   | F   | F   | T   | Full | $     | F    | F   | Thai    | 30–60 | F
  X3      | F   | T   | F   | F   | Some | $     | F    | F   | Burger  | 0–10  | T
  X4      | T   | F   | T   | T   | Full | $     | F    | F   | Thai    | 10–30 | T
  X5      | T   | F   | T   | F   | Full | $$$   | F    | T   | French  | >60   | F
  X6      | F   | T   | F   | T   | Some | $$    | T    | T   | Italian | 0–10  | T
  X7      | F   | T   | F   | F   | None | $     | T    | F   | Burger  | 0–10  | F
  X8      | F   | F   | F   | T   | Some | $$    | T    | T   | Thai    | 0–10  | T
  X9      | F   | T   | T   | F   | Full | $     | T    | F   | Burger  | >60   | F
  X10     | T   | T   | T   | T   | Full | $$$   | F    | T   | Italian | 10–30 | F
  X11     | F   | F   | F   | F   | None | $     | F    | F   | Thai    | 0–10  | F
  X12     | T   | T   | T   | T   | Full | $     | F    | F   | Burger  | 30–60 | T
  Classification of examples: positive (T) or negative (F)

  9. Decision trees
  One possible representation for hypotheses. E.g., here is the "true" tree for deciding whether to wait:
  [Decision tree figure: root tests Patrons? (None/Some/Full); the Full branch tests WaitEstimate? (>60, 30–60, 10–30, 0–10), with deeper tests on Alternate?, Hungry?, Reservation?, Fri/Sat?, Bar?, and Raining?, and leaves T/F.]

  10. Example

  11. Example

  12. Expressiveness
  Decision trees can express any function of the input attributes.
  E.g., for Boolean functions, truth table row → path to leaf:
  A | B | A xor B
  F | F | F
  F | T | T
  T | F | T
  T | T | F
  [Figure: the corresponding tree splits on A at the root and on B below it.]
  Trivially, there is a consistent decision tree for any training set, with one path to a leaf for each example (unless f is nondeterministic in x), but it probably won't generalize to new examples.
  Prefer to find more compact decision trees.

  13. Hypothesis spaces
  How many distinct decision trees with n Boolean attributes?
  = number of Boolean functions
  = number of distinct truth tables with 2^n rows
  = 2^(2^n) functions
  E.g., with 6 Boolean attributes, there are 18,446,744,073,709,551,616 trees.
  A more expressive hypothesis space
  – increases the chance that the target function can be expressed
  – increases the number of hypotheses consistent with the training set ⇒ may get worse predictions
  There is no feasible way to search for the smallest consistent tree among 2^(2^n) candidates.
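As a quick check of the arithmetic (a minimal sketch, not part of the slides):

```python
# Number of distinct Boolean functions of n Boolean attributes is 2^(2^n).
n = 6
print(2 ** (2 ** n))   # 18446744073709551616
```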

  14. Heuristic approach
  Greedy divide-and-conquer:
  – test the most important attribute first
  – divide the problem up into smaller subproblems that can be solved recursively

  function DTL(examples, attributes, default) returns a decision tree
    if examples is empty then return default
    else if all examples have the same classification then return the classification
    else if attributes is empty then return Plurality-Value(examples)
    else
      best ← Choose-Attribute(attributes, examples)
      tree ← a new decision tree with root test best
      for each value v_i of best do
        examples_i ← {elements of examples with best = v_i}
        subtree ← DTL(examples_i, attributes − best, Mode(examples))
        add a branch to tree with label v_i and subtree subtree
      return tree
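Below is a minimal Python sketch of the same recursion. The names `dtl` and `plurality_value` and the nested-dictionary tree representation are mine, not from the slides; `choose_attribute` is left as a parameter so any splitting criterion (e.g. information gain) can be plugged in:

```python
from collections import Counter

def plurality_value(examples):
    """Most common output value among the examples (ties broken arbitrarily)."""
    return Counter(y for _, y in examples).most_common(1)[0][0]

def dtl(examples, attributes, default, choose_attribute):
    """examples: list of (x, y) pairs where x is a dict mapping attribute -> value."""
    if not examples:
        return default
    classes = {y for _, y in examples}
    if len(classes) == 1:                       # all examples have the same classification
        return classes.pop()
    if not attributes:
        return plurality_value(examples)
    best = choose_attribute(attributes, examples)
    tree = {best: {}}                           # root test on attribute `best`
    for v in {x[best] for x, _ in examples}:    # each value of best observed in the examples
        examples_v = [(x, y) for x, y in examples if x[best] == v]
        subtree = dtl(examples_v,
                      [a for a in attributes if a != best],
                      plurality_value(examples),
                      choose_attribute)
        tree[best][v] = subtree                 # branch labelled v
    return tree
```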

  15. Choosing an attribute
  Idea: a good attribute splits the examples into subsets that are (ideally) "all positive" or "all negative".
  [Figure: splitting the 12 examples on Patrons? (None/Some/Full) vs. Type? (French/Italian/Thai/Burger).]
  Patrons? is a better choice: it gives information about the classification.

  16. Information
  The more clueless I am about the answer initially, the more information is contained in the answer:
  – 0 bits to answer a query on a coin that only comes up heads
  – 1 bit to answer a Boolean question with prior ⟨0.5, 0.5⟩
  – 2 bits to answer a query on a fair die with 4 faces
  – a query on a coin with 99% probability of returning heads brings less information than the query on a fair coin.
  Shannon formalized this with the concept of entropy. A random variable X with values x_k and probabilities Pr(x_k) has entropy:
  H(X) = − Σ_k Pr(x_k) log_2 Pr(x_k)
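A small sketch of this formula (the helper name `entropy` is mine; the probabilities are assumed to sum to 1):

```python
from math import log2

def entropy(probs):
    """H(X) = -sum_k Pr(x_k) * log2 Pr(x_k); zero-probability terms contribute 0."""
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))                # 1.0 bit for a fair coin
print(entropy([0.25, 0.25, 0.25, 0.25]))  # 2.0 bits for a fair 4-sided die
print(entropy([0.99, 0.01]))              # ~0.08 bits for a heavily biased coin
```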

  17. Suppose we have p positive and n negative examples in a training set; then the entropy is H(⟨p/(p+n), n/(p+n)⟩).
  E.g., for the 12 restaurant examples, p = n = 6, so we need 1 bit to classify a new example: the information of the table.
  An attribute A splits the training set E into subsets E_1, ..., E_d, each of which (we hope) needs less information to complete the classification.
  Let E_i have p_i positive and n_i negative examples: H(⟨p_i/(p_i+n_i), n_i/(p_i+n_i)⟩) bits are needed to classify a new example on that branch.
  The expected entropy after branching is
  Remainder(A) = Σ_i (p_i + n_i)/(p + n) · H(⟨p_i/(p_i+n_i), n_i/(p_i+n_i)⟩)
  The information gain from attribute A is
  Gain(A) = H(⟨p/(p+n), n/(p+n)⟩) − Remainder(A)
  ⇒ choose the attribute that maximizes the gain
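A sketch of Remainder and Gain (the helper names `b` and `gain` are mine), applied to the Patrons? and Type? splits of the 12-example table above:

```python
from math import log2

def b(q):
    """Entropy of a Boolean variable that is true with probability q."""
    return 0.0 if q in (0.0, 1.0) else -(q * log2(q) + (1 - q) * log2(1 - q))

def gain(p, n, branches):
    """branches: list of (p_i, n_i) counts after splitting on an attribute."""
    remainder = sum((pi + ni) / (p + n) * b(pi / (pi + ni)) for pi, ni in branches)
    return b(p / (p + n)) - remainder

# Patrons? splits the 12 restaurant examples into None (0+, 2-), Some (4+, 0-), Full (2+, 4-).
print(gain(6, 6, [(0, 2), (4, 0), (2, 4)]))           # ≈ 0.541 bits
# Type? splits them into French (1+, 1-), Italian (1+, 1-), Thai (2+, 2-), Burger (2+, 2-).
print(gain(6, 6, [(1, 1), (1, 1), (2, 2), (2, 2)]))   # 0.0 bits
```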

  18. Example contd.
  Decision tree learned from the 12 examples:
  [Figure: learned tree. Root: Patrons? (None → F, Some → T, Full → Hungry?); Hungry? (No → F, Yes → Type?); Type? (French → T, Italian → F, Burger → T, Thai → Fri/Sat?); Fri/Sat? (No → F, Yes → T).]
  Substantially simpler than the "true" tree: a more complex hypothesis isn't justified by the small amount of data.

  19. Performance measurement
  Learning curve = % correct on test set as a function of training set size.
  [Figure: learning curve on the restaurant data, averaged over 20 trials.]
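One way such a curve is produced is sketched below; the plurality-class learner is only a stand-in so the snippet is self-contained, whereas in the lecture's setting the learner would be the decision-tree procedure above:

```python
import random
from collections import Counter

def plurality_learner(train):
    """Stand-in learner: always predicts the most common class in the training set."""
    majority = Counter(y for _, y in train).most_common(1)[0][0]
    return lambda x: majority

def learning_curve(examples, learner, sizes, trials=20, test_fraction=0.25):
    """% correct on a held-out test set vs. training set size, averaged over `trials` runs."""
    curve = []
    for m in sizes:
        accuracies = []
        for _ in range(trials):
            data = examples[:]
            random.shuffle(data)
            n_test = int(len(data) * test_fraction)
            test, train = data[:n_test], data[n_test:n_test + m]
            h = learner(train)
            accuracies.append(sum(h(x) == y for x, y in test) / len(test))
        curve.append((m, sum(accuracies) / trials))
    return curve

# Illustrative use: learning_curve(restaurant_examples, plurality_learner, sizes=range(1, 10))
```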

  20. Overfitting and Pruning
  Pruning by statistical testing: under the null hypothesis, the expected numbers p̂_k and n̂_k are
  p̂_k = p · (p_k + n_k)/(p + n),   n̂_k = n · (p_k + n_k)/(p + n)
  Δ = Σ_{k=1}^{d} [ (p_k − p̂_k)² / p̂_k + (n_k − n̂_k)² / n̂_k ]
  Δ follows a χ² distribution with d − 1 degrees of freedom.
  Early stopping misses combinations of attributes that are informative.
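A sketch of the test (the function names are mine; the 5% significance level and the SciPy dependency are assumptions, not from the slides):

```python
from scipy.stats import chi2

def delta(branches):
    """branches: list of (p_k, n_k) observed counts for the d values of the attribute."""
    p = sum(pk for pk, _ in branches)
    n = sum(nk for _, nk in branches)
    total = 0.0
    for pk, nk in branches:
        p_hat = p * (pk + nk) / (p + n)      # expected positives under the null hypothesis
        n_hat = n * (pk + nk) / (p + n)      # expected negatives under the null hypothesis
        total += (pk - p_hat) ** 2 / p_hat + (nk - n_hat) ** 2 / n_hat
    return total

def is_irrelevant(branches, alpha=0.05):
    """Prune (treat the split as noise) if delta is below the chi-square critical value, d-1 dof."""
    d = len(branches)
    return delta(branches) < chi2.ppf(1 - alpha, d - 1)

# Patrons? split on the restaurant data: delta ≈ 6.67 > 5.99, so the split is kept.
print(is_irrelevant([(0, 2), (4, 0), (2, 4)]))   # False
```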

  21. Further Issues
  – Missing data
  – Multivalued attributes
  – Continuous input attributes
  – Continuous-valued output attributes

  22. Decision Trees
