CS440/ECE448: Intro to Artificial Intelligence
Lecture 21: Classification; Decision Trees
Prof. Julia Hockenmaier
juliahmr@illinois.edu
http://cs.illinois.edu/fa11/cs440
Supervised learning: classification
Supervised learning
Given a set D of N items x_i, each paired with an output value y_i = f(x_i), discover a function h(x) which approximates f(x):
D = {(x_1, y_1), …, (x_N, y_N)}
Typically, the input values x are (real-valued or boolean) vectors: x_i ∈ R^n or x_i ∈ {0,1}^n.
The output values y are either boolean (binary classification), elements of a finite set (multiclass classification), or real (regression).
Supervised learning: train and test
Training: find h(x)
Given a training set D_train of items (x_i, y_i = f(x_i)), return a function h(x) which approximates f(x).
Testing: how well does h(x) generalize?
Given a test set D_test of items x_i that is disjoint from D_train, evaluate how close h(x) is to f(x).
– (classification) accuracy: percentage of x_i ∈ D_test with h(x_i) = f(x_i)
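As a concrete illustration of the accuracy measure above, here is a minimal Python sketch; the classifier h and the tiny test set are hypothetical stand-ins, not part of the lecture.

```python
def accuracy(h, test_set):
    """Fraction of test items (x, y) for which the hypothesis h predicts the true label y."""
    correct = sum(1 for x, y in test_set if h(x) == y)
    return correct / len(test_set)

# Hypothetical example: a trivial classifier and a two-item test set.
h = lambda x: x["drink"] == "coffee"          # predict "wants sugar" iff the customer drinks coffee
test_set = [({"drink": "coffee"}, True), ({"drink": "tea"}, False)]
print(accuracy(h, test_set))                  # 1.0
```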
N-fold cross-validation
A better indication of how well h(x) generalizes:
– Split the data into N equal-sized parts.
– Run and evaluate N experiments.
– Report average accuracy, variance, etc.
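A minimal sketch of the N-fold procedure, assuming a train(...) function that returns a hypothesis and the accuracy helper from the previous sketch; both names are this sketch's own, not from the slides.

```python
def cross_validate(data, train, n_folds=5):
    """Split data into n_folds parts; train on n_folds-1 parts, test on the held-out part, average."""
    fold_size = len(data) // n_folds
    scores = []
    for i in range(n_folds):
        test_fold = data[i * fold_size:(i + 1) * fold_size]
        train_folds = data[:i * fold_size] + data[(i + 1) * fold_size:]
        h = train(train_folds)               # returns a hypothesis h(x)
        scores.append(accuracy(h, test_fold))
    return sum(scores) / n_folds             # average accuracy across the N folds
```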
The Naïve Bayes Classifier
[Bayes net: class node C with attribute nodes A_1, A_2, …, A_n as its children]
Each item has a number of attributes A_1 = a_1, …, A_n = a_n.
We predict the class c based on
c = argmax_c ∏_i P(A_i = a_i | C=c) P(C=c)
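A minimal sketch of the argmax rule above; representing the prior and the conditional probability tables as plain dictionaries is a choice of this sketch, not something the lecture prescribes.

```python
import math

def nb_predict(attributes, classes, prior, likelihood):
    """Return argmax_c P(C=c) * prod_i P(A_i=a_i | C=c), computed in log space for stability.

    prior[c] = P(C=c); likelihood[(attr, value, c)] = P(A_attr = value | C=c).
    """
    best_class, best_score = None, float("-inf")
    for c in classes:
        score = math.log(prior[c])
        for attr, value in attributes.items():
            score += math.log(likelihood[(attr, value, c)])
        if score > best_score:
            best_class, best_score = c, score
    return best_class
```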
An example
A1: drink | A2: milk? | C: sugar?
coffee    | no        | yes
coffee    | yes       | no
tea       | yes       | yes
tea       | no        | no
Can you train a Naïve Bayes classifier to predict whether the customer wants sugar or not?
What is P(coffee | sugar)?
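The probabilities a Naïve Bayes classifier needs can be read off the table by counting; the short sketch below does this for P(drink = coffee | sugar = yes) using the four rows above.

```python
# The four training items from the slide: (drink, milk?, sugar?)
data = [("coffee", "no", "yes"), ("coffee", "yes", "no"),
        ("tea", "yes", "yes"), ("tea", "no", "no")]

sugar_rows = [row for row in data if row[2] == "yes"]
p_coffee_given_sugar = sum(1 for row in sugar_rows if row[0] == "coffee") / len(sugar_rows)
print(p_coffee_given_sugar)   # 0.5: one of the two sugar=yes rows has drink=coffee
```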
Questions that came up in class…
What are the independence assumptions that Naïve Bayes makes?
Are drink and milk independent random variables?
Are they conditionally independent, given sugar?
What happens when your Bayes net makes independence assumptions that are incorrect?
Decision trees
Decision trees
drink?
– coffee → milk?
    – no  → sugar
    – yes → no sugar
– tea → milk?
    – yes → sugar
    – no  → no sugar
In this example, the attributes (drink; milk?) are not conditionally independent given the class (ʻsugarʼ).
What is a decision tree?
[Figure: a generic decision tree. Internal nodes apply tests (Test 2, Test 3, …); each branch corresponds to one outcome of the test (values V11, V12, V13, V21, V22); leaves carry labels (Label 1, Label 2).]
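One way to make the picture concrete is a small recursive data structure; the class and field names below are this sketch's own choices, not anything fixed by the lecture.

```python
class DecisionNode:
    """Internal node: applies a test to an item and follows the branch for the test's outcome."""
    def __init__(self, test, branches):
        self.test = test            # function: item -> outcome value
        self.branches = branches    # dict: outcome value -> DecisionNode or Leaf

    def classify(self, item):
        return self.branches[self.test(item)].classify(item)

class Leaf:
    """Leaf: returns a fixed label."""
    def __init__(self, label):
        self.label = label

    def classify(self, item):
        return self.label

# The coffee/tea tree from the previous slide, written as a hypothetical usage example:
tree = DecisionNode(lambda x: x["drink"], {
    "coffee": DecisionNode(lambda x: x["milk"], {"no": Leaf("sugar"), "yes": Leaf("no sugar")}),
    "tea":    DecisionNode(lambda x: x["milk"], {"yes": Leaf("sugar"), "no": Leaf("no sugar")}),
})
print(tree.classify({"drink": "coffee", "milk": "no"}))   # sugar
```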
Suppose I like circles that are red (I might not be aware of the rule)
Features:
– Owner: John, Mary, Sam
– Size: Large, Small
– Shape: Triangle, Circle, Square
– Texture: Rough, Smooth
– Color: Blue, Red, Green, Yellow, Taupe
[Decision tree: test Shape; Triangle and Square are labelled −; Circle leads to a test on Color, where Red is + and Blue, Green, Yellow, Taupe are −.]
∀x [Like(x) ⇔ (Circle(x) ∧ Red(x))]
Suppose I like circles that are red and triangles that are smooth
[Decision tree: test Shape; Square is labelled −; Triangle leads to a test on Texture (Smooth is +, Rough is −); Circle leads to a test on Color (Red is +, the other colors are −).]
∀x [Like(x) ⇔ ((Circle(x) ∧ Red(x)) ∨ (Triangle(x) ∧ Smooth(x)))]
Expressiveness of decision trees
Consider binary classification (y = true, false) with Boolean attributes.
Each path from the root to a leaf node is a conjunction of propositions.
The goal (y = true) corresponds to a disjunction of such conjunctions.
How many different decision trees are there?
With n Boolean attributes, there are 2^n possible kinds of examples.
One decision tree = assign true to one subset of these 2^n kinds of examples.
There are 2^(2^n) possible decision trees!
(10 attributes: 2^1024 ≈ 10^308 trees; 20 attributes: ≈ 10^300,000 trees)
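The counts in parentheses are easy to sanity-check, for instance with the following aside (not from the slides):

```python
import math
print(1024 * math.log10(2))     # ≈ 308.25, so 2**1024 ≈ 10**308 trees for 10 attributes
print(2**20 * math.log10(2))    # ≈ 315,653, i.e. well over 10**300000 trees for 20 attributes
```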
Example space and hypothesis space
Example space: the set of all possible examples x (this depends on our feature representation).
Hypothesis space: the set of all possible hypotheses h(x) that a particular classifier can express.
Machine Learning as an Empirically Guided Search through the Hypothesis Space
[Figure: labelled examples (+ and −) on one side, the space of hypotheses on the other; learning searches the hypothesis space, guided by the examples.]
What makes a (test / split / feature) useful?
Improved homogeneity
– Entropy reduction = information gain
To evaluate a split's utility:
– Measure the entropy / information required before the split.
– Measure the entropy / information required after the split.
– Subtract.
Entropy: the expected number of bits needed to communicate the label of an item chosen randomly from a set.
Training Data
[Figure: two sets of labelled (+/−) training items. One is highly disorganized: high entropy, much information required to communicate a label. The other is highly organized: low entropy, little information required.]
Measuring Information
H denotes information need, or entropy.
H(S) = bits required to label some x ∈ S.
What is the upper bound if label ∈ {+, −}?
What is H(S_1)?   S_1 = + + +
Measuring Information
H(S) = bits required to label some x ∈ S.
What is the upper bound if label ∈ {+, −}?
What is H(S_1)?
What is H(S_2)?   S_2 = − − − −
Measuring Information
H(S) = bits required to label some x ∈ S.
What is the upper bound if label ∈ {+, −}?
What is H(S_1)? What is H(S_2)?
What is H(S_3)?   S_3 = a large set consisting entirely of +
Measuring Information
H(S) = bits required to label some x ∈ S.
What is the upper bound if label ∈ {+, −}?
What is H(S_1)? What is H(S_2)? What is H(S_3)?
What is H(S_4)?   S_4 = + −
Measuring Information
H(S) = bits required to label some x ∈ S.
What is the upper bound if label ∈ {+, −}?
What is H(S_1)? What is H(S_2)? What is H(S_3)? What is H(S_4)?
What is H(S_5)?   S_5 = a large set, half + and half −
Measuring Information
H(S) = bits required to label some x ∈ S.
What is the upper bound if label ∈ {+, −}?
What is H(S_1)? What is H(S_2)? What is H(S_3)? What is H(S_4)? What is H(S_5)?
What is H(S_6)?   S_6 = a large set of +, with a single −
Think of the expected number of bits: H(S_6) should be closer to 0 than to 1.
Measuring Information
H(S) = bits required to label some x ∈ S.
Label ∈ {A, B, C, D, E, F}: what is the upper bound now?
What is H(S_7)?   S_7 = 32 items: 16 A, 8 B, 2 C, 2 D, 2 E, 2 F
One possible variable-length code:
For | Say
A   | 1
B   | 01
C   | 0000
D   | 0001
E   | 0010
F   | 0011
This code sometimes needs 4 bits per label (worse than the 3-bit fixed-length code).
Measuring Information
What is the expected number of bits for S_7 with this code?
– 16/32 of the items (the A's) use 1 bit
– 8/32 (the B's) use 2 bits
– 4 × 2/32 (the C's, D's, E's and F's) use 4 bits each
0.5(1) + 0.25(2) + 0.0625(4) + 0.0625(4) + 0.0625(4) + 0.0625(4)
= 0.5 + 0.5 + 0.25 + 0.25 + 0.25 + 0.25
= 2
In general: H(S) = − Σ_{v ∈ Labels} P(v) log₂ P(v)
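A minimal sketch of the entropy formula, checked against the worked example above (16 A's, 8 B's, and 2 each of C through F give exactly 2 bits):

```python
import math
from collections import Counter

def entropy(labels):
    """H(S) = -sum over label values v of P(v) * log2 P(v), with P(v) estimated from counts."""
    counts = Counter(labels)
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

s7 = ["A"] * 16 + ["B"] * 8 + ["C"] * 2 + ["D"] * 2 + ["E"] * 2 + ["F"] * 2
print(entropy(s7))            # 2.0 bits, matching the hand calculation
print(entropy(["+"] * 3))     # 0.0: a pure set like S_1 needs no bits
print(entropy(["+", "-"]))    # 1.0: a 50/50 set like S_4 needs one bit
```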
From N+, N− to H(P)
For a binomial distribution: P = N+ / (N+ + N−)
Entropy: H(P) = −P log₂(P) − (1−P) log₂(1−P)
H(9/14) = H(0.64) = 0.940
[Figure: plot of the entropy H(P) of a binomial distribution as a function of P, rising from 0 at P = 0 to 1 at P = 0.5 and falling back to 0 at P = 1.]
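The H(9/14) = 0.940 value on the slide can be verified directly:

```python
import math
p = 9 / 14
print(-p * math.log2(p) - (1 - p) * math.log2(1 - p))   # ≈ 0.940
```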
Information Gain
[Figure: a set S_b with entropy H(S_b) is split by an attribute into subsets S_a1, S_a2, S_a3, with entropies H(S_a1), H(S_a2), H(S_a3).]
Information Gain
Idea: subtract the information required after the split from the information required before the split.
Information required before the split: H(S_b)
Information required after the split: P(S_a1)⋅H(S_a1) + P(S_a2)⋅H(S_a2) + P(S_a3)⋅H(S_a3)
P(S_ai): estimate from sample counts, P(S_ai) = |S_ai| / |S_b|
Information Gain = H(S_b) − Σ_i P(S_ai)⋅H(S_ai)
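A minimal sketch of the gain computation, reusing the entropy helper from the earlier sketch; the labelled subsets in the usage lines are made-up counts, chosen only to exercise the formula.

```python
def information_gain(parent_labels, child_label_sets):
    """H(S_b) minus the count-weighted average entropy of the subsets produced by the split."""
    total = len(parent_labels)
    after = sum(len(child) / total * entropy(child) for child in child_label_sets)
    return entropy(parent_labels) - after

# Hypothetical split of a 9+/5- set into two subsets (illustrative counts only):
parent = ["+"] * 9 + ["-"] * 5
children = [["+"] * 6 + ["-"] * 1, ["+"] * 3 + ["-"] * 4]
print(information_gain(parent, children))   # gain in bits; larger means a more useful split
```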
An example
Will I Play Tennis?
Features:
– Outlook: Sun, Overcast, Rain
– Temperature: Hot, Mild, Cool
– Humidity: High, Normal, Low
– Wind: Strong, Weak
– Label: +, −
Features are evaluated in the morning.
Tennis is played in the afternoon.
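To fix the representation in code: each day is one example, a mapping from the attributes listed above to one of their values, paired with a +/− label. The particular values below are hypothetical, chosen only to show the encoding.

```python
# One hypothetical training example (attribute values drawn from the lists above):
example = {"Outlook": "Sun", "Temperature": "Hot", "Humidity": "High", "Wind": "Weak"}
label = "-"

# A training set is a list of (example, label) pairs, ready for the
# entropy / information-gain machinery sketched in the previous slides.
```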