Learning Objectives
At the end of the class you should be able to:
- identify a supervised learning problem
- characterize how the prediction is a function of the error measure
- avoid mixing the training and test sets
Supervised Learning
Given:
- a set of input features X_1, ..., X_n
- a set of target features Y_1, ..., Y_k
- a set of training examples, where the values for the input features and the target features are given for each example
- a new example, where only the values for the input features are given
predict the values for the target features for the new example.
- classification: when the Y_i are discrete
- regression: when the Y_i are continuous
Example Data Representations
A travel agent wants to predict the preferred length of a trip, which can be from 1 to 6 days. (No input features.)
Two representations of the same data:
- Y is the length of trip chosen.
- Each Y_i is an indicator variable that has value 1 if the chosen length is i, and 0 otherwise.

Example  Y        Example  Y_1  Y_2  Y_3  Y_4  Y_5  Y_6
e_1      1        e_1      1    0    0    0    0    0
e_2      6        e_2      0    0    0    0    0    1
e_3      6        e_3      0    0    0    0    0    1
e_4      2        e_4      0    1    0    0    0    0
e_5      1        e_5      1    0    0    0    0    0

What is a prediction?
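A minimal sketch of how the two representations relate (the variable and function names here are our own, not from the slides): the indicator representation is what is now commonly called a one-hot encoding of the scalar Y.

```python
# Scalar representation: the chosen trip length for each example.
lengths = {"e1": 1, "e2": 6, "e3": 6, "e4": 2, "e5": 1}

def to_indicators(y, k=6):
    """Indicator-variable representation of length y:
    a list of k values where position i is 1 iff y == i+1."""
    return [1 if y == i + 1 else 0 for i in range(k)]

for name, y in lengths.items():
    print(name, y, to_indicators(y))
# e1 1 [1, 0, 0, 0, 0, 0]
# e2 6 [0, 0, 0, 0, 0, 1]
# ...
```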
Evaluating Predictions
Suppose we want to make a prediction of a value for a target feature on example e:
- o_e is the observed value of the target feature on example e.
- p_e is the predicted value of the target feature on example e.
- The error of the prediction is a measure of how close p_e is to o_e.
- There are many possible errors that could be measured.
- Sometimes p_e can be a real number even though o_e can only have a few values.
Measures of error
E is the set of examples, with a single target feature. For e ∈ E, o_e is the observed value and p_e is the predicted value:
- absolute error: L_1(E) = ∑_{e∈E} |o_e − p_e|
- sum-of-squares error: L_2^2(E) = ∑_{e∈E} (o_e − p_e)^2
- worst-case error: L_∞(E) = max_{e∈E} |o_e − p_e|
- number wrong: L_0(E) = #{e : o_e ≠ p_e}
- A cost-based error takes into account the costs of the various errors.
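A minimal Python sketch of the four error measures above (the function names are our own):

```python
def abs_error(obs, pred):
    """L1: sum of absolute differences."""
    return sum(abs(o - p) for o, p in zip(obs, pred))

def sum_squares_error(obs, pred):
    """L2^2: sum of squared differences."""
    return sum((o - p) ** 2 for o, p in zip(obs, pred))

def worst_case_error(obs, pred):
    """L-infinity: largest absolute difference."""
    return max(abs(o - p) for o, p in zip(obs, pred))

def number_wrong(obs, pred):
    """L0: count of examples where the prediction differs."""
    return sum(1 for o, p in zip(obs, pred) if o != p)
```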
Measures of error (cont.)
With a binary feature, o_e ∈ {0, 1}:
- likelihood of the data: ∏_{e∈E} p_e^{o_e} (1 − p_e)^{1 − o_e}
- log likelihood: ∑_{e∈E} (o_e log p_e + (1 − o_e) log(1 − p_e))
  is the negative of the number of bits to encode the data given a code based on p_e.
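A quick sketch of the log likelihood in Python, using log base 2 so that its negation is the number of bits to encode the data (the function name is our own):

```python
import math

def log_likelihood(obs, pred):
    """Base-2 log likelihood of binary observations under predicted
    probabilities; negate it to get bits to encode the data."""
    return sum(o * math.log2(p) + (1 - o) * math.log2(1 - p)
               for o, p in zip(obs, pred))

# e.g. predicting p_e = 0.8 for every example:
obs = [1, 1, 0, 1]
print(log_likelihood(obs, [0.8] * len(obs)))  # about -3.29 bits
```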
Information theory overview
- A bit is a binary digit.
- 1 bit can distinguish 2 items.
- k bits can distinguish 2^k items.
- n items can be distinguished using log_2 n bits.
Can we do better?
Information and Probability
Consider a code to distinguish elements of {a, b, c, d} with
P(a) = 1/2, P(b) = 1/4, P(c) = 1/8, P(d) = 1/8.
Consider the code:
  a: 0    b: 10    c: 110    d: 111
This code uses 1 to 3 bits. On average, it uses
  P(a)×1 + P(b)×2 + P(c)×3 + P(d)×3 = 1/2 + 2/4 + 3/8 + 3/8 = 1 3/4 bits.
The string aacabbda has code 00110010101110.
The code 0111110010100 represents the string adcabba.
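A small sketch that encodes and decodes with this code (the helper names are our own). Decoding can proceed greedily because no codeword is a prefix of another:

```python
CODE = {"a": "0", "b": "10", "c": "110", "d": "111"}

def encode(s):
    """Concatenate the codeword for each symbol."""
    return "".join(CODE[ch] for ch in s)

def decode(bits):
    """Greedily match codewords; valid because the code is prefix-free."""
    inverse = {v: k for k, v in CODE.items()}
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in inverse:
            out.append(inverse[buf])
            buf = ""
    return "".join(out)

print(encode("aacabbda"))       # 00110010101110
print(decode("0111110010100"))  # adcabba
```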
Information Content
- To identify x, we need −log_2 P(x) bits.
- Given a distribution over a set, the expected number of bits to identify a member,
    ∑_x −P(x) × log_2 P(x),
  is the information content or entropy of the distribution.
- The expected number of bits it takes to describe a distribution given evidence e:
    I(e) = ∑_x −P(x | e) × log_2 P(x | e).
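A minimal entropy function in Python (name ours), checked against the code example above: the entropy of that distribution equals the average code length, 1.75 bits.

```python
import math

def entropy(probs):
    """Information content of a distribution, in bits:
    sum over x of -P(x) * log2 P(x)."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# The distribution from the coding example:
print(entropy([1/2, 1/4, 1/8, 1/8]))  # 1.75
```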
Information Gain
Given a test that can distinguish the cases where α is true from the cases where α is false, the information gain from this test is:
  I(true) − (P(α) × I(α) + P(¬α) × I(¬α))
- I(true) is the expected number of bits needed before the test.
- P(α) × I(α) + P(¬α) × I(¬α) is the expected number of bits after the test.
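A sketch of information gain for a boolean test about a boolean target (the function and its interface are hypothetical helpers of our own, not from the slides; `entropy` is the function defined above):

```python
def info_gain(examples, test):
    """Information gain of a boolean test about a boolean target.
    examples: list of (input, target) pairs; test: input -> bool."""
    def bits(exs):
        # expected bits to identify the target value among exs
        if not exs:
            return 0.0
        p = sum(1 for _, y in exs if y) / len(exs)
        return entropy([p, 1 - p])
    yes = [ex for ex in examples if test(ex[0])]
    no = [ex for ex in examples if not test(ex[0])]
    p_yes = len(yes) / len(examples)
    return bits(examples) - (p_yes * bits(yes) + (1 - p_yes) * bits(no))

# A test that perfectly separates the targets gains a full bit:
examples = [(1, True), (2, True), (3, False), (4, False)]
print(info_gain(examples, lambda x: x <= 2))  # 1.0
```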
Linear Predictions
[Figure: a data set with the linear predictions that minimize the L_1, L_2^2, and L_∞ errors.]
Point Estimates
To make a single prediction for feature Y, with examples E:
- The prediction that minimizes the sum-of-squares error on E is the mean (average) value of Y.
- The prediction that minimizes the absolute error on E is the median value of Y.
- The prediction that minimizes the number wrong on E is the mode (most common value) of Y.
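A quick check of these claims on the trip-length data from earlier (a sketch; note 1 and 6 tie as most common here, and any most-common value minimizes the number wrong — statistics.mode breaks the tie by returning the first one seen):

```python
import statistics

Y = [1, 6, 6, 2, 1]  # trip lengths from the earlier example

mean = statistics.mean(Y)      # minimizes sum-of-squares error
median = statistics.median(Y)  # minimizes absolute error
mode = statistics.mode(Y)      # minimizes number wrong

print(mean, median, mode)  # 3.2 2 1
```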