

1. CS534: Machine Learning
Thomas G. Dietterich
221C Dearborn Hall
tgd@cs.orst.edu
http://www.cs.orst.edu/~tgd/classes/534

2. Course Overview
– Introduction: basic problems and questions in machine learning; example applications
– Linear Classifiers
– Five Popular Algorithms:
  – Decision trees (C4.5)
  – Neural networks (backpropagation)
  – Probabilistic networks (Naïve Bayes; mixture models)
  – Support Vector Machines (SVMs)
  – Nearest Neighbor Method
– Theories of Learning: PAC, Bayesian, bias-variance analysis
– Optimizing Test Set Performance: overfitting, penalty methods, holdout methods, ensembles
– Sequential and Spatial Data: hidden Markov models, conditional random fields; hidden Markov SVMs
– Problem Formulation: designing input and output representations

3. Supervised Learning
– Given: training examples ⟨x, f(x)⟩ for some unknown function f.
– Find: a good approximation to f.
Example Applications
– Handwriting recognition
  x: data from pen motion
  f(x): letter of the alphabet
– Disease diagnosis
  x: properties of patient (symptoms, lab tests)
  f(x): disease (or maybe, recommended therapy)
– Face recognition
  x: bitmap picture of person's face
  f(x): name of person
– Spam detection
  x: email message
  f(x): spam or not spam

4. Appropriate Applications for Supervised Learning
– Situations where there is no human expert
  x: bond graph of a new molecule
  f(x): predicted binding strength to AIDS protease molecule
– Situations where humans can perform the task but can't describe how they do it
  x: bitmap picture of hand-written character
  f(x): ASCII code of the character
– Situations where the desired function is changing frequently
  x: description of stock prices and trades for the last 10 days
  f(x): recommended stock transactions
– Situations where each user needs a customized function f
  x: incoming email message
  f(x): importance score for presenting to the user (or deleting without presenting)

5. Formal Setting
[Diagram: an unknown distribution P(x, y) generates the training points ⟨x, y⟩ and a test point; the training sample feeds the learning algorithm, which outputs a classifier f; on the test point, f maps x to ŷ, and the loss function compares ŷ with y to give L(ŷ, y).]
– Training examples are drawn independently at random according to an unknown probability distribution P(x, y).
– The learning algorithm analyzes the examples and produces a classifier f.
– Given a new data point ⟨x, y⟩ drawn from P, the classifier is given x and predicts ŷ = f(x).
– The loss L(ŷ, y) is then measured.
– Goal of the learning algorithm: find the f that minimizes the expected loss.
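To make the train/test protocol concrete, here is a minimal Python sketch. The distribution, learning algorithm, and loss below are stand-ins invented for illustration (none of them appear on the slide); the point is only the flow: sample training data from P, produce f, then estimate the expected loss on fresh samples.

```python
import random

def draw_example():
    """Stand-in for drawing one <x, y> pair i.i.d. from the unknown P(x, y)."""
    x = random.gauss(0.0, 1.0)
    y = 1 if x + random.gauss(0.0, 0.5) > 0 else 0
    return x, y

def learn(examples):
    """Stand-in learning algorithm: threshold x halfway between the class means."""
    xs0 = [x for x, y in examples if y == 0]
    xs1 = [x for x, y in examples if y == 1]
    threshold = (sum(xs0) / len(xs0) + sum(xs1) / len(xs1)) / 2.0
    return lambda x: 1 if x > threshold else 0

def loss(y_hat, y):
    """0/1 loss L(y_hat, y)."""
    return 0 if y_hat == y else 1

train = [draw_example() for _ in range(100)]    # training sample drawn from P(x, y)
f = learn(train)                                # learning algorithm outputs classifier f
test = [draw_example() for _ in range(10000)]   # fresh test points from the same P
avg_loss = sum(loss(f(x), y) for x, y in test) / len(test)
print("estimated expected loss:", avg_loss)
```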

6. Formal Version of Spam Detection
– P(x, y): distribution of email messages x and their true labels y ("spam" or "not spam")
– Training sample: a set of email messages that have been labeled by the user
– Learning algorithm: what we study in this course!
– f: the classifier output by the learning algorithm
– Test point: a new email message x (with its true, but hidden, label y)
– Loss function L(ŷ, y):

                         true label y
    predicted label ŷ    spam    not spam
    spam                   0        10
    not spam               1         0
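Written out as a lookup table (a minimal Python sketch; the name LOSS is just illustrative), the asymmetry is easy to see: junking a legitimate message costs 10, while letting one spam message through costs only 1.

```python
# The loss table above, made explicit: keys are (predicted label, true label).
LOSS = {
    ("spam", "spam"): 0,          # correct: no loss
    ("spam", "not spam"): 10,     # false positive: a real message is lost
    ("not spam", "spam"): 1,      # false negative: spam reaches the inbox
    ("not spam", "not spam"): 0,  # correct: no loss
}
```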

7. Three Main Approaches to Machine Learning
– Learn a classifier: a function f.
– Learn a conditional distribution: a conditional distribution P(y | x).
– Learn the joint probability distribution: P(x, y).
In the first two weeks, we will study one example of each method:
– Learn a classifier: the LMS algorithm
– Learn a conditional distribution: logistic regression
– Learn the joint distribution: linear discriminant analysis

8. Inferring a classifier f from P(y | x)
Predict the ŷ that minimizes the expected loss:

    f(x) = argmin_ŷ E_{y|x}[ L(ŷ, y) ]
         = argmin_ŷ Σ_y P(y | x) L(ŷ, y)
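A minimal Python sketch of this decision rule, assuming the conditional distribution and the loss are available as dictionaries like the ones sketched above (the function name is illustrative):

```python
def minimum_expected_loss_prediction(p_y_given_x, loss):
    """Return the y_hat minimizing sum_y P(y | x) * L(y_hat, y)."""
    labels = list(p_y_given_x)
    return min(labels,
               key=lambda y_hat: sum(p_y_given_x[y] * loss[(y_hat, y)]
                                     for y in labels))
```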

9. Example: Making the Spam Decision
Suppose our spam detector predicts that P(y = "spam" | x) = 0.6. What is the optimal classification decision ŷ?
(Loss values are those from the table on slide 6; P(y | x) assigns 0.6 to "spam" and 0.4 to "not spam".)
– Expected loss of ŷ = "spam" is 0 × 0.6 + 10 × 0.4 = 4.
– Expected loss of ŷ = "not spam" is 1 × 0.6 + 0 × 0.4 = 0.6.
– Therefore, the optimal prediction is "not spam".
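The same arithmetic, checked in a few lines of Python (a self-contained sketch that re-enters the slide-6 loss table):

```python
p_y_given_x = {"spam": 0.6, "not spam": 0.4}
loss = {("spam", "spam"): 0, ("spam", "not spam"): 10,
        ("not spam", "spam"): 1, ("not spam", "not spam"): 0}

for y_hat in ("spam", "not spam"):
    expected = sum(p_y_given_x[y] * loss[(y_hat, y)] for y in p_y_given_x)
    print(y_hat, expected)   # spam -> 4.0, not spam -> 0.6, so predict "not spam"
```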

10. Inferring a classifier from the joint distribution P(x, y)
We can compute the conditional distribution according to the definition of conditional probability:

    P(y = k | x) = P(x, y = k) / Σ_j P(x, y = j)

In words: compute P(x, y = k) for each value of k, then normalize these numbers. Compute ŷ using the method from the previous slide.
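A minimal sketch of the normalization step in Python; the joint probabilities here are made-up numbers for a single fixed x, chosen so the result matches the 0.6 / 0.4 split used on slide 9:

```python
# Hypothetical joint probabilities P(x, y = k) for one particular message x.
joint = {"spam": 0.03, "not spam": 0.02}

total = sum(joint.values())                              # sum_j P(x, y = j)
p_y_given_x = {k: v / total for k, v in joint.items()}   # P(y = k | x)
print(p_y_given_x)   # {'spam': 0.6, 'not spam': 0.4}
```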

11. Fundamental Problem of Machine Learning: It is ill-posed

    Example   x1   x2   x3   x4   y
       1       0    0    1    0   0
       2       0    1    0    0   0
       3       0    0    1    1   1
       4       1    0    0    1   1
       5       0    1    1    0   0
       6       1    1    0    0   0
       7       0    1    0    1   0
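One way to see why the problem is ill-posed: these seven examples pin down f on only 7 of the 16 possible inputs, so many distinct boolean functions fit the training data perfectly. A small Python sketch that counts them (the table is transcribed from this slide; everything else is illustrative):

```python
from itertools import product

# Training examples from the table: (x1, x2, x3, x4) -> y
examples = {
    (0, 0, 1, 0): 0,
    (0, 1, 0, 0): 0,
    (0, 0, 1, 1): 1,
    (1, 0, 0, 1): 1,
    (0, 1, 1, 0): 0,
    (1, 1, 0, 0): 0,
    (0, 1, 0, 1): 0,
}

inputs = list(product((0, 1), repeat=4))     # all 16 possible 4-bit inputs
consistent = 0
for outputs in product((0, 1), repeat=16):   # every boolean function on those inputs
    f = dict(zip(inputs, outputs))
    if all(f[x] == y for x, y in examples.items()):
        consistent += 1
print(consistent)   # 512 = 2**(16 - 7): each unseen input can be labeled freely
```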
