

1. ECE 4524 Artificial Intelligence and Engineering Applications
   Lecture 22: Introduction to Learning
   Reading: AIAMA 18.1-18.3
   Today's Schedule:
   ◮ Motivation for Learning
   ◮ Types of Learning
   ◮ Supervised Learning and Hypothesis Spaces
   ◮ Example: Decision Trees

2. Why learning?
   ◮ not all information is known at design time
   ◮ it might be impractical to program all possibilities directly
   ◮ some agents need to be able to adapt over time
   ◮ we might not know how to solve a problem directly by design
   This area in general is referred to as Machine Learning.

3. Learning is a very general concept. It can be applied to all elements of an agent's design, e.g. we might
   ◮ learn functions mapping percepts to internal states
   ◮ learn functions mapping states to actions
   ◮ learn the agent model itself
   ◮ learn probabilities
   ◮ learn utilities of internal states or actions
   Any agent component with a representation, prior knowledge of the representation, and a way to update the representation using feedback can use learning methods.

4. Categorization of Learning
   The most basic distinction in learning is the difference between
   ◮ Deductive Learning
   ◮ Inductive Learning
   Within inductive learning there is
   ◮ unsupervised learning
   ◮ reinforcement learning
   ◮ supervised learning

5. Supervised Learning
   Supervised learning is conceptually very simple, but has many practical and subtle issues.
   ◮ Given a training set consisting of examples
         D = {(x_1, y_1), (x_2, y_2), ..., (x_n, y_n)}
     where each example obeys y_i = f(x_i) for some unknown function f(·).
   ◮ Find a function, the hypothesis h(·),
         y = h(x)
     that approximates the true f.

6. The quality of the approximation is measured using the Test Set
         T = {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)}
   where m < n and T ∩ D = ∅.
   ◮ Collecting training and testing sets is often hard and expensive.
   ◮ An h that performs well on the test set is said to generalize well.
   ◮ An h that performs well on the training set (said to be consistent) but poorly on the test set is said to be over-trained.
   Note the test set is independent of the training set!
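
   A minimal Python sketch of this train/test workflow. The synthetic data, the
   threshold hypothesis space, and the fitting procedure are illustrative
   assumptions, not part of the lecture.

       import random

       random.seed(0)

       def f(x):
           # the unknown target function (here a simple threshold, for illustration)
           return 1 if x > 0.5 else 0

       examples = [(x, f(x)) for x in (random.random() for _ in range(100))]
       D, T = examples[:80], examples[80:]   # disjoint training and test sets

       def fit_threshold(train):
           # pick the threshold t from a tiny hypothesis space H minimizing training error
           def train_error(t):
               return sum((1 if x > t else 0) != y for x, y in train)
           return min((x for x, _ in train), key=train_error)

       t = fit_threshold(D)
       h = lambda x: 1 if x > t else 0   # the learned hypothesis

       test_error = sum(h(x) != y for x, y in T) / len(T)
       print(f"learned t = {t:.3f}, test error rate = {test_error:.2f}")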

7. Some Nomenclature
   ◮ When y is finite with a categorical interpretation, this is a classification problem.
   ◮ If y is binary, it is a binary classification problem.
   ◮ If y is continuous, then it is a regression problem.

8. Hypothesis Space
   In y = h(x), h is a hypothesis in some space of functions H.
   ◮ The goal is to find a consistent h with the smallest testing error and the simplest representation (Ockham's Razor).
   ◮ If we restrict the space H, then it may be that no h can be found which approximates f sufficiently (unrealizable).
   ◮ The complexity/expressiveness of H and the generalization of h ∈ H are related through the bias-variance dilemma, illustrated below.
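
   A sketch of the bias-variance trade-off: as the hypothesis space H grows
   (higher polynomial degree), training error falls while test error typically
   rises once the degree outstrips the data. The synthetic data and the chosen
   degrees are illustrative assumptions.

       import numpy as np

       rng = np.random.default_rng(0)
       x = rng.uniform(-1, 1, 40)
       y = np.sin(3 * x) + rng.normal(0, 0.2, x.size)   # noisy samples of an unknown f
       x_tr, y_tr, x_te, y_te = x[:30], y[:30], x[30:], y[30:]

       for degree in (1, 3, 9, 15):
           coeffs = np.polyfit(x_tr, y_tr, degree)   # least-squares fit within H_degree
           def mse(xs, ys):
               return np.mean((np.polyval(coeffs, xs) - ys) ** 2)
           print(f"degree {degree:2d}: train MSE {mse(x_tr, y_tr):.3f}, test MSE {mse(x_te, y_te):.3f}")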

9. Bayesian analysis gives us a useful framework for supervised learning.
   ◮ Let h ∈ H be parameterized by θ and the training data be given by D; then the posterior of the parameters is
         p(θ | D, h) = p(D | θ, h) p(θ | h) / p(D | h)
   ◮ The posterior of the model is the evidence for h:
         p(h | D) = p(D | h) p(h) / p(D)
     where the denominator integrates over all models in H.

10. Bayesian analysis gives us a useful framework for supervised learning.
    ◮ The maximum likelihood model ignores the prior over models,
          h_ML = argmax_h p(D | h),
      and is the model with the most evidence.
    ◮ The maximum a-posteriori (MAP) model includes the prior over models,
          h_MAP = argmax_h p(h | D) = argmax_h p(D | h) p(h),
      where the denominator p(D) is common to all models and so irrelevant to the model selection.
    We can also average models by choosing the top models rather than a single model. This is particularly useful in binary classification, where the models can simply vote on the final classifier output.
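
    A minimal sketch of maximum-likelihood vs. MAP selection over a discrete
    hypothesis space. The coin-bias models, their priors, and the observed data
    are illustrative assumptions.

        import math

        H = {"fair": 0.5, "biased": 0.7, "very_biased": 0.9}         # h -> P(heads)
        prior = {"fair": 0.80, "biased": 0.15, "very_biased": 0.05}  # p(h)
        D = [1, 1, 0, 1, 1, 1, 0, 1]                                 # observed flips (1 = heads)

        def log_likelihood(h):
            # log p(D | h) for IID Bernoulli flips
            q = H[h]
            return sum(math.log(q if d else 1 - q) for d in D)

        h_ml = max(H, key=log_likelihood)                                     # argmax_h p(D|h)
        h_map = max(H, key=lambda h: log_likelihood(h) + math.log(prior[h]))  # argmax_h p(D|h)p(h)
        print("ML model: ", h_ml)    # "biased": the best fit to the data alone
        print("MAP model:", h_map)   # "fair": the strong prior wins here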

11. Utility of models
    ◮ We assume the true f(x) is stationary and samples are IID.
    ◮ The error rate is the proportion of incorrect classifications.
    ◮ Note the error rate may be misleading, since it makes no distinction about utility differences.
    Example: a binary classifier has 4 cases: TP, FP, TN, FN.
    ◮ The cost of a FP and a FN may not be the same.
    ◮ This is accounted for via a utility/loss function, as sketched below.
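
    A sketch of scoring a binary classifier with an asymmetric loss instead of
    the raw error rate. The cost table and the predictions are illustrative
    assumptions (here a false negative costs ten times a false positive).

        COST = {"TP": 0, "TN": 0, "FP": 1, "FN": 10}   # assumed loss per outcome

        def outcome(pred, truth):
            if pred and truth:
                return "TP"
            if pred and not truth:
                return "FP"
            if truth:
                return "FN"
            return "TN"

        preds  = [1, 0, 1, 1, 0, 0, 1, 0]
        truths = [1, 0, 0, 1, 1, 0, 1, 0]
        outcomes = [outcome(p, t) for p, t in zip(preds, truths)]

        error_rate = sum(o in ("FP", "FN") for o in outcomes) / len(outcomes)
        avg_loss = sum(COST[o] for o in outcomes) / len(outcomes)
        print(f"error rate = {error_rate:.2f}, average loss = {avg_loss:.2f}")
        # the single FN dominates the loss even though the error rate counts it
        # the same as the single FP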

12. Sources of Model Error
    ◮ The estimated h may differ from the true f because
      1. the space H is overly restrictive (unrealizable)
      2. the variance is large (high degrees of freedom)
      3. f itself may be non-deterministic (noisy)
      4. f is "too complex"
    ◮ Most of Machine Learning has been focused on 1 and 2.
    ◮ A large open area in machine learning now is 4, "learning in the large" (e.g. neuroscience, bioinformatics, sociology, networks).

13. An example learning method: Decision Trees
    Consider a simple reflex agent that reasons by testing a series of attribute = value pairs.
    ◮ Let x be a vector of attributes.
    ◮ Let y be a +/− or 0/1 assignment for a Goal (a binary classifier).
    ◮ Given D = {(x_i, y_i)} for i = 1, ..., N, build the tree of decisions formed by testing the attributes of x individually.

14. Implementing the importance function
    The idea is that we want to select the attribute that maximizes our "surprise".
    ◮ The entropy of a R.V. V with values v_k measures its uncertainty, in bits:
          H(V) = −Σ_k p(v_k) log2 p(v_k)
    ◮ For a Boolean R.V. with probability of true = q, the entropy is
          B(q) = −(q log2 q + (1 − q) log2(1 − q))
      where q ≈ p/(p + n) for p positive and n negative samples.
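
    The two entropy formulas above transcribe directly into Python; the example
    counts at the end are an illustrative assumption.

        from math import log2

        def entropy(probs):
            # H(V) = -sum_k p(v_k) log2 p(v_k), in bits; 0 log 0 is taken as 0
            return -sum(p * log2(p) for p in probs if p > 0)

        def B(q):
            # entropy of a Boolean R.V. with P(true) = q
            return entropy([q, 1 - q])

        p, n = 6, 6                # 6 positive and 6 negative examples
        print(B(p / (p + n)))      # 1.0 bit: a 50/50 split is maximally uncertain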

15. Implementing the importance function
    Now suppose we choose attribute A from x.
    ◮ For each of the d possible values of A, we divide the training set into subsets with p_k positive and n_k negative examples.
    ◮ After testing A, the remaining entropy is
          remainder(A) = Σ_{k=1}^{d} (p_k + n_k)/(p + n) · B(p_k/(p_k + n_k))
    ◮ The information gain associated with selecting A is then
          gain(A) = B(p/(p + n)) − remainder(A)
    We choose the attribute with the highest gain in information, as in the sketch below.
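
    A sketch of the gain computation and the greedy attribute choice. Each
    attribute is summarized by its list of per-value (p_k, n_k) counts; the
    attribute names and counts are illustrative assumptions. B is the Boolean
    entropy from the previous sketch, repeated here for self-containment.

        from math import log2

        def B(q):
            # Boolean entropy, as on the previous slide; pure splits give 0 bits
            return 0.0 if q in (0.0, 1.0) else -(q * log2(q) + (1 - q) * log2(1 - q))

        def gain(splits):
            # splits: list of (p_k, n_k) counts, one pair per value of attribute A
            p = sum(pk for pk, _ in splits)
            n = sum(nk for _, nk in splits)
            remainder = sum((pk + nk) / (p + n) * B(pk / (pk + nk)) for pk, nk in splits)
            return B(p / (p + n)) - remainder

        attributes = {
            "Patrons": [(4, 2), (2, 4)],   # mixed subsets: a little information
            "Type":    [(3, 3), (3, 3)],   # uninformative: gain is 0 bits
            "Hungry":  [(6, 0), (0, 6)],   # pure subsets: maximal gain, 1 bit
        }
        best = max(attributes, key=lambda a: gain(attributes[a]))
        print("split on:", best)   # -> Hungry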

16. Next Actions
    ◮ Reading on Learning Theory (AIAMA 18.4-18.5)
    ◮ No warmup.
    Reminders:
    ◮ Quiz 3 will be Thursday 4/12.
    ◮ PS 3 is due tonight.
