

1. ECE 4524 Artificial Intelligence and Engineering Applications
Lecture 23: Learning Theory
Reading: AIAMA 18.4-18.5
Today’s Schedule:
◮ Evaluating Hypotheses/Models
◮ PAC Learning and Sample Complexity

2. Assumptions about Training and Testing Sets
Critical assumptions of supervised learning are:
◮ the true f does not change; it is stationary
◮ the samples from f are independent and identically distributed (IID)

3. Error Rate
We define the error rate as the proportion of mistakes made by h over a set of N examples:

\text{Error Rate} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}[y_i \neq h(x_i)]

where \mathbb{1} is the indicator function.
◮ When this error rate is zero over the training set, h is said to be consistent.
◮ It is always possible to find a hypothesis space H complex enough so that some h ∈ H is consistent.
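
A minimal sketch of this computation in Python; the threshold hypothesis and the data below are made up for illustration:

```python
def error_rate(h, examples):
    """Fraction of (x, y) examples on which the hypothesis h disagrees with the label y."""
    mistakes = sum(1 for x, y in examples if h(x) != y)
    return mistakes / len(examples)

# Toy illustration (hypothetical data): a threshold classifier on one feature.
h = lambda x: 1 if x > 0.5 else 0
train = [(0.2, 0), (0.4, 0), (0.6, 1), (0.9, 1), (0.45, 1)]
print(error_rate(h, train))  # 0.2 -- h is not consistent with this training set
```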

4. Test Error Rate
Thus we are more concerned with the test error rate.
◮ A low test error indicates that h generalizes well.
◮ Often a consistent hypothesis has worse generalization than a less complex one.
◮ This trade-off between the complexity of H and the test performance is the core of supervised machine learning.

5. Cross-Validation
◮ So, the test error is the final word on the performance of h, but recall that we can only use the test set once; otherwise we are said to be peeking.
◮ However, if we use the entire training set for training, we will likely over-train.
◮ The answer is to use cross-validation to estimate the generalization performance of h: we partition the training set into a training set and a validation set.
◮ Holdout cross-validation: reserve a percentage (typically 1/3) of D for validation.
◮ k-fold cross-validation: generate k independent subsets of D, giving k estimates of generalization performance.
◮ When k = N this is called leave-one-out cross-validation.
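
A sketch of k-fold cross-validation in plain Python, assuming a `learn` routine that trains and returns a hypothesis, together with the `error_rate` function from slide 3:

```python
import random

def k_fold_cv(data, learn, error_rate, k=5, seed=0):
    """Estimate generalization error by k-fold cross-validation.
    data: list of (x, y) pairs; learn: trains on a list of examples and returns h."""
    data = data[:]                       # copy so shuffling doesn't touch the caller's list
    random.Random(seed).shuffle(data)
    fold_size = len(data) // k
    errors = []
    for i in range(k):
        val = data[i * fold_size:(i + 1) * fold_size]              # held-out validation fold
        train = data[:i * fold_size] + data[(i + 1) * fold_size:]  # remaining folds for training
        h = learn(train)
        errors.append(error_rate(h, val))
    return sum(errors) / k               # average validation error over the k folds

# k = len(data) gives leave-one-out cross-validation;
# holdout cross-validation is the single-split special case (e.g. 1/3 held out).
```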

6. Selecting Hypothesis Complexity
So, to select an optimal h we need a learning algorithm, a way to optimize the parameters over a given set H.
◮ Define the size of H as some parameter which adjusts the complexity of H.
◮ For increasing values of size, use cross-validation and the learning algorithm to estimate the training and validation error.
◮ Stop when h is consistent or the training error has converged.
◮ Search backwards to find the size with the smallest validation error.
◮ Finally, train h at the optimal size using the full training set.
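
A sketch of this model-selection loop, assuming a hypothetical `learn(size, data)` routine that fits a hypothesis at the given complexity, plus the `k_fold_cv` and `error_rate` helpers sketched above:

```python
def select_size(data, learn, error_rate, k=5, max_size=20):
    """Grow the hypothesis-space size, recording cross-validated error at each size,
    then pick the size with the smallest validation error and retrain on all data."""
    results = []
    for size in range(1, max_size + 1):
        cv_err = k_fold_cv(data, lambda d: learn(size, d), error_rate, k)
        train_err = error_rate(learn(size, data), data)
        results.append((size, train_err, cv_err))
        if train_err == 0:               # h is consistent; no point growing H further
            break
    best_size = min(results, key=lambda r: r[2])[0]   # size with smallest validation error
    return learn(best_size, data)        # final hypothesis trained on the full training set
```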

7. Loss Functions
Minimizing the error rate assumes that all errors count equally toward the success of the agent. From our discussion of Utility we know this is not true.
◮ In ML it is traditional to work with a cost rather than a utility, via a loss function:

L(x, y, \hat{y}) = U(\text{result of } y \text{ given } x) - U(\text{result of } \hat{y} \text{ given } x)

where y = f(x) and \hat{y} = h(x).
◮ We often assume the loss does not depend on x, so we simply write L(y, \hat{y}).
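
The slide leaves the particular loss abstract; for concreteness, the standard losses from the reading (0/1, absolute-value, and squared-error) written as small Python functions, with names of my own choosing:

```python
def loss_01(y, y_hat):
    """0/1 loss: 1 for any mistake, 0 otherwise (its average is the error rate)."""
    return 0 if y == y_hat else 1

def loss_abs(y, y_hat):
    """Absolute-value (L1) loss."""
    return abs(y - y_hat)

def loss_sq(y, y_hat):
    """Squared-error (L2) loss."""
    return (y - y_hat) ** 2
```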

8. Empirical Loss
◮ We would like to minimize the expected loss over the validation set:

\sum_{i=1}^{N} L(y_i, h(x_i)) \, P(x_i, y_i)

however, we don't know the joint probability P(x, y).
◮ Instead we assume a uniform distribution and optimize the empirical loss:

\frac{1}{N} \sum_{i=1}^{N} L(y_i, h(x_i))
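
A one-function sketch of the empirical loss under that uniform-weight assumption, reusing the loss functions above:

```python
def empirical_loss(L, h, examples):
    """Average loss of hypothesis h over the examples, weighting each example equally."""
    return sum(L(y, h(x)) for x, y in examples) / len(examples)

# e.g. empirical_loss(loss_01, h, train) reproduces the error rate from slide 3.
```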

9. Probably Approximately Correct Learning
For Boolean functions (binary classifiers) define the generalization error as the expected 0/1 loss:

\text{error}(h) = \sum_{x, y} L_{0/1}(y, h(x)) \, P(x, y)

A hypothesis h is approximately correct if error(h) ≤ ε. If h is consistent with N training examples, then with probability at least 1 − δ it is approximately correct, provided

N \geq \frac{1}{\epsilon} \left( \ln \frac{1}{\delta} + \ln |\mathcal{H}| \right)
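
A quick numerical check of the sample-complexity bound; the ε, δ, and |H| values below are made up for illustration:

```python
from math import ceil, log

def pac_sample_size(epsilon, delta, hypothesis_space_size):
    """Number of examples sufficient for any consistent h to be probably approximately
    correct: error(h) <= epsilon with probability at least 1 - delta."""
    return ceil((log(1 / delta) + log(hypothesis_space_size)) / epsilon)

# For Boolean functions of n attributes, |H| = 2**(2**n), so N grows exponentially in n.
print(pac_sample_size(0.1, 0.05, 2 ** (2 ** 4)))   # n = 4 attributes -> 141 examples
```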

10. Next Actions
◮ Reading on Linear Models (AIAMA 18.6)
◮ No warmup.
Reminders:
◮ Quiz 3 is this Thursday (4/12).
