  1. CS420, Machine Learning, Lecture 10: Learning Theory and Model Selection. Weinan Zhang, Shanghai Jiao Tong University. http://wnzhang.net http://wnzhang.net/teaching/cs420/index.html

  2. Content
  • Learning Theory
    • Bias-Variance Decomposition
    • Finite Hypothesis Space ERM Bound
    • Infinite Hypothesis Space ERM Bound
    • VC Dimension
  • Model Selection
    • Cross Validation
    • Feature Selection
    • Occam’s Razor for Bayesian Model Selection

  3. Learning Theory
  • Theorems that characterize classes of learning problems or specific algorithms in terms of computational complexity or sample complexity
    • i.e., the number of training examples necessary or sufficient to learn hypotheses of a given accuracy

  $$\epsilon(d, N, \delta) = \sqrt{\frac{1}{2N}\Big(\log d + \log\frac{1}{\delta}\Big)}$$

  where $\epsilon$ is the error bound, $N$ the number of training samples, $d$ the size of the hypothesis space, and $1 - \delta$ the probability of correctness.

  4. Learning Theory
  • Complexity of a learning problem depends on:
    • Size or expressiveness of the hypothesis space
    • Accuracy to which the target concept must be approximated
    • Probability with which the learner must produce a successful hypothesis
    • Manner in which training examples are presented, e.g. randomly or by query to an oracle

  $$\epsilon(d, N, \delta) = \sqrt{\frac{1}{2N}\Big(\log d + \log\frac{1}{\delta}\Big)}$$
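As a quick numeric illustration of the bound above — a minimal sketch, where the particular values of d, N, and δ are assumptions for illustration:

```python
import math

def erm_bound(d, N, delta):
    """Generalization-gap bound for a finite hypothesis space of size d,
    N i.i.d. training samples, and failure probability delta."""
    return math.sqrt((math.log(d) + math.log(1.0 / delta)) / (2.0 * N))

# The gap shrinks as O(1/sqrt(N)) and grows only logarithmically in d:
for N in (100, 1000, 10000):
    print(N, round(erm_bound(d=1000, N=N, delta=0.05), 4))
```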

  5. Model Selection
  • Which model is the best?
    • Linear model: underfitting
    • 4th-order model: well fitting
    • 15th-order model: overfitting
  • Underfitting occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data.
  • Overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship.
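The under/overfitting pattern can be reproduced in a few lines; this is a minimal sketch, where the sine target, noise level, and sample size are all assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = np.sort(rng.uniform(0, 1, 20))
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.2, 20)  # noisy training data

x_test = np.linspace(0, 1, 200)
y_test = np.sin(2 * np.pi * x_test)                 # noise-free truth

for degree in (1, 4, 15):                  # underfit, good fit, overfit
    coeffs = np.polyfit(x, y, degree)      # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coeffs, x) - y) ** 2)
    test_mse = np.mean((np.polyval(coeffs, x_test) - y_test) ** 2)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

The degree-1 fit typically has high error everywhere (underfitting), while the degree-15 fit drives the training error near zero but pushes the test error back up (overfitting).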

  6. Regularization
  • Add a penalty term on the parameters to prevent the model from overfitting the data:

  $$\min_\theta \frac{1}{N} \sum_{i=1}^{N} L(y_i, f_\theta(x_i)) + \lambda\, \Omega(\theta)$$
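For squared-error loss and an L2 penalty $\Omega(\theta) = \|\theta\|^2$, this objective has a closed-form minimizer (ridge regression); a minimal sketch, with the choice of loss and penalty assumed:

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Minimize (1/N)||X @ theta - y||^2 + lam * ||theta||^2.
    Setting the gradient to zero gives (X^T X + lam*N*I) theta = X^T y."""
    N, D = X.shape
    return np.linalg.solve(X.T @ X + lam * N * np.eye(D), X.T @ y)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(0, 0.1, 50)
print(ridge_fit(X, y, lam=0.1))   # shrunk toward zero relative to lam=0
```

Larger λ shrinks the parameters harder, trading variance for bias.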

  7. Content
  • Learning Theory
    • Bias-Variance Decomposition
    • Finite Hypothesis Space ERM Bound
    • Infinite Hypothesis Space ERM Bound
    • VC Dimension
  • Model Selection
    • Cross Validation
    • Feature Selection
    • Occam’s Razor for Bayesian Model Selection

  8. Bias-Variance Decomposition

  9. Bias-Variance Decomposition
  • Assume $Y = f(X) + \epsilon$, where $\epsilon \sim N(0, \sigma_\epsilon^2)$
  • Then the expected prediction error at an input point $x_0$ is

  $$\begin{aligned}
  \mathrm{Err}(x_0) &= E[(Y - \hat{f}(X))^2 \mid X = x_0] \\
  &= E[(\epsilon + f(x_0) - \hat{f}(x_0))^2] \\
  &= E[\epsilon^2] + \underbrace{E[2\epsilon(f(x_0) - \hat{f}(x_0))]}_{=0} + E[(f(x_0) - \hat{f}(x_0))^2] \\
  &= \sigma_\epsilon^2 + E\big[(f(x_0) - E[\hat{f}(x_0)] + E[\hat{f}(x_0)] - \hat{f}(x_0))^2\big] \\
  &= \sigma_\epsilon^2 + E[(f(x_0) - E[\hat{f}(x_0)])^2] + E[(E[\hat{f}(x_0)] - \hat{f}(x_0))^2] + \underbrace{2\,E\big[(f(x_0) - E[\hat{f}(x_0)])(E[\hat{f}(x_0)] - \hat{f}(x_0))\big]}_{=0} \\
  &= \sigma_\epsilon^2 + (E[\hat{f}(x_0)] - f(x_0))^2 + E[(\hat{f}(x_0) - E[\hat{f}(x_0)])^2] \\
  &= \sigma_\epsilon^2 + \mathrm{Bias}^2(\hat{f}(x_0)) + \mathrm{Var}(\hat{f}(x_0))
  \end{aligned}$$

  • The first cross term vanishes because $\epsilon$ has zero mean and is independent of $f(x_0) - \hat{f}(x_0)$; the second vanishes because $f(x_0) - E[\hat{f}(x_0)]$ is a constant and $E[E[\hat{f}(x_0)] - \hat{f}(x_0)] = 0$.

  10. Bias-Variance Decomposition
  • Assume $Y = f(X) + \epsilon$, where $\epsilon \sim N(0, \sigma_\epsilon^2)$
  • Then the expected prediction error at an input point $x_0$ is

  $$\mathrm{Err}(x_0) = \sigma_\epsilon^2 + (E[\hat{f}(x_0)] - f(x_0))^2 + E[(\hat{f}(x_0) - E[\hat{f}(x_0)])^2] = \sigma_\epsilon^2 + \mathrm{Bias}^2(\hat{f}(x_0)) + \mathrm{Var}(\hat{f}(x_0))$$

  • $\sigma_\epsilon^2$: observation noise (irreducible error)
  • $\mathrm{Bias}^2(\hat{f}(x_0))$: how far the expected prediction is from the truth
  • $\mathrm{Var}(\hat{f}(x_0))$: how uncertain the prediction is (given different training settings, e.g. data and initialization)
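The three terms can be estimated empirically by refitting the model on many independent training sets; a minimal simulation sketch, where the true function, noise level, evaluation point, and model degrees are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)   # assumed true function
sigma_eps, x0, runs = 0.3, 0.25, 2000

for degree in (1, 4):                 # low- vs. higher-capacity model
    preds = np.empty(runs)
    for r in range(runs):             # fresh training set each run
        x = rng.uniform(0, 1, 30)
        y = f(x) + rng.normal(0, sigma_eps, 30)
        preds[r] = np.polyval(np.polyfit(x, y, degree), x0)
    bias2, var = (preds.mean() - f(x0)) ** 2, preds.var()
    print(f"degree {degree}: Bias^2={bias2:.4f}  Var={var:.4f}  "
          f"Err(x0) ~ {sigma_eps**2 + bias2 + var:.4f}")
```

The linear model should show large bias and small variance at $x_0$; the 4th-order model small bias and larger variance, matching the decomposition above.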

  11. Illustration of Bias-Variance
  [Figure: fitted curves $\hat{f}(x)$ versus the target $f(x)$ under different regularization strengths; as regularization goes from high to low, bias goes from high to low and variance from low to high. Figures provided by Max Welling.]

  12. Illustration of Bias-Variance
  [Figure: error curves as a function of regularization strength. Figures provided by Max Welling.]
  • Training error measures bias, but ignores variance.
  • Testing error / cross-validation error measures both bias and variance.
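Because cross-validation error reflects both bias and variance, it is the usual criterion for model selection; a minimal K-fold sketch, where the data-generating process and candidate degrees are assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 60)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 60)

def cv_error(x, y, degree, K=5):
    """Average validation MSE over K folds for a degree-`degree` polynomial."""
    folds = np.array_split(rng.permutation(len(x)), K)
    errs = []
    for k in range(K):
        val = folds[k]
        trn = np.concatenate([folds[j] for j in range(K) if j != k])
        coeffs = np.polyfit(x[trn], y[trn], degree)
        errs.append(np.mean((np.polyval(coeffs, x[val]) - y[val]) ** 2))
    return np.mean(errs)

for degree in (1, 4, 15):
    print(f"degree {degree:2d}: 5-fold CV MSE = {cv_error(x, y, degree):.3f}")
```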

  13. Bias-Variance Decomposition
  • Schematic of the behavior of bias and variance
  [Figure: the truth, its closest fit in population, the closest fit on a finite realization, and the regularized fit, drawn within the model space and a restricted model space; the gaps between them illustrate model bias, estimation bias, and estimation variance. Slide credit: Liqing Zhang.]

  14. Hypothesis Space ERM Bound
  • Empirical Risk Minimization
  • Finite Hypothesis Space
  • Infinite Hypothesis Space

  15. Machine Learning Process
  [Figure: pipeline from raw data through data formalization to training data and model, with evaluation on held-out raw test data.]
  • After selecting ‘good’ hyperparameters, we train the model over the whole training data, and the model can then be used on test data.
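A minimal sketch of that select-then-retrain step; the split sizes and candidate hyperparameters are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 100)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.3, 100)
trn, val = np.arange(80), np.arange(80, 100)   # hold out 20% for validation

# Pick the polynomial degree with the lowest validation MSE...
best_degree = min(
    (1, 4, 15),
    key=lambda d: np.mean(
        (np.polyval(np.polyfit(x[trn], y[trn], d), x[val]) - y[val]) ** 2
    ),
)
# ...then retrain on the whole training data before touching test data.
final_model = np.polyfit(x, y, best_degree)
print("selected degree:", best_degree)
```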

  16. Generalization Ability
  • Generalization ability is the model’s prediction capacity on unobserved data
  • It can be evaluated by the generalization error, defined as

  $$R(f) = E[L(Y, f(X))] = \int_{\mathcal{X} \times \mathcal{Y}} L(y, f(x))\, p(x, y)\, dx\, dy$$

  where $p(x, y)$ is the underlying (generally unknown) joint data distribution
  • The empirical estimate of the generalization error on a training dataset is

  $$\hat{R}(f) = \frac{1}{N} \sum_{i=1}^{N} L(y_i, f(x_i))$$
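The empirical risk is just an average of per-example losses; a minimal sketch, with squared-error loss and a toy hypothesis as assumed choices:

```python
import numpy as np

def empirical_risk(f, loss, xs, ys):
    """Average loss of hypothesis f over the training sample."""
    return np.mean([loss(y, f(x)) for x, y in zip(xs, ys)])

squared_loss = lambda y, y_hat: (y - y_hat) ** 2
f = lambda x: 2.0 * x                              # some hypothesis
xs, ys = np.array([0.0, 1.0, 2.0]), np.array([0.1, 2.2, 3.8])
print(empirical_risk(f, squared_loss, xs, ys))     # -> 0.03
```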

  17. A Simple Case Study on Generalization Error
  • Finite hypothesis set $\mathcal{F} = \{f_1, f_2, \ldots, f_d\}$
  • Theorem (generalization error bound): for any function $f \in \mathcal{F}$, with probability no less than $1 - \delta$, it satisfies

  $$R(f) \le \hat{R}(f) + \epsilon(d, N, \delta)$$

  where

  $$\epsilon(d, N, \delta) = \sqrt{\frac{1}{2N}\Big(\log d + \log\frac{1}{\delta}\Big)}$$

  • $N$: number of training instances
  • $d$: number of functions in the hypothesis set
  • See Section 1.7 in Dr. Hang Li’s textbook.

  18. Lemma: Hoeffding Inequality
  Let $X_1, X_2, \ldots, X_N$ be bounded independent random variables, $X_i \in [a, b]$, and let the average variable be

  $$Z = \frac{1}{N} \sum_{i=1}^{N} X_i$$

  Then the following inequalities hold:

  $$P(Z - E[Z] \ge t) \le \exp\left(\frac{-2Nt^2}{(b-a)^2}\right) \qquad P(E[Z] - Z \ge t) \le \exp\left(\frac{-2Nt^2}{(b-a)^2}\right)$$

  http://cs229.stanford.edu/extra-notes/hoeffding.pdf
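A quick Monte-Carlo sanity check of the lemma; the Bernoulli(0.5) variables, N, and t are assumed for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
N, t, trials = 100, 0.1, 100_000

# Averages of N fair coin flips, so X_i in [0, 1] and E[Z] = 0.5.
Z = rng.integers(0, 2, size=(trials, N)).mean(axis=1)
empirical = np.mean(Z - 0.5 >= t)
bound = np.exp(-2 * N * t**2)
print(f"empirical tail {empirical:.4f} <= Hoeffding bound {bound:.4f}")
```

The empirical tail probability should come out well below $\exp(-2Nt^2) \approx 0.135$.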

  19. Proof of the Generalization Error Bound
  • For binary classification, the error rate satisfies $0 \le R(f) \le 1$
  • Based on the Hoeffding inequality (with $[a, b] = [0, 1]$), for any $\epsilon > 0$ we have

  $$P(R(f) - \hat{R}(f) \ge \epsilon) \le \exp(-2N\epsilon^2)$$

  • As $\mathcal{F} = \{f_1, f_2, \ldots, f_d\}$ is a finite set, the union bound gives

  $$P(\exists f \in \mathcal{F}: R(f) - \hat{R}(f) \ge \epsilon) = P\Big(\bigcup_{f \in \mathcal{F}} \{R(f) - \hat{R}(f) \ge \epsilon\}\Big) \le \sum_{f \in \mathcal{F}} P(R(f) - \hat{R}(f) \ge \epsilon) \le d \exp(-2N\epsilon^2)$$

  20. Proof of the Generalization Error Bound
  • Equivalent statements:

  $$P(\exists f \in \mathcal{F}: R(f) - \hat{R}(f) \ge \epsilon) \le d \exp(-2N\epsilon^2)$$
  $$\Updownarrow$$
  $$P(\forall f \in \mathcal{F}: R(f) - \hat{R}(f) < \epsilon) \ge 1 - d \exp(-2N\epsilon^2)$$

  • Then setting

  $$\delta = d \exp(-2N\epsilon^2) \quad \Longleftrightarrow \quad \epsilon = \sqrt{\frac{1}{2N}\Big(\log d + \log\frac{1}{\delta}\Big)}$$

  the generalization error is bounded with probability

  $$P(R(f) < \hat{R}(f) + \epsilon) \ge 1 - \delta$$
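Inverting the same relation gives the sample complexity: the N needed for a target gap ε with confidence 1 − δ. A minimal sketch, where the example numbers are assumptions:

```python
import math

def samples_needed(d, epsilon, delta):
    """Smallest N with sqrt((log d + log(1/delta)) / (2N)) <= epsilon."""
    return math.ceil((math.log(d) + math.log(1.0 / delta)) / (2.0 * epsilon**2))

# e.g. d = 1000 hypotheses, 5% gap, 95% confidence:
print(samples_needed(d=1000, epsilon=0.05, delta=0.05))   # -> 1981
```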

  21. For Infinite Hypothesis Space
  • Many hypothesis classes, including any parameterized by real numbers, actually contain an infinite number of functions
  • E.g., linear models and neural networks:

  $$f(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$$
  $$f(x) = \sigma(W_3(W_2 \tanh(W_1 x + b_1) + b_2) + b_3)$$
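Both hypothesis classes from the slide in code; every real-valued parameter setting is a distinct hypothesis, so each class is infinite (the layer sizes below are assumptions for illustration):

```python
import numpy as np

def linear_model(x, theta):                    # theta in R^3
    return theta[0] + theta[1] * x[0] + theta[2] * x[1]

def neural_net(x, W1, b1, W2, b2, W3, b3):     # tanh hidden layer, sigmoid output
    sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))
    return sigmoid(W3 @ (W2 @ np.tanh(W1 @ x + b1) + b2) + b3)

rng = np.random.default_rng(0)
x = rng.normal(size=2)
print(linear_model(x, theta=np.array([0.5, 1.0, -1.0])))
W1, b1 = rng.normal(size=(4, 2)), rng.normal(size=4)
W2, b2 = rng.normal(size=(3, 4)), rng.normal(size=3)
W3, b3 = rng.normal(size=(1, 3)), rng.normal(size=1)
print(neural_net(x, W1, b1, W2, b2, W3, b3))
```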
