CS420, Machine Learning, Lecture 10
Learning Theory and Model Selection
Weinan Zhang, Shanghai Jiao Tong University
http://wnzhang.net
http://wnzhang.net/teaching/cs420/index.html
Content • Learning Theory • Bias-Variance Decomposition • Finite Hypothesis Space ERM Bound • Infinite Hypothesis Space ERM Bound • VC Dimension • Model Selection • Cross Validation • Feature Selection • Occam’s Razor for Bayesian Model Selection
Learning Theory
• Theorems that characterize classes of learning problems or specific algorithms in terms of computational complexity or sample complexity
• i.e. the number of training examples necessary or sufficient to learn hypotheses of a given accuracy

$$\epsilon(d, N, \delta) = \sqrt{\frac{1}{2N}\Big(\log d + \log\frac{1}{\delta}\Big)}$$

where $\epsilon$ is the error, $N$ the number of training samples, $d$ the size of the hypothesis space, and $1-\delta$ the probability of correctness.
Learning Theory
• Complexity of a learning problem depends on:
  • Size or expressiveness of the hypothesis space
  • Accuracy to which the target concept must be approximated
  • Probability with which the learner must produce a successful hypothesis
  • Manner in which training examples are presented, e.g. randomly or by query to an oracle

$$\epsilon(d, N, \delta) = \sqrt{\frac{1}{2N}\Big(\log d + \log\frac{1}{\delta}\Big)}$$

where $\epsilon$ is the error, $N$ the number of training samples, $d$ the size of the hypothesis space, and $1-\delta$ the probability of correctness.
Model Selection
• Which model is the best?
  • Linear model: underfitting
  • 4th-order model: well fitting
  • 15th-order model: overfitting
• Underfitting occurs when a statistical model or machine learning algorithm cannot capture the underlying trend of the data.
• Overfitting occurs when a statistical model describes random error or noise instead of the underlying relationship.
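The underfitting/overfitting behavior of the three model orders above can be reproduced numerically. A minimal sketch, assuming a hypothetical noisy sine target (not the lecture's actual data), comparing training and test error of least-squares polynomial fits:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical target: a smooth function plus Gaussian noise
def target(x):
    return np.sin(2 * np.pi * x)

x_train = rng.uniform(0, 1, 20)
y_train = target(x_train) + rng.normal(0, 0.2, x_train.size)
x_test = np.linspace(0, 1, 200)
y_test = target(x_test) + rng.normal(0, 0.2, x_test.size)

results = {}
for degree in (1, 4, 15):
    coefs = np.polyfit(x_train, y_train, degree)  # least-squares polynomial fit
    train_mse = np.mean((np.polyval(coefs, x_train) - y_train) ** 2)
    test_mse = np.mean((np.polyval(coefs, x_test) - y_test) ** 2)
    results[degree] = (train_mse, test_mse)
    print(f"degree {degree:2d}: train MSE {train_mse:.3f}, test MSE {test_mse:.3f}")
```

Training error keeps shrinking as the degree grows, while test error is smallest for the intermediate-order model, matching the underfitting/well-fitting/overfitting picture on the slide.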
Regularization
• Add a penalty term on the parameters to prevent the model from overfitting the data

$$\min_{\theta} \frac{1}{N}\sum_{i=1}^{N} L(y_i, f_\theta(x_i)) + \lambda \Omega(\theta)$$
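For squared loss and $\Omega(\theta) = \|\theta\|^2$, the penalized objective above is ridge regression and has a closed-form minimizer. A minimal sketch on toy data (the data and $\lambda$ value are illustrative choices, not from the lecture):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy regression data: 50 samples, 10 features
X = rng.normal(size=(50, 10))
w_true = rng.normal(size=10)
y = X @ w_true + rng.normal(0, 0.5, 50)

def ridge_fit(X, y, lam):
    """Minimize (1/N)||y - Xw||^2 + lam * ||w||^2 in closed form."""
    n, d = X.shape
    return np.linalg.solve(X.T @ X / n + lam * np.eye(d), X.T @ y / n)

w_unreg = ridge_fit(X, y, 0.0)   # plain least squares
w_reg = ridge_fit(X, y, 1.0)     # penalized solution
print(np.linalg.norm(w_unreg), np.linalg.norm(w_reg))
```

The penalty shrinks the parameter norm, which is exactly how it limits the model's capacity to fit noise.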
Bias-Variance Decomposition
Bias-Variance Decomposition
• Assume $Y = f(X) + \epsilon$ where $\epsilon \sim N(0, \sigma_\epsilon^2)$
• Then the expected prediction error at an input point $x_0$ is

$$
\begin{aligned}
\mathrm{Err}(x_0) &= E[(Y - \hat{f}(X))^2 \mid X = x_0] \\
&= E[(\epsilon + f(x_0) - \hat{f}(x_0))^2] \\
&= E[\epsilon^2] + \underbrace{E[2\epsilon(f(x_0) - \hat{f}(x_0))]}_{=0} + E[(f(x_0) - \hat{f}(x_0))^2] \\
&= \sigma_\epsilon^2 + E[(f(x_0) - E[\hat{f}(x_0)] + E[\hat{f}(x_0)] - \hat{f}(x_0))^2] \\
&= \sigma_\epsilon^2 + E[(f(x_0) - E[\hat{f}(x_0)])^2] + E[(E[\hat{f}(x_0)] - \hat{f}(x_0))^2] - \underbrace{2E[(f(x_0) - E[\hat{f}(x_0)])(E[\hat{f}(x_0)] - \hat{f}(x_0))]}_{=0} \\
&= \sigma_\epsilon^2 + (E[\hat{f}(x_0)] - f(x_0))^2 + E[(\hat{f}(x_0) - E[\hat{f}(x_0)])^2] \\
&= \sigma_\epsilon^2 + \mathrm{Bias}^2(\hat{f}(x_0)) + \mathrm{Var}(\hat{f}(x_0))
\end{aligned}
$$
Bias-Variance Decomposition
• Assume $Y = f(X) + \epsilon$ where $\epsilon \sim N(0, \sigma_\epsilon^2)$
• Then the expected prediction error at an input point $x_0$ is

$$\mathrm{Err}(x_0) = \sigma_\epsilon^2 + (E[\hat{f}(x_0)] - f(x_0))^2 + E[(\hat{f}(x_0) - E[\hat{f}(x_0)])^2] = \sigma_\epsilon^2 + \mathrm{Bias}^2(\hat{f}(x_0)) + \mathrm{Var}(\hat{f}(x_0))$$

• $\sigma_\epsilon^2$: observation noise (irreducible error)
• $\mathrm{Bias}^2$: how far the expected prediction is from the truth
• $\mathrm{Var}$: how uncertain the prediction is (given different training settings, e.g. data and initialization)
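The decomposition can be verified by Monte Carlo: repeatedly draw training sets, refit the model, and compare the directly estimated prediction error at $x_0$ with $\sigma_\epsilon^2 + \mathrm{Bias}^2 + \mathrm{Var}$. A sketch with an illustrative sine target and cubic-polynomial learner (both hypothetical choices):

```python
import numpy as np

rng = np.random.default_rng(2)
sigma = 0.3                            # noise std, so sigma**2 is the irreducible error
f = lambda x: np.sin(2 * np.pi * x)    # true regression function (illustrative)
x0 = 0.5                               # query point

# Repeatedly draw training sets, fit a degree-3 polynomial, predict at x0
preds = []
for _ in range(2000):
    x = rng.uniform(0, 1, 30)
    y = f(x) + rng.normal(0, sigma, 30)
    coefs = np.polyfit(x, y, 3)
    preds.append(np.polyval(coefs, x0))
preds = np.array(preds)

bias2 = (preds.mean() - f(x0)) ** 2    # squared distance of the average prediction from truth
var = preds.var()                      # spread of predictions across training sets

# Direct Monte Carlo estimate of Err(x0) = E[(Y - fhat(x0))^2 | X = x0]
y0 = f(x0) + rng.normal(0, sigma, preds.size)
err = np.mean((y0 - preds) ** 2)

print(f"sigma^2 + bias^2 + var = {sigma**2 + bias2 + var:.4f}, Err(x0) = {err:.4f}")
```

Up to Monte Carlo noise, the two quantities agree, mirroring the derivation on the slide.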
Illustration of Bias-Variance
[Figure: fits $\hat{f}(x)$ to the truth $f(x)$ under varying regularization; high regularization gives high bias and low variance, low regularization gives low bias and high variance.]
Figures provided by Max Welling
Illustration of Bias-Variance
• Training error measures bias, but ignores variance.
• Testing error / cross-validation error measures both bias and variance.
Figures provided by Max Welling
Bias-Variance Decomposition
• Schematic of the behavior of bias and variance
[Figure: the truth lies in the full MODEL SPACE; the closest fit in population differs from the truth by the model bias, the regularized fit in the RESTRICTED MODEL SPACE adds an estimation bias, and realizations of the closest fit scatter around it with estimation variance.]
Slide credit: Liqing Zhang
Hypothesis Space ERM Bound
• Empirical Risk Minimization
• Finite Hypothesis Space
• Infinite Hypothesis Space
Machine Learning Process
[Diagram: Raw Data → Data Formalization → Training Data → Model → Evaluation; a separate stream of Raw Data becomes Test Data for evaluation.]
• After selecting 'good' hyperparameters, we train the model over the whole training data, and the model can then be used on test data.
Generalization Ability
• Generalization ability is the model's prediction capacity on unobserved data
• It can be evaluated by the generalization error, defined as

$$R(f) = E[L(Y, f(X))] = \int_{\mathcal{X} \times \mathcal{Y}} L(y, f(x))\, p(x, y)\, dx\, dy$$

• where $p(x, y)$ is the underlying (typically unknown) joint data distribution
• Its empirical estimate on a training dataset is

$$\hat{R}(f) = \frac{1}{N}\sum_{i=1}^{N} L(y_i, f(x_i))$$
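The empirical risk $\hat{R}(f)$ is just an average loss over a sample. A minimal sketch with a hypothetical threshold classifier and two common losses (all names and data here are illustrative):

```python
import numpy as np

# Empirical risk: the sample average of the loss approximates R(f)
def empirical_risk(loss, f, xs, ys):
    return np.mean([loss(y, f(x)) for x, y in zip(xs, ys)])

squared_loss = lambda y, yhat: (y - yhat) ** 2
zero_one_loss = lambda y, yhat: float(y != yhat)

# Tiny hypothetical example: a threshold classifier on 1-d inputs
f = lambda x: int(x > 0.5)
xs = np.array([0.1, 0.4, 0.6, 0.9])
ys = np.array([0, 1, 1, 1])
print(empirical_risk(zero_one_loss, f, xs, ys))  # → 0.25 (one mistake out of four)
```

With the zero-one loss, $\hat{R}(f)$ is exactly the training error rate used in the finite-hypothesis analysis on the following slides.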
A Simple Case Study on Generalization Error
• Finite hypothesis set $\mathcal{F} = \{f_1, f_2, \ldots, f_d\}$
• Theorem of generalization error bound: for any function $f \in \mathcal{F}$, with probability no less than $1 - \delta$, it satisfies

$$R(f) \le \hat{R}(f) + \epsilon(d, N, \delta)$$

where

$$\epsilon(d, N, \delta) = \sqrt{\frac{1}{2N}\Big(\log d + \log\frac{1}{\delta}\Big)}$$

• $N$: number of training instances
• $d$: number of functions in the hypothesis set
Section 1.7 in Dr. Hang Li's textbook.
Lemma: Hoeffding Inequality
Let $X_1, X_2, \ldots, X_N$ be independent bounded random variables with $X_i \in [a, b]$, and let the average variable be

$$Z = \frac{1}{N}\sum_{i=1}^{N} X_i$$

Then the following inequalities hold:

$$P(Z - E[Z] \ge t) \le \exp\left(\frac{-2Nt^2}{(b-a)^2}\right)$$
$$P(E[Z] - Z \ge t) \le \exp\left(\frac{-2Nt^2}{(b-a)^2}\right)$$

http://cs229.stanford.edu/extra-notes/hoeffding.pdf
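The inequality can be checked by simulation: draw many sample averages and compare the empirical tail probability with the bound. A sketch using uniform variables (an illustrative choice, so $a=0$, $b=1$):

```python
import numpy as np

rng = np.random.default_rng(3)
N, t, trials = 100, 0.1, 20000

# X_i ~ Uniform[0, 1], so a = 0, b = 1 and E[Z] = 0.5
Z = rng.uniform(0, 1, (trials, N)).mean(axis=1)

empirical = np.mean(Z - 0.5 >= t)                  # observed tail frequency
bound = np.exp(-2 * N * t**2 / (1 - 0)**2)         # Hoeffding upper bound
print(f"P(Z - E[Z] >= {t}) ~ {empirical:.5f} <= bound {bound:.4f}")
```

The empirical tail probability sits well below the bound, as expected: Hoeffding is distribution-free and therefore loose for any particular distribution.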
Proof of Generalization Error Bound
• For binary classification, the error rate satisfies $0 \le R(f) \le 1$
• Based on the Hoeffding inequality, for any $\epsilon > 0$ we have

$$P(R(f) - \hat{R}(f) \ge \epsilon) \le \exp(-2N\epsilon^2)$$

• As $\mathcal{F} = \{f_1, f_2, \ldots, f_d\}$ is a finite set, a union bound gives

$$P(\exists f \in \mathcal{F} : R(f) - \hat{R}(f) \ge \epsilon) = P\Big(\bigcup_{f \in \mathcal{F}} \{R(f) - \hat{R}(f) \ge \epsilon\}\Big) \le \sum_{f \in \mathcal{F}} P(R(f) - \hat{R}(f) \ge \epsilon) \le d \exp(-2N\epsilon^2)$$
Proof of Generalization Error Bound
• Equivalent statements:

$$P(\exists f \in \mathcal{F} : R(f) - \hat{R}(f) \ge \epsilon) \le d \exp(-2N\epsilon^2)$$
$$\Updownarrow$$
$$P(\forall f \in \mathcal{F} : R(f) - \hat{R}(f) < \epsilon) \ge 1 - d \exp(-2N\epsilon^2)$$

• Then setting $\delta = d \exp(-2N\epsilon^2)$, i.e. $\epsilon = \sqrt{\frac{1}{2N}\log\frac{d}{\delta}}$, the generalization error is bounded with probability

$$P(R(f) < \hat{R}(f) + \epsilon) \ge 1 - \delta$$
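The resulting bound $\epsilon(d, N, \delta)$ is easy to evaluate numerically. A minimal sketch (the particular values of $d$, $N$, and $\delta$ are illustrative):

```python
import math

def erm_bound(d, N, delta):
    """epsilon(d, N, delta) = sqrt((log d + log(1/delta)) / (2N))"""
    return math.sqrt((math.log(d) + math.log(1 / delta)) / (2 * N))

# Example: 1000 hypotheses, 10000 samples, 95% confidence
eps = erm_bound(d=1000, N=10000, delta=0.05)
print(f"R(f) <= R_hat(f) + {eps:.4f} with probability >= 0.95")
```

Note the scaling: the gap shrinks as $1/\sqrt{N}$ in the sample size but only grows logarithmically in the number of hypotheses $d$, which is why even large finite hypothesis sets remain learnable.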
For Infinite Hypothesis Space
• Many hypothesis classes, including any class parameterized by real numbers, actually contain an infinite number of functions
• E.g., linear models and neural networks:

$$f(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2$$
$$f(x) = \sigma(W_3(W_2 \tanh(W_1 x + b_1) + b_2) + b_3)$$