ICANNGA 2001, April 25, 2001.

Slide 1: Model Selection with Small Samples

Masashi Sugiyama and Hidemitsu Ogawa
Department of Computer Science, Tokyo Institute of Technology, Japan
Slide 2: Supervised Learning

From training examples \{(x_m, y_m)\}_{m=1}^{M}, where

    y_m = f(x_m) + \epsilon_m,    \epsilon_m: mean 0, variance \sigma^2,

obtain a learning result \hat{f}(x) that minimizes the generalization error J_G:

    J_G = E \int (\hat{f}(u) - f(u))^2 p(u) du

f(x): target function; E: expectation over the noise; p(u): density of future (test) input points u.
Slide 3: For the Time Being, We Assume...

- The target function f(x) is a linear combination of \mu specified basis functions \{\varphi_i(x)\}_{i=1}^{\mu}:

      f(x) = \sum_{i=1}^{\mu} \theta_i \varphi_i(x)

- The correlation matrix U of the future input u is known:

      U_{ij} = \int \varphi_i(u) \varphi_j(u) p(u) du

Later, we will discuss the case when these assumptions do not hold.
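When p(u) is known in closed form, U can be obtained analytically or by numerical integration. Below is a minimal sketch, assuming a trigonometric basis and a uniform test-input density on [-pi, pi]; both choices, and all names, are illustrative rather than taken from the talk:

```python
import numpy as np

P = 3  # basis {1, sin(pu), cos(pu)} for p = 1..P (illustrative size)
basis = [lambda u: np.ones_like(u)]
for p in range(1, P + 1):
    basis.append(lambda u, p=p: np.sin(p * u))
    basis.append(lambda u, p=p: np.cos(p * u))

u = np.linspace(-np.pi, np.pi, 20001)
du = u[1] - u[0]
p_u = np.full_like(u, 1.0 / (2.0 * np.pi))     # uniform density on [-pi, pi]
Phi = np.stack([b(u) for b in basis], axis=1)  # column i holds phi_i(u)

# Riemann-sum approximation of U_ij = \int phi_i(u) phi_j(u) p(u) du.
U = (Phi * p_u[:, None]).T @ Phi * du
# For this basis and density, U is approximately diag(1, 1/2, 1/2, ..., 1/2).
```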
Slide 4: Subset Regression Models

    \hat{f}_S(x) = \sum_{i \in S} [\hat{\theta}_S]_i \varphi_i(x)

S: subset of the indices \{1, 2, ..., \mu\} (\mu: number of basis functions).

\hat{\theta}_S is determined so that the training error \sum_{m=1}^{M} (\hat{f}_S(x_m) - y_m)^2 is minimized:

    \hat{\theta}_S = X_S y,    X_S = (A_S^T A_S)^{\dagger} A_S^T,

where \dagger denotes the Moore-Penrose generalized inverse (needed since A_S has zero columns), and

    [A_S]_{mi} = \varphi_i(x_m) if i \in S, 0 otherwise,
    y = (y_1, y_2, ..., y_M)^T.
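A minimal sketch of computing \hat{\theta}_S, assuming the trigonometric basis used later in the talk; the toy target, noise level, and subset here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
M = 50                                   # number of training examples
P = 3
mu = 2 * P + 1                           # total number of basis functions

def design(x):
    """Design matrix A with basis {1, sin(px), cos(px)}, p = 1..P."""
    cols = [np.ones_like(x)]
    for p in range(1, P + 1):
        cols.extend([np.sin(p * x), np.cos(p * x)])
    return np.stack(cols, axis=1)        # shape (M, mu)

x = rng.uniform(-np.pi, np.pi, M)
y = np.sin(x) + 0.3 * rng.standard_normal(M)   # toy noisy observations

A = design(x)
S = [0, 1, 2]                            # subset of basis-function indices

# theta_S = X_S y: least squares on the columns in S, zero elsewhere.
theta_S = np.zeros(mu)
theta_S[S] = np.linalg.lstsq(A[:, S], y, rcond=None)[0]
```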
Slide 5: Model Selection

Select the best subset S of basis functions so that the generalization error J_G[S] is minimized:

    J_G[S] = E \int (\hat{f}_S(u) - f(u))^2 p(u) du

However, J_G includes the unknown target function f(x).

We derive an estimate of J_G called the subspace information criterion (SIC), and the model is determined so that SIC is minimized.
Slide 6: Key Idea: Unbiased Estimate

Largest model:

    \hat{f}_u(x) = \sum_{i=1}^{\mu} [\hat{\theta}_u]_i \varphi_i(x)

\hat{\theta}_u: minimum-training-error estimate,

    \hat{\theta}_u = X_u y,    X_u = (A^T A)^{-1} A^T,
    [A]_{mi} = \varphi_i(x_m),    y = (y_1, y_2, ..., y_M)^T.

\hat{\theta}_u is an unbiased estimate of the true parameter \theta:

    E \hat{\theta}_u = \theta    (E: expectation over the noise)

\hat{\theta}_u is used for estimating the generalization error of \hat{\theta}_S.
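A quick Monte Carlo check of this unbiasedness, using a random stand-in design matrix (all constants here are illustrative, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(1)
M, mu = 50, 5
A = rng.standard_normal((M, mu))     # stand-in design matrix [A]_mi = phi_i(x_m)
theta = rng.standard_normal(mu)      # "true" parameter, known here by construction
sigma = 0.5

X_u = np.linalg.pinv(A)              # equals (A^T A)^{-1} A^T when A has full column rank

# Average theta_u_hat over many noise realizations; the mean approaches theta.
est = np.mean(
    [X_u @ (A @ theta + sigma * rng.standard_normal(M)) for _ in range(20000)],
    axis=0,
)
print(np.max(np.abs(est - theta)))   # close to 0: E[theta_u_hat] = theta
```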
Slide 7: Bias / Variance Decomposition

    J_G[S] = E \int (\hat{f}_S(u) - f(u))^2 p(u) du
           = E \| \hat{\theta}_S - \theta \|_U^2
           = \| E\hat{\theta}_S - \theta \|_U^2  +  E \| \hat{\theta}_S - E\hat{\theta}_S \|_U^2
           =        Bias J_B[S]                  +         Variance J_V[S]

where \| \theta \|_U^2 = \theta^T U \theta, U_{ij} = \int \varphi_i(u) \varphi_j(u) p(u) du, and E is the expectation over the noise.
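The last equality holds because the cross term vanishes; written out in the notation above:

```latex
E\|\hat{\theta}_S-\theta\|_U^2
  = E\|(\hat{\theta}_S-E\hat{\theta}_S)+(E\hat{\theta}_S-\theta)\|_U^2
  = E\|\hat{\theta}_S-E\hat{\theta}_S\|_U^2
    + 2\,E(\hat{\theta}_S-E\hat{\theta}_S)^{\top} U\,(E\hat{\theta}_S-\theta)
    + \|E\hat{\theta}_S-\theta\|_U^2
  = J_V[S] + 0 + J_B[S],
% the middle term is zero since E(\hat{\theta}_S - E\hat{\theta}_S) = 0.
```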
Slide 8: Unbiased Estimate of the Variance

    J_V[S] = E \| \hat{\theta}_S - E\hat{\theta}_S \|_U^2 = \sigma^2 trace(U X_S X_S^T)

The noise variance is estimated from the largest model:

    \hat{\sigma}^2 = \| A \hat{\theta}_u - y \|^2 / (M - \mu),    E \hat{\sigma}^2 = \sigma^2

This yields

    \hat{J}_V[S] = \hat{\sigma}^2 trace(U X_S X_S^T),    E \hat{J}_V = J_V

(E: expectation over the noise.)
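A minimal sketch of these two estimates, with a random stand-in design matrix and U = I for brevity (all names and constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
M, mu = 100, 9
A = rng.standard_normal((M, mu))           # stand-in design matrix
y = A @ rng.standard_normal(mu) + 0.3 * rng.standard_normal(M)
U = np.eye(mu)                             # illustrative correlation matrix

theta_u = np.linalg.pinv(A) @ y
sigma2_hat = np.sum((A @ theta_u - y) ** 2) / (M - mu)   # noise-variance estimate

S = [0, 1, 2, 3]
X_S = np.zeros((mu, M))
X_S[S] = np.linalg.pinv(A[:, S])           # X_S y gives theta_S_hat
J_V_hat = sigma2_hat * np.trace(U @ X_S @ X_S.T)
```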
Slide 9: Unbiased Estimate of the Bias

    J_B[S] = \| E\hat{\theta}_S - \theta \|_U^2

Let z = (f(x_1), f(x_2), ..., f(x_M))^T, \epsilon = (\epsilon_1, \epsilon_2, ..., \epsilon_M)^T, and X_0 = X_S - X_u. Since y = z + \epsilon and E\hat{\theta}_u = \theta,

    \hat{\theta}_S - \hat{\theta}_u = X_0 y = X_0 z + X_0 \epsilon,
    E \| \hat{\theta}_S - \hat{\theta}_u \|_U^2 = \| X_0 z \|_U^2 + \sigma^2 trace(U X_0 X_0^T),

and J_B[S] = \| X_0 z \|_U^2. Replacing \sigma^2 by \hat{\sigma}^2 gives

    \hat{J}_B[S] = \| \hat{\theta}_S - \hat{\theta}_u \|_U^2 - \hat{\sigma}^2 trace(U X_0 X_0^T),    E \hat{J}_B = J_B

The key step is the rough estimate: the unbiased \hat{\theta}_u = X_u y is substituted for the unknown \theta when evaluating the bias of \hat{\theta}_S = X_S y.
Slide 10: Subspace Information Criterion (SIC)

    SIC[S] = \hat{J}_B[S] + \hat{J}_V[S]
           = \| \hat{\theta}_S - \hat{\theta}_u \|_U^2
             - \hat{\sigma}^2 trace(U X_0 X_0^T)
             + \hat{\sigma}^2 trace(U X_S X_S^T),

where X_0 = X_S - X_u.

SIC is an unbiased estimate of the generalization error J_G with finite samples:

    E SIC[S] = J_G[S] = E \int (\hat{f}_S(u) - f(u))^2 p(u) du
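Putting slides 8-10 together, here is a minimal end-to-end sketch of computing SIC and selecting among nested subsets. The basis, toy target, noise level, and uniform density are illustrative assumptions; only the SIC formula itself follows the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
M, P = 200, 10
mu = 2 * P + 1                        # largest model: {1, sin px, cos px}, p <= P

def design(x):
    cols = [np.ones_like(x)]
    for p in range(1, P + 1):
        cols.extend([np.sin(p * x), np.cos(p * x)])
    return np.stack(cols, axis=1)

f = lambda t: np.sin(t) + 0.5 * np.cos(2 * t)        # toy target in the span
x = rng.uniform(-np.pi, np.pi, M)
y = f(x) + np.sqrt(0.3) * rng.standard_normal(M)

A = design(x)
# For p(u) uniform on [-pi, pi] this basis is orthogonal: U = diag(1, 1/2, ...).
U = np.diag([1.0] + [0.5] * (mu - 1))

X_u = np.linalg.pinv(A)
theta_u = X_u @ y
sigma2_hat = np.sum((A @ theta_u - y) ** 2) / (M - mu)

def sic(S):
    """SIC[S] = J_B_hat + J_V_hat for the subset S of basis indices."""
    X_S = np.zeros((mu, M))
    X_S[S] = np.linalg.pinv(A[:, S])
    theta_S = X_S @ y
    X_0 = X_S - X_u
    d = theta_S - theta_u
    J_B_hat = d @ U @ d - sigma2_hat * np.trace(U @ X_0 @ X_0.T)
    J_V_hat = sigma2_hat * np.trace(U @ X_S @ X_S.T)
    return J_B_hat + J_V_hat

# Nested candidates S_n = {1, sin px, cos px : p <= n}, n = 0..P.
models = [list(range(2 * n + 1)) for n in range(P + 1)]
best = min(models, key=sic)
print("selected number of basis functions:", len(best))
```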
Slide 11: When the Assumptions Do Not Hold (1)

When the target function f(x) is not included in span\{\varphi_i(x)\}_{i=1}^{\mu}:

    J_G = E \int (\hat{f}_S(u) - f(u))^2 p(u) du
        = E \| \hat{\theta}_S - \theta^* \|_U^2 + const.

\theta^*: best approximation of f in span\{\varphi_i(x)\}_{i=1}^{\mu}.

SIC is an asymptotically unbiased estimate of E \| \hat{\theta}_S - \theta^* \|_U^2:

    E SIC \to E \| \hat{\theta}_S - \theta^* \|_U^2    as M \to \infty

(M: number of training samples.)
Slide 12: When the Assumptions Do Not Hold (2)

When the correlation matrix U_{ij} = \int \varphi_i(u) \varphi_j(u) p(u) du is not available:

- If unlabeled samples \{u_m\}_{m=1}^{M'} (samples without output values) are available, use the empirical estimate, as sketched after this list:

      \hat{U}_{ij} = (1/M') \sum_{m=1}^{M'} \varphi_i(u_m) \varphi_j(u_m)

- If the training samples \{x_m\}_{m=1}^{M} are used instead, SIC essentially agrees with Mallows' C_P.

- Or simply use \hat{U} = I (the identity matrix).
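A minimal sketch of the empirical estimate \hat{U} from unlabeled samples; the basis and the sampling density are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
P = 3

def design(u):
    cols = [np.ones_like(u)]
    for p in range(1, P + 1):
        cols.extend([np.sin(p * u), np.cos(p * u)])
    return np.stack(cols, axis=1)

u = rng.uniform(-np.pi, np.pi, 5000)   # unlabeled inputs: no output values needed
Phi = design(u)
U_hat = Phi.T @ Phi / len(u)           # U_hat_ij = (1/M') sum_m phi_i(u_m) phi_j(u_m)
```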
Slide 13: Computer Simulation

- Target function: f(x) = (1/10) \sum_{p=1}^{50} (\sin px + \cos px)
- x_m: randomly created in [-\pi, \pi]
- y_m = f(x_m) + \epsilon_m, where \epsilon_m is subject to N(0, \sigma^2)
- Basis functions (\mu = 201): \{1, \sin px, \cos px\}_{p=1}^{100}
- Compared models: \{S_0, S_{10}, ..., S_{100}\}, where S_n = \{1, \sin px, \cos px\}_{p=1}^{n}
- Error = (1/2\pi) \int_{-\pi}^{\pi} (\hat{f}(u) - f(u))^2 du

A sketch of this setup is given below.
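An illustrative reimplementation of this data-generating setup; the random seed and code organization are my own, while the constants follow the slide:

```python
import numpy as np

rng = np.random.default_rng(4)
M, sigma2 = 250, 0.6                   # one of the four simulated settings

def f(x):
    # Target: (1/10) * sum_{p=1}^{50} (sin px + cos px)
    return sum(np.sin(p * x) + np.cos(p * x) for p in range(1, 51)) / 10.0

x = rng.uniform(-np.pi, np.pi, M)
y = f(x) + np.sqrt(sigma2) * rng.standard_normal(M)

# Candidate models S_n = {1, sin px, cos px}_{p=1}^n for n = 0, 10, ..., 100.
candidate_n = list(range(0, 101, 10))
```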
Slide 14: Compared Methods

- SIC
- Mallows' C_P
- Leave-one-out cross-validation (CV)
- Akaike's information criterion (AIC)
- Sugiura's corrected AIC (cAIC)
- Schwarz's Bayesian information criterion (BIC)
- Vapnik's measure (VM)

Simulations are performed 100 times for each of

    (M, \sigma^2) = (250, 0.6), (500, 0.6), (250, 0.2), (500, 0.2)

(M: number of training samples; \sigma^2: noise variance.)
Slide 15: Easiest Case (Curve)

[Figure: learned curves for (M, \sigma^2) = (250, 0.6), (500, 0.6), (250, 0.2), (500, 0.2).]
Slide 16: Easiest Case (Error)

[Figure: obtained errors for (M, \sigma^2) = (250, 0.6), (500, 0.6), (250, 0.2), (500, 0.2).]
Slide 17: Hardest Case

[Figure: results for (M, \sigma^2) = (250, 0.6), (500, 0.6), (250, 0.2), (500, 0.2).]
Slide 18: Summary of Simulations

[Table: for each setting of M and \sigma^2, each method (SIC, C_P, CV, AIC, cAIC, BIC, VM) is marked as "works well", "selects smaller models", or "selects larger models".]

(M: number of training samples; \sigma^2: noise variance.)
ICANNGA2001 April 25, 2001. 19 µ ϕ µ ϕ When is not in . ( x ) f L { i x ( )} When is not in . ( x ) f L { i x ( )} = = i 1 i 1 Interpolation of chaotic series Similar results were obtained !!
Slide 20: Conclusions

- We proposed a new model selection criterion called the subspace information criterion (SIC).
- SIC gives an unbiased estimate of the generalization error.
- Computer simulations showed that SIC works well with small samples and large noise.