A New Information Criterion for the Selection of Subspace Models




  1. A New Information Criterion for the Selection of Subspace Models
     Masashi Sugiyama and Hidemitsu Ogawa
     Department of Computer Science, Tokyo Institute of Technology, Japan

  2. Function Approximation
     $f(x)$: target function, $\hat{f}(x)$: learning result.
     $x_m$: sample point, $y_m$: sample value, with $y_m = f(x_m) + n_m$.
     Obtain the optimal approximation $\hat{f}(x)$ to $f(x)$ by using the training examples $\{(x_m, y_m)\}_{m=1}^{M}$.

  3. Model
     Generally, function approximation is performed by estimating the parameters of a prefixed set of functions called a model.
     Polynomial: $\hat{f}(x) = \sum_{n=0}^{N} a_n x^n$
     3-layer neural network: $\hat{f}(x) = \sum_{n=1}^{N} a_n \sigma(x; b_n)$
     The choice of the model complexity (e.g. order of the polynomial, number of units) is crucial for optimal generalization.
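
As a concrete illustration of the two model families above, here is a minimal Python sketch. The unit activation $\sigma(x; b_n)$ is taken to be a logistic sigmoid shifted by $b_n$, which is an assumption of this sketch since the slide does not fix the activation; the function names are hypothetical.

```python
import numpy as np

def poly_model(a):
    """Polynomial model: f_hat(x) = sum_{n=0}^{N} a[n] * x**n."""
    return lambda x: sum(a_n * x ** n for n, a_n in enumerate(a))

def three_layer_model(a, b):
    """3-layer network model: f_hat(x) = sum_{n=1}^{N} a[n] * sigma(x; b[n]).
    sigma(x; b_n) is assumed here to be a logistic sigmoid shifted by b_n."""
    sigmoid = lambda t: 1.0 / (1.0 + np.exp(-t))
    return lambda x: sum(a_n * sigmoid(x - b_n) for a_n, b_n in zip(a, b))

# Example: a cubic polynomial and a 3-unit network evaluated at x = 0.5
f_poly = poly_model([1.0, -0.5, 0.2, 0.1])
f_net = three_layer_model([1.0, -2.0, 0.5], [0.0, 1.0, -1.0])
print(f_poly(0.5), f_net(0.5))
```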

  4. Model Selection
     (Illustration: learning results obtained with a simple model, an appropriate model, and a complex model, plotted against the target function.)
     Select the best model, i.e. the one providing the optimal generalization capability.

  5. Motivation and goal
     Most traditional model selection criteria do not work well when the number of training examples is small,
     e.g. AIC (Akaike, 1974), BIC (Schwarz, 1978), MDL (Rissanen, 1978), NIC (Murata, Yoshizawa, & Amari, 1994).
     Goal: devise a model selection criterion that works well even when the number of training examples is small.

  6. Setting
     $f$: learning target function.
     $S_\theta$: family of functions specified by model $\theta$.
     $\hat{f}_\theta$: learning result function obtained with model $\theta$.
     $H$: Hilbert space including $f$, $S_\theta$, and $\hat{f}_\theta$.
     Select, from a set of models, the model minimizing $E_n \|\hat{f}_\theta - f\|^2$, where $E_n$ denotes the expectation over the noise.

  7. Least mean squares (LMS) learning
     LMS learning is aimed at minimizing the training error $\sum_{m=1}^{M} \left( \hat{f}_\theta(x_m) - y_m \right)^2$.
     The LMS learning result function is given by
     $\hat{f}_\theta = X_\theta y$, where $X_\theta := \left( \sum_{m=1}^{M} e_m \otimes K_\theta(\cdot, x_m) \right)^{+}$.
     Notation: $y = (y_1, y_2, \ldots, y_M)^\top$; ${}^{+}$: Moore-Penrose generalized inverse; $(f \otimes g)$: Neumann-Schatten product, defined by $(f \otimes g) h = \langle h, g \rangle f$; $e_m$: $m$-th standard basis vector of $\mathbb{C}^M$; $K_\theta(x, x')$: reproducing kernel of $S_\theta$.
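
In the common case where $S_\theta$ is spanned by a finite set of explicit basis functions, the operator expression above reduces to a pseudoinverse of the design matrix. Below is a minimal Python sketch under that assumption; the helper names are hypothetical.

```python
import numpy as np

def lms_fit(x, y, basis):
    """LMS learning in S_theta = span(basis).

    x, y  : training inputs and outputs, arrays of shape (M,)
    basis : list of callables phi_i spanning the model S_theta
    Returns the coefficient vector a with f_hat_theta(x) = sum_i a[i] * phi_i(x).
    """
    # Design matrix A[m, i] = phi_i(x_m).  a = A^+ y is the coordinate
    # form of f_hat_theta = X_theta y with the Moore-Penrose inverse.
    A = np.column_stack([phi(x) for phi in basis])
    return np.linalg.pinv(A) @ y

def evaluate(a, basis, x):
    """Evaluate the learned function at the points x."""
    return sum(a_i * phi(x) for a_i, phi in zip(a, basis))
```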

  8. Assumptions (1)
     The mean of the noise is zero.
     The noise covariance matrix is given as $\sigma^2 I$.
     $\sigma^2$ is generally unknown.

  9. Assumptions (2)
     One of the models gives an unbiased learning result $\hat{f}_u$:
     $E_n \hat{f}_u = f$, where $\hat{f}_u := X_u y$.
     If $K_H(x, x') \in \mathrm{span}\{ K_H(x, x_m) \}_{m=1}^{M}$, then $X_u = \left( \sum_{m=1}^{M} e_m \otimes K_H(\cdot, x_m) \right)^{+}$,
     where $K_H(x, x')$ is the reproducing kernel of $H$.
     Roughly speaking, $K_H(x, x') \in \mathrm{span}\{ K_H(x, x_m) \}_{m=1}^{M}$ holds if $M \ge \dim H$, where $M$ is the number of training examples.

  10. Generalization error and bias/variance
      $E_n \|\hat{f}_\theta - f\|^2 = \|E_n \hat{f}_\theta - f\|^2 + E_n \|\hat{f}_\theta - E_n \hat{f}_\theta\|^2$
      (generalization error = bias + variance), where $E_n$ denotes the expectation over the noise.
      (Illustration: $f$, $E_n \hat{f}_\theta$, and $\hat{f}_\theta$, with the bias, variance, and generalization error indicated as distances.)
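
The decomposition can be checked numerically. The following is a small Monte-Carlo sketch with a hypothetical target and a fixed polynomial learner (not the slide's experiment); the mean over a dense grid stands in for the Hilbert-space norm.

```python
import numpy as np

rng = np.random.default_rng(0)
grid = np.linspace(-np.pi, np.pi, 513)                # evaluation grid for the norm
f_true = lambda t: np.sin(t) + 0.5 * np.sin(3 * t)    # hypothetical target
x = np.linspace(-np.pi, np.pi, 30)                    # fixed sample points

fits = []
for _ in range(2000):                                 # repeated noisy trainings
    y = f_true(x) + rng.normal(0.0, 0.5, size=x.size)
    coef = np.polyfit(x, y, deg=5)                    # a fixed, possibly biased model
    fits.append(np.polyval(coef, grid))
fits = np.stack(fits)

gen_error = ((fits - f_true(grid)) ** 2).mean(axis=1).mean()
bias2     = ((fits.mean(axis=0) - f_true(grid)) ** 2).mean()
variance  = ((fits - fits.mean(axis=0)) ** 2).mean(axis=1).mean()
print(gen_error, bias2 + variance)                    # the two agree (generalization error = bias + variance)
```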

  11. Estimation of bias
      $\|E_n \hat{f}_\theta - f\|^2 = \|\hat{f}_\theta - \hat{f}_u\|^2 - \|X_0 n\|^2 - 2\,\mathrm{Re}\,\langle E_n \hat{f}_\theta - f,\; X_0 n \rangle$
      $\approx \|\hat{f}_\theta - \hat{f}_u\|^2 - \sigma^2\,\mathrm{tr}(X_0 X_0^{*})$   (bias estimate),
      where $X_0 = X_\theta - X_u$, $n = (n_1, n_2, \ldots, n_M)^\top$, $\sigma^2$: noise variance, $X^{*}$: adjoint operator of $X$.
      (Illustration: relation between $f = E_n \hat{f}_u$, $\hat{f}_u = X_u y$, $E_n \hat{f}_\theta$, and $\hat{f}_\theta = X_\theta y$.)

  12. Estimation of noise variance
      $E_n \|\hat{f}_\theta - f\|^2 \approx \|\hat{f}_\theta - \hat{f}_u\|^2 - \sigma^2\,\mathrm{tr}(X_0 X_0^{*}) + \sigma^2\,\mathrm{tr}(X_\theta X_\theta^{*})$
      (generalization error $\approx$ bias estimate + variance),
      where $\sigma^2$: noise variance, $X_0 = X_\theta - X_u$, $X^{*}$: adjoint operator of $X$.
      The noise variance is estimated by
      $\hat{\sigma}^2 = \dfrac{\sum_{m=1}^{M} \left( \hat{f}_u(x_m) - y_m \right)^2}{M - \dim H}$,
      and $\hat{\sigma}^2$ is an unbiased estimate of $\sigma^2$.
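
A direct sketch of the noise-variance estimate, assuming the residuals of the unbiased model $\hat{f}_u$ at the sample points are available; the function name is hypothetical.

```python
import numpy as np

def noise_variance_estimate(y, y_hat_u, dim_H):
    """Unbiased estimate of sigma^2 from the unbiased model's residuals:
    sigma2_hat = sum_m (f_hat_u(x_m) - y_m)^2 / (M - dim H)."""
    M = len(y)
    return np.sum((np.asarray(y_hat_u) - np.asarray(y)) ** 2) / (M - dim_H)
```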

  13. Subspace Information Criterion (SIC)
      From a set of models, select the model minimizing
      $\mathrm{SIC} = \|\hat{f}_\theta - \hat{f}_u\|^2 - \hat{\sigma}^2\,\mathrm{tr}(X_0 X_0^{*}) + \hat{\sigma}^2\,\mathrm{tr}(X_\theta X_\theta^{*})$.
      The model minimizing SIC is called the minimum SIC (MSIC) model.
      The MSIC model is expected to provide the optimal generalization capability.
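
A minimal sketch of computing SIC for one candidate model. It assumes the learning operators $X_\theta$ and $X_u$ are represented as real matrices mapping the sample vector $y$ to coefficient vectors in a common orthonormal basis of $H$, so that the Hilbert norm becomes the Euclidean norm of coefficients and the adjoint becomes the transpose; these representational choices are assumptions of the sketch, not part of the slide.

```python
import numpy as np

def sic(X_theta, X_u, y, sigma2_hat):
    """Subspace Information Criterion for one model.

    X_theta, X_u : (dim H, M) matrices mapping y to the coefficients of
                   f_hat_theta and f_hat_u in an orthonormal basis of H
    y            : sample vector of length M
    sigma2_hat   : noise-variance estimate
    """
    X0 = X_theta - X_u                              # X_0 = X_theta - X_u
    diff = X0 @ y                                   # coefficients of f_hat_theta - f_hat_u
    bias_estimate = diff @ diff - sigma2_hat * np.trace(X0 @ X0.T)
    variance = sigma2_hat * np.trace(X_theta @ X_theta.T)
    return bias_estimate + variance
```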

  14. Validity of SIC
      SIC gives an unbiased estimate of the generalization error:
      $E_n\,\mathrm{SIC} = E_n \|\hat{f}_\theta - f\|^2$, where $E_n$ denotes the expectation over the noise.
      cf. AIC gives an asymptotically unbiased estimate of the generalization error.
      Hence SIC is expected to work well even when the number of training examples is small.

  15. Illustrative Simulation
      Target function: a fixed trigonometric polynomial of order 5,
      $f(x) = \sum_{n=1}^{5} \left( a_n \sin nx + b_n \cos nx \right)$ with fixed coefficients $a_n, b_n$.
      Sample points and values: $x_m = -\pi - \frac{\pi}{M} + \frac{2\pi m}{M}$, $y_m = f(x_m) + n_m$, with $n_m$ subject to $N(0, 3)$.
      Compared models: $\{S_1, S_2, \ldots, S_{20}\}$, where $S_N$ is the Hilbert space spanned by $\{1, \sin nx, \cos nx\}_{n=1}^{N}$ defined on $[-\pi, \pi]$.
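
A sketch of the data-generating setup, assuming the equispaced sample points and $N(0, 3)$ noise stated above. The target below is only a placeholder order-5 trigonometric polynomial with unit coefficients, since the slide's exact coefficients are not used here; all names are hypothetical.

```python
import numpy as np

M = 50                                   # number of training examples
rng = np.random.default_rng(0)

def target(x):
    # Placeholder order-5 trigonometric target (coefficients assumed, not the slide's).
    return sum(np.sin(n * x) + np.cos(n * x) for n in range(1, 6))

# Equispaced sample points on (-pi, pi) and noisy sample values
x = -np.pi - np.pi / M + 2 * np.pi * np.arange(1, M + 1) / M
y = target(x) + rng.normal(0.0, np.sqrt(3.0), size=M)    # n_m ~ N(0, 3)

def trig_basis(N):
    """Orthonormal basis of S_N = span{1, sin nx, cos nx : n = 1..N}
    under the L2 inner product on [-pi, pi]."""
    basis = [lambda t: np.ones_like(t) / np.sqrt(2 * np.pi)]
    for n in range(1, N + 1):
        basis.append(lambda t, n=n: np.sin(n * t) / np.sqrt(np.pi))
        basis.append(lambda t, n=n: np.cos(n * t) / np.sqrt(np.pi))
    return basis
```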

  16. Compared model selection criteria
      • SIC: $H = S_{20}$, $\dim(H) = 41$.
      • Network information criterion (NIC) (Murata, Yoshizawa, & Amari, 1994), a generalized AIC.
      In this simulation, SIC and NIC are compared under fair conditions.
      Error measure: $\mathrm{Error} = \|\hat{f} - f\|^2 = \frac{1}{2\pi} \int_{-\pi}^{\pi} \left( \hat{f}(x) - f(x) \right)^2 dx$.
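
Putting the pieces together, here is a sketch of the model-selection loop over $S_1, \ldots, S_{20}$ using SIC. It relies on the `trig_basis`, data (`x`, `y`), `sic`, and `noise_variance_estimate` helpers sketched above, and on representing each $X_\theta$ in the orthonormal basis of $H = S_{20}$; all of these are assumptions of the sketch.

```python
import numpy as np

H_basis = trig_basis(20)                             # H = S_20, dim(H) = 41
A_H = np.column_stack([phi(x) for phi in H_basis])   # design matrix of H
X_u = np.linalg.pinv(A_H)                            # unbiased learning operator
y_hat_u = A_H @ (X_u @ y)                            # f_hat_u at the sample points
sigma2_hat = noise_variance_estimate(y, y_hat_u, dim_H=A_H.shape[1])

scores = []
for N in range(1, 21):
    d = 2 * N + 1                                    # dim(S_N)
    X_theta = np.zeros_like(X_u)
    X_theta[:d, :] = np.linalg.pinv(A_H[:, :d])      # S_N is spanned by the first d basis functions of H
    scores.append(sic(X_theta, X_u, y, sigma2_hat))

msic_model = int(np.argmin(scores)) + 1              # index N of the MSIC model S_N
print("MSIC model: S_%d" % msic_model)
```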

  17. $M = 200$
      Optimal model: $S_5$ (Error 0.11). MSIC model: $S_5$ (Error 0.11). MNIC model: $S_6$ (Error 0.17).

  18. $M = 100$
      Optimal model: $S_5$ (Error 0.37). MSIC model: $S_5$ (Error 0.37). MNIC model: $S_9$ (Error 0.75).

  19. $M = 50$
      Optimal model: $S_5$ (Error 0.98). MSIC model: $S_5$ (Error 0.98). MNIC model: $S_{20}$ (Error 3.36).
      SIC works well even when $M$ is small.

  20. Unrealizable case
      Estimate a chaotic series $\{h_p\}_{p=1}^{200}$ from $M$ sample values $\{y_m\}_{m=1}^{M}$.
      (Illustration: the chaotic series and the sample values for $M = 100$.)

  21. Estimation of chaotic series
      Consider the sample points $x_p = -0.995 + \frac{2(p-1)}{200}$ corresponding to the chaotic series $\{h_p\}_{p=1}^{200}$.
      $\hat{h}_p = \hat{f}\!\left( -0.995 + \frac{2(p-1)}{200} \right)$ is an estimate of $h_p$.
      $\mathrm{Error} = \sum_{p=1}^{200} \left( \hat{h}_p - h_p \right)^2$.
      We perform the simulation 1000 times.
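
A small sketch of the error computation, where `f_hat` is any fitted function (for instance obtained with the `lms_fit` sketch above) and `h` is the 200-point chaotic series; both are placeholders here, and the function name is hypothetical.

```python
import numpy as np

def chaotic_series_error(f_hat, h):
    """Error = sum_{p=1}^{200} (h_hat_p - h_p)^2 with
    h_hat_p = f_hat(-0.995 + 2 (p - 1) / 200)."""
    p = np.arange(1, 201)
    x_p = -0.995 + 2.0 * (p - 1) / 200       # sample points of the series
    h_hat = f_hat(x_p)                       # estimates of h_p
    return np.sum((h_hat - np.asarray(h)) ** 2)
```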

  22. Compared model selection criteria
      • SIC: $H = S_{40}$, $\dim(H) = 41$.
      • NIC: the log loss is adopted as the loss function, and $\{x_m\}_{m=1}^{M}$ are regarded as uniformly distributed.
      Compared models: $\{S_{15}, S_{20}, S_{25}, S_{30}, S_{35}, S_{40}\}$, where $S_N$ is the Hilbert space spanned by $\{x^n\}_{n=0}^{N}$ defined on $[-1, 1]$.
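
For this unrealizable case, a sketch of the candidate model spaces. Plain monomials are used for readability; for orders up to 40 an orthogonal polynomial family (e.g. Legendre) would be better conditioned numerically, a practical consideration not discussed on the slide. The names are hypothetical.

```python
import numpy as np

def poly_basis(N):
    """Basis of S_N = span{x^n : n = 0..N} on [-1, 1]."""
    return [lambda t, n=n: t ** n for n in range(N + 1)]

candidate_models = {N: poly_basis(N) for N in (15, 20, 25, 30, 35, 40)}
H_basis = candidate_models[40]               # H = S_40, dim(H) = 41
```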

  23. $M = 250$
      (Histograms of the error over the 1000 runs.) Mean error: about 0.002 for both SIC and NIC (0.0021 vs. 0.0022).

  24. $M = 150$
      Mean error: SIC 0.0058, NIC 0.013.

  25. $M = 50$
      Mean error: SIC 0.018, NIC 0.040.
      SIC works well even when $M$ is small.

  26. Conclusions
      • We proposed a new model selection criterion named the subspace information criterion (SIC).
      • SIC gives an unbiased estimate of the generalization error.
      • SIC works well even when the number of training examples is small.
