ICANNGA 2001, April 25, 2001.

Slide 1: Model Selection with Small Samples

Masashi Sugiyama and Hidemitsu Ogawa
Department of Computer Science, Tokyo Institute of Technology, Japan
Slide 2: Supervised Learning

From training examples \{(x_m, y_m)\}_{m=1}^{M}, where

    y_m = f(x_m) + \epsilon_m,    \epsilon_m: mean 0, variance \sigma^2,

obtain a learning result \hat{f}(x) that minimizes the generalization error J_G:

    J_G = E \int (\hat{f}(u) - f(u))^2 p(u) du

f(x): target function; E: expectation over the noise; p(u): density of future (test) input points u.
Slide 3: For the Time Being, We Assume...

- The target function f(x) is a linear combination of \mu specified basis functions \{\varphi_i(x)\}_{i=1}^{\mu}:

      f(x) = \sum_{i=1}^{\mu} \theta_i \varphi_i(x)

- The correlation matrix U of the future input u is known:

      U_{ij} = \int \varphi_i(u) \varphi_j(u) p(u) du

Later, we will discuss the case when these assumptions do not hold.
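When p(u) is known in closed form, U can be obtained analytically or by numerical integration. Below is a minimal sketch, assuming a trigonometric basis and a uniform test-input density on [-pi, pi]; both choices, and all names, are illustrative rather than taken from the talk:

```python
import numpy as np

P = 3  # basis {1, sin(pu), cos(pu)} for p = 1..P (illustrative size)
basis = [lambda u: np.ones_like(u)]
for p in range(1, P + 1):
    basis.append(lambda u, p=p: np.sin(p * u))
    basis.append(lambda u, p=p: np.cos(p * u))

u = np.linspace(-np.pi, np.pi, 20001)
du = u[1] - u[0]
p_u = np.full_like(u, 1.0 / (2.0 * np.pi))     # uniform density on [-pi, pi]
Phi = np.stack([b(u) for b in basis], axis=1)  # column i holds phi_i(u)

# Riemann-sum approximation of U_ij = \int phi_i(u) phi_j(u) p(u) du.
U = (Phi * p_u[:, None]).T @ Phi * du
# For this basis and density, U is approximately diag(1, 1/2, 1/2, ..., 1/2).
```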
Slide 4: Subset Regression Models

    \hat{f}_S(x) = \sum_{i \in S} [\hat{\theta}_S]_i \varphi_i(x)

S: subset of the indices \{1, 2, ..., \mu\} (\mu: number of basis functions).

\hat{\theta}_S is determined so that the training error \sum_{m=1}^{M} (\hat{f}_S(x_m) - y_m)^2 is minimized:

    \hat{\theta}_S = X_S y,    X_S = (A_S^T A_S)^{\dagger} A_S^T,

where \dagger denotes the Moore-Penrose generalized inverse (needed since A_S has zero columns), and

    [A_S]_{mi} = \varphi_i(x_m) if i \in S, 0 otherwise,
    y = (y_1, y_2, ..., y_M)^T.
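A minimal sketch of computing \hat{\theta}_S, assuming the trigonometric basis used later in the talk; the toy target, noise level, and subset here are illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)
M = 50                                   # number of training examples
P = 3
mu = 2 * P + 1                           # total number of basis functions

def design(x):
    """Design matrix A with basis {1, sin(px), cos(px)}, p = 1..P."""
    cols = [np.ones_like(x)]
    for p in range(1, P + 1):
        cols.extend([np.sin(p * x), np.cos(p * x)])
    return np.stack(cols, axis=1)        # shape (M, mu)

x = rng.uniform(-np.pi, np.pi, M)
y = np.sin(x) + 0.3 * rng.standard_normal(M)   # toy noisy observations

A = design(x)
S = [0, 1, 2]                            # subset of basis-function indices

# theta_S = X_S y: least squares on the columns in S, zero elsewhere.
theta_S = np.zeros(mu)
theta_S[S] = np.linalg.lstsq(A[:, S], y, rcond=None)[0]
```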
Slide 5: Model Selection

Select the best subset S of basis functions so that the generalization error J_G[S] is minimized:

    J_G[S] = E \int (\hat{f}_S(u) - f(u))^2 p(u) du

However, J_G includes the unknown target function f(x).

We derive an estimate of J_G called the subspace information criterion (SIC), and the model is determined so that SIC is minimized.
Slide 6: Key Idea: Unbiased Estimate

Largest model:

    \hat{f}_u(x) = \sum_{i=1}^{\mu} [\hat{\theta}_u]_i \varphi_i(x)

\hat{\theta}_u: minimum-training-error estimate,

    \hat{\theta}_u = X_u y,    X_u = (A^T A)^{-1} A^T,
    [A]_{mi} = \varphi_i(x_m),    y = (y_1, y_2, ..., y_M)^T.

\hat{\theta}_u is an unbiased estimate of the true parameter \theta:

    E \hat{\theta}_u = \theta    (E: expectation over the noise)

\hat{\theta}_u is used for estimating the generalization error of \hat{\theta}_S.
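A quick Monte Carlo check of this unbiasedness, using a random stand-in design matrix (all constants here are illustrative, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(1)
M, mu = 50, 5
A = rng.standard_normal((M, mu))     # stand-in design matrix [A]_mi = phi_i(x_m)
theta = rng.standard_normal(mu)      # "true" parameter, known here by construction
sigma = 0.5

X_u = np.linalg.pinv(A)              # equals (A^T A)^{-1} A^T when A has full column rank

# Average theta_u_hat over many noise realizations; the mean approaches theta.
est = np.mean(
    [X_u @ (A @ theta + sigma * rng.standard_normal(M)) for _ in range(20000)],
    axis=0,
)
print(np.max(np.abs(est - theta)))   # close to 0: E[theta_u_hat] = theta
```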
Slide 7: Bias / Variance Decomposition

    J_G[S] = E \int (\hat{f}_S(u) - f(u))^2 p(u) du
           = E \| \hat{\theta}_S - \theta \|_U^2
           = \| E\hat{\theta}_S - \theta \|_U^2  +  E \| \hat{\theta}_S - E\hat{\theta}_S \|_U^2
           =        Bias J_B[S]                  +         Variance J_V[S]

where \| \theta \|_U^2 = \theta^T U \theta, U_{ij} = \int \varphi_i(u) \varphi_j(u) p(u) du, and E is the expectation over the noise.
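The last equality holds because the cross term vanishes; written out in the notation above:

```latex
E\|\hat{\theta}_S-\theta\|_U^2
  = E\|(\hat{\theta}_S-E\hat{\theta}_S)+(E\hat{\theta}_S-\theta)\|_U^2
  = E\|\hat{\theta}_S-E\hat{\theta}_S\|_U^2
    + 2\,E(\hat{\theta}_S-E\hat{\theta}_S)^{\top} U\,(E\hat{\theta}_S-\theta)
    + \|E\hat{\theta}_S-\theta\|_U^2
  = J_V[S] + 0 + J_B[S],
% the middle term is zero since E(\hat{\theta}_S - E\hat{\theta}_S) = 0.
```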
Slide 8: Unbiased Estimate of the Variance

    J_V[S] = E \| \hat{\theta}_S - E\hat{\theta}_S \|_U^2 = \sigma^2 trace(U X_S X_S^T)

The noise variance is estimated from the largest model:

    \hat{\sigma}^2 = \| A \hat{\theta}_u - y \|^2 / (M - \mu),    E \hat{\sigma}^2 = \sigma^2

This yields

    \hat{J}_V[S] = \hat{\sigma}^2 trace(U X_S X_S^T),    E \hat{J}_V = J_V

(E: expectation over the noise.)
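A minimal sketch of these two estimates, with a random stand-in design matrix and U = I for brevity (all names and constants are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
M, mu = 100, 9
A = rng.standard_normal((M, mu))           # stand-in design matrix
y = A @ rng.standard_normal(mu) + 0.3 * rng.standard_normal(M)
U = np.eye(mu)                             # illustrative correlation matrix

theta_u = np.linalg.pinv(A) @ y
sigma2_hat = np.sum((A @ theta_u - y) ** 2) / (M - mu)   # noise-variance estimate

S = [0, 1, 2, 3]
X_S = np.zeros((mu, M))
X_S[S] = np.linalg.pinv(A[:, S])           # X_S y gives theta_S_hat
J_V_hat = sigma2_hat * np.trace(U @ X_S @ X_S.T)
```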
Slide 9: Unbiased Estimate of the Bias

    J_B[S] = \| E\hat{\theta}_S - \theta \|_U^2

Let z = (f(x_1), f(x_2), ..., f(x_M))^T, \epsilon = (\epsilon_1, \epsilon_2, ..., \epsilon_M)^T, and X_0 = X_S - X_u. Since y = z + \epsilon and E\hat{\theta}_u = \theta,

    \hat{\theta}_S - \hat{\theta}_u = X_0 y = X_0 z + X_0 \epsilon,
    E \| \hat{\theta}_S - \hat{\theta}_u \|_U^2 = \| X_0 z \|_U^2 + \sigma^2 trace(U X_0 X_0^T),

and J_B[S] = \| X_0 z \|_U^2. Replacing \sigma^2 by \hat{\sigma}^2 gives

    \hat{J}_B[S] = \| \hat{\theta}_S - \hat{\theta}_u \|_U^2 - \hat{\sigma}^2 trace(U X_0 X_0^T),    E \hat{J}_B = J_B

The key step is the rough estimate: the unbiased \hat{\theta}_u = X_u y is substituted for the unknown \theta when evaluating the bias of \hat{\theta}_S = X_S y.
Slide 10: Subspace Information Criterion (SIC)

    SIC[S] = \hat{J}_B[S] + \hat{J}_V[S]
           = \| \hat{\theta}_S - \hat{\theta}_u \|_U^2
             - \hat{\sigma}^2 trace(U X_0 X_0^T)
             + \hat{\sigma}^2 trace(U X_S X_S^T),

where X_0 = X_S - X_u.

SIC is an unbiased estimate of the generalization error J_G with finite samples:

    E SIC[S] = J_G[S] = E \int (\hat{f}_S(u) - f(u))^2 p(u) du
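Putting slides 8-10 together, here is a minimal end-to-end sketch of computing SIC and selecting among nested subsets. The basis, toy target, noise level, and uniform density are illustrative assumptions; only the SIC formula itself follows the slides:

```python
import numpy as np

rng = np.random.default_rng(0)
M, P = 200, 10
mu = 2 * P + 1                        # largest model: {1, sin px, cos px}, p <= P

def design(x):
    cols = [np.ones_like(x)]
    for p in range(1, P + 1):
        cols.extend([np.sin(p * x), np.cos(p * x)])
    return np.stack(cols, axis=1)

f = lambda t: np.sin(t) + 0.5 * np.cos(2 * t)        # toy target in the span
x = rng.uniform(-np.pi, np.pi, M)
y = f(x) + np.sqrt(0.3) * rng.standard_normal(M)

A = design(x)
# For p(u) uniform on [-pi, pi] this basis is orthogonal: U = diag(1, 1/2, ...).
U = np.diag([1.0] + [0.5] * (mu - 1))

X_u = np.linalg.pinv(A)
theta_u = X_u @ y
sigma2_hat = np.sum((A @ theta_u - y) ** 2) / (M - mu)

def sic(S):
    """SIC[S] = J_B_hat + J_V_hat for the subset S of basis indices."""
    X_S = np.zeros((mu, M))
    X_S[S] = np.linalg.pinv(A[:, S])
    theta_S = X_S @ y
    X_0 = X_S - X_u
    d = theta_S - theta_u
    J_B_hat = d @ U @ d - sigma2_hat * np.trace(U @ X_0 @ X_0.T)
    J_V_hat = sigma2_hat * np.trace(U @ X_S @ X_S.T)
    return J_B_hat + J_V_hat

# Nested candidates S_n = {1, sin px, cos px : p <= n}, n = 0..P.
models = [list(range(2 * n + 1)) for n in range(P + 1)]
best = min(models, key=sic)
print("selected number of basis functions:", len(best))
```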
Slide 11: When the Assumptions Do Not Hold (1)

When the target function f(x) is not included in span\{\varphi_i(x)\}_{i=1}^{\mu}:

    J_G = E \int (\hat{f}_S(u) - f(u))^2 p(u) du
        = E \| \hat{\theta}_S - \theta^* \|_U^2 + const.

\theta^*: best approximation of f in span\{\varphi_i(x)\}_{i=1}^{\mu}.

SIC is an asymptotically unbiased estimate of E \| \hat{\theta}_S - \theta^* \|_U^2:

    E SIC \to E \| \hat{\theta}_S - \theta^* \|_U^2    as M \to \infty

(M: number of training samples.)
Slide 12: When the Assumptions Do Not Hold (2)

When the correlation matrix U_{ij} = \int \varphi_i(u) \varphi_j(u) p(u) du is not available:

- If unlabeled samples \{u_m\}_{m=1}^{M'} (samples without output values) are available, use the empirical estimate, as sketched after this list:

      \hat{U}_{ij} = (1/M') \sum_{m=1}^{M'} \varphi_i(u_m) \varphi_j(u_m)

- If the training samples \{x_m\}_{m=1}^{M} are used instead, SIC essentially agrees with Mallows' C_P.

- Or simply use \hat{U} = I (the identity matrix).
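A minimal sketch of the empirical estimate \hat{U} from unlabeled samples; the basis and the sampling density are illustrative:

```python
import numpy as np

rng = np.random.default_rng(3)
P = 3

def design(u):
    cols = [np.ones_like(u)]
    for p in range(1, P + 1):
        cols.extend([np.sin(p * u), np.cos(p * u)])
    return np.stack(cols, axis=1)

u = rng.uniform(-np.pi, np.pi, 5000)   # unlabeled inputs: no output values needed
Phi = design(u)
U_hat = Phi.T @ Phi / len(u)           # U_hat_ij = (1/M') sum_m phi_i(u_m) phi_j(u_m)
```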
Slide 13: Computer Simulation

- Target function: f(x) = (1/10) \sum_{p=1}^{50} (\sin px + \cos px)
- x_m: randomly created in [-\pi, \pi]
- y_m = f(x_m) + \epsilon_m, where \epsilon_m is subject to N(0, \sigma^2)
- Basis functions (\mu = 201): \{1, \sin px, \cos px\}_{p=1}^{100}
- Compared models: \{S_0, S_{10}, ..., S_{100}\}, where S_n = \{1, \sin px, \cos px\}_{p=1}^{n}
- Error = (1/2\pi) \int_{-\pi}^{\pi} (\hat{f}(u) - f(u))^2 du

A sketch of this setup is given below.
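An illustrative reimplementation of this data-generating setup; the random seed and code organization are my own, while the constants follow the slide:

```python
import numpy as np

rng = np.random.default_rng(4)
M, sigma2 = 250, 0.6                   # one of the four simulated settings

def f(x):
    # Target: (1/10) * sum_{p=1}^{50} (sin px + cos px)
    return sum(np.sin(p * x) + np.cos(p * x) for p in range(1, 51)) / 10.0

x = rng.uniform(-np.pi, np.pi, M)
y = f(x) + np.sqrt(sigma2) * rng.standard_normal(M)

# Candidate models S_n = {1, sin px, cos px}_{p=1}^n for n = 0, 10, ..., 100.
candidate_n = list(range(0, 101, 10))
```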
Slide 14: Compared Methods

- SIC
- Mallows' C_P
- Leave-one-out cross-validation (CV)
- Akaike's information criterion (AIC)
- Sugiura's corrected AIC (cAIC)
- Schwarz's Bayesian information criterion (BIC)
- Vapnik's measure (VM)

Simulations are performed 100 times for each of

    (M, \sigma^2) = (250, 0.6), (500, 0.6), (250, 0.2), (500, 0.2)

(M: number of training samples; \sigma^2: noise variance.)
Slide 15: Easiest Case (Curve)

[Figure: learned curves for (M, \sigma^2) = (250, 0.6), (500, 0.6), (250, 0.2), (500, 0.2).]
Slide 16: Easiest Case (Error)

[Figure: obtained errors for (M, \sigma^2) = (250, 0.6), (500, 0.6), (250, 0.2), (500, 0.2).]
Slide 17: Hardest Case

[Figure: results for (M, \sigma^2) = (250, 0.6), (500, 0.6), (250, 0.2), (500, 0.2).]
Slide 18: Summary of Simulations

[Table: for each setting of M and \sigma^2, each method (SIC, C_P, CV, AIC, cAIC, BIC, VM) is marked as "works well", "selects smaller models", or "selects larger models".]

(M: number of training samples; \sigma^2: noise variance.)
ICANNGA2001 April 25, 2001. 19 µ ϕ µ ϕ When is not in . ( x ) f L { i x ( )} When is not in . ( x ) f L { i x ( )} = = i 1 i 1 Interpolation of chaotic series Similar results were obtained !!
Slide 20: Conclusions

- We proposed a new model selection criterion called the subspace information criterion (SIC).
- SIC gives an unbiased estimate of the generalization error.
- Computer simulations showed that SIC works well with small samples and large noise.