COMS 4721: Machine Learning for Data Science, Lecture 24 (4/25/2017)


1. COMS 4721: Machine Learning for Data Science
Lecture 24, 4/25/2017
Prof. John Paisley
Department of Electrical Engineering & Data Science Institute
Columbia University

2. MODEL SELECTION

3. MODEL SELECTION
The model selection problem
We've seen how often model parameters need to be set in advance, and we discussed how this can be done using cross-validation. Another type of model selection problem is learning the model order.
Model order: the complexity of a class of models.
◮ Gaussian mixture model: How many Gaussians?
◮ Matrix factorization: What rank?
◮ Hidden Markov models: How many states?
In each of these problems, we can't simply look at the log-likelihood because a more complex model can always fit the data better.

4. MODEL SELECTION
Model order
We will discuss two methods for selecting an "appropriate" complexity of the model. This assumes a good model type was chosen to begin with.

5. EXAMPLE: MAXIMUM LIKELIHOOD
Notation
We write $\mathcal{L}$ for the log-likelihood of a parameter under a model $p(x\mid\theta)$:
$$x_i \overset{iid}{\sim} p(x\mid\theta) \;\Longleftrightarrow\; \mathcal{L} = \sum_{i=1}^{N} \ln p(x_i\mid\theta).$$
The maximum likelihood solution is $\theta_{ML} = \arg\max_\theta \mathcal{L}$.
Example: How many clusters? (the wrong way)
The parameters $\theta$ could be those of a GMM. We could find $\theta_{ML}$ for different numbers of clusters and pick the one with the largest $\mathcal{L}$.
Problem: We can perfectly fit the data by putting each observation in its own cluster and then shrinking the variance of each Gaussian to zero.
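
The short Python sketch below (my addition, not part of the slides) illustrates the "wrong way" above: the maximized log-likelihood keeps growing as the number of GMM clusters grows, even on data generated from only two clusters. The toy data and the use of scikit-learn's GaussianMixture are assumptions made purely for illustration.

# Sketch: the ML log-likelihood never favors fewer clusters.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Toy data: two well-separated 1-D Gaussian clusters, 200 points each.
X = np.concatenate([rng.normal(-3, 1, 200), rng.normal(3, 1, 200)]).reshape(-1, 1)

for K in [1, 2, 4, 8, 16]:
    gmm = GaussianMixture(n_components=K, random_state=0).fit(X)
    # score(X) is the average log-likelihood per point; multiply by N to get L.
    L = gmm.score(X) * X.shape[0]
    print(f"K = {K:2d}   L = {L:8.1f}")
# L typically keeps increasing with K even though the data have only 2 clusters.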

6. NUMBER OF PARAMETERS
The general problem
◮ Models with more degrees of freedom are more prone to overfitting.
◮ The degrees of freedom is roughly the number of scalar parameters, K.
◮ By increasing K (done by increasing the number of clusters, the rank, the number of states, etc.), the model can add more degrees of freedom.
Some common solutions
◮ Stability: Bootstrap sample the data, learn a model, and calculate the likelihood on the original data set. Repeat and pick the best model (a sketch of this idea follows below).
◮ Bayesian nonparametric methods: Each possible value of K is assigned a prior probability. The posterior learns the best K.
◮ Penalization approaches: A penalty term makes adding parameters expensive. The penalty must be overcome by a greater improvement in likelihood.
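
As a hedged illustration of the stability bullet above (my own reading of it, not code from the course), the sketch fits a model on bootstrap resamples and scores each fit on the original data set; the function name stability_score and the choice of a GMM are illustrative assumptions.

import numpy as np
from sklearn.mixture import GaussianMixture

def stability_score(X, K, n_boot=20, seed=0):
    # Average log-likelihood on the original data of models trained on
    # bootstrap resamples; a higher value suggests a more stable choice of K.
    rng = np.random.default_rng(seed)
    N = X.shape[0]
    scores = []
    for _ in range(n_boot):
        idx = rng.integers(0, N, size=N)                       # sample with replacement
        gmm = GaussianMixture(n_components=K, random_state=0).fit(X[idx])
        scores.append(gmm.score(X) * N)                        # L on the original data
    return float(np.mean(scores))

# Usage idea: compute stability_score(X, K) for each candidate K and keep the best.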

7. PENALIZING MODEL COMPLEXITY
General form
Define a penalty function on the number of model parameters. Instead of maximizing $\mathcal{L}$, minimize $-\mathcal{L}$ plus the defined penalty. Two popular penalties are:
◮ Akaike information criterion (AIC): $-\mathcal{L} + K$
◮ Bayesian information criterion (BIC): $-\mathcal{L} + \frac{1}{2}K\ln N$
When $\frac{1}{2}\ln N > 1$, the BIC penalty is larger than the AIC penalty and so encourages a simpler model (this happens when $N \geq 8$).
Example: For NMF with an $M_1 \times M_2$ matrix and a rank-$R$ factorization,
$$\text{AIC} \rightarrow (M_1 + M_2)R, \qquad \text{BIC} \rightarrow \tfrac{1}{2}(M_1 + M_2)R\,\ln(M_1 M_2).$$
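
To make the NMF example concrete, here is a tiny helper (my own, assuming the parameter count K = (M1 + M2)R and N = M1*M2 observed entries, as on the slide) that turns a fitted model's negative log-likelihood into the AIC and BIC scores defined above.

import numpy as np

def nmf_aic_bic(neg_log_lik, M1, M2, R):
    # K scalar parameters in the two factor matrices; N observed matrix entries.
    K = (M1 + M2) * R
    N = M1 * M2
    aic = neg_log_lik + K
    bic = neg_log_lik + 0.5 * K * np.log(N)
    return aic, bic

# Fit NMF at several ranks, plug each model's -L into nmf_aic_bic, and keep
# the rank with the smallest score.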

8. EXAMPLE OF AIC OUTPUT

9. EXAMPLE: AIC VS BIC ON HMM
Notice:
◮ The likelihood is always improving as the number of states grows.
◮ Only compare the locations of the AIC and BIC minima, not their values.

10. DERIVATION OF BIC

11. AIC AND BIC
Recall the two penalties:
◮ Akaike information criterion (AIC): $-\mathcal{L} + K$
◮ Bayesian information criterion (BIC): $-\mathcal{L} + \frac{1}{2}K\ln N$
Algorithmically, there is no extra work required:
1. Find the ML solution of each candidate model and calculate $\mathcal{L}$.
2. Add the AIC or BIC penalty to get a score useful for picking a model.
Q: Where do these penalties come from? Currently they seem arbitrary.
A: We will derive BIC next. AIC also has a theoretical motivation, but we will not discuss that derivation.
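
A minimal sketch of this two-step recipe for the GMM case, using scikit-learn. Note that sklearn's aic()/bic() methods report 2(-L) + 2K and 2(-L) + K ln N, i.e. twice the scores defined above, so the minimizing K is unchanged; the toy data here are an assumption for illustration.

import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Toy data: two 2-D Gaussian clusters.
X = np.vstack([rng.normal(-3, 1, (150, 2)), rng.normal(3, 1, (150, 2))])

candidates = list(range(1, 9))
scores = []
for K in candidates:
    gmm = GaussianMixture(n_components=K, random_state=0).fit(X)  # step 1: ML fit
    scores.append(gmm.bic(X))                                     # step 2: penalized score

best_K = candidates[int(np.argmin(scores))]
print("BIC selects K =", best_K)   # typically 2 on this toy data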

12. DERIVING THE BIC
Imagine we have r candidate models, $\mathcal{M}_1, \ldots, \mathcal{M}_r$. For example, r HMMs, each having a different number of states. We also have data $\mathcal{D} = \{x_1, \ldots, x_N\}$. We want the posterior of each $\mathcal{M}_i$:
$$p(\mathcal{M}_i \mid \mathcal{D}) = \frac{p(\mathcal{D} \mid \mathcal{M}_i)\, p(\mathcal{M}_i)}{\sum_j p(\mathcal{D} \mid \mathcal{M}_j)\, p(\mathcal{M}_j)}.$$
If we assume a uniform prior distribution on models, then because the denominator is constant in $\mathcal{M}_i$, we can pick
$$\widehat{\mathcal{M}} = \arg\max_{\mathcal{M}_i} \ln p(\mathcal{D} \mid \mathcal{M}_i), \qquad \ln p(\mathcal{D} \mid \mathcal{M}_i) = \ln \int p(\mathcal{D} \mid \theta, \mathcal{M}_i)\, p(\theta \mid \mathcal{M}_i)\, d\theta.$$
We're choosing the model with the largest marginal likelihood of the data, obtained by integrating out all parameters of the model. This integral is usually not solvable.

13. DERIVING THE BIC
We will see how the BIC arises from the approximation
$$\widehat{\mathcal{M}} = \arg\max_{\mathcal{M}_i} \ln p(\mathcal{D} \mid \mathcal{M}_i) \approx \arg\max_{\mathcal{M}_i}\left[\ln p(\mathcal{D} \mid \theta_{ML}, \mathcal{M}_i) - \tfrac{1}{2} K \ln N\right].$$
Step 1: Recognize that the difficulty is with the integral
$$\ln p(\mathcal{D} \mid \mathcal{M}_i) = \ln \int p(\mathcal{D} \mid \theta)\, p(\theta)\, d\theta.$$
$\mathcal{M}_i$ determines $p(\mathcal{D}\mid\theta)$ and $p(\theta)$; we will suppress this conditioning.
Step 2: Approximate this integral using a second-order Taylor expansion.

14. DERIVING THE BIC
1. We want to calculate
$$\ln p(\mathcal{D} \mid \mathcal{M}) = \ln \int p(\mathcal{D} \mid \theta)\, p(\theta)\, d\theta = \ln \int \exp\{\ln p(\mathcal{D} \mid \theta)\}\, p(\theta)\, d\theta.$$
2. We use a second-order Taylor expansion of $\ln p(\mathcal{D} \mid \theta)$ at the point $\theta_{ML}$,
$$\ln p(\mathcal{D} \mid \theta) \approx \ln p(\mathcal{D} \mid \theta_{ML}) + (\theta - \theta_{ML})^T \underbrace{\nabla \ln p(\mathcal{D} \mid \theta_{ML})}_{=\,0} + \frac{1}{2}(\theta - \theta_{ML})^T \underbrace{\nabla^2 \ln p(\mathcal{D} \mid \theta_{ML})}_{=\,-\mathcal{J}(\theta_{ML})} (\theta - \theta_{ML}).$$
The gradient term vanishes because $\theta_{ML}$ maximizes $\ln p(\mathcal{D} \mid \theta)$.
3. Approximate $p(\theta)$ as uniform and plug this approximation back in,
$$\ln p(\mathcal{D} \mid \mathcal{M}) \approx \ln p(\mathcal{D} \mid \theta_{ML}) + \ln \int \exp\left\{-\frac{1}{2}(\theta - \theta_{ML})^T \mathcal{J}(\theta_{ML})(\theta - \theta_{ML})\right\} d\theta.$$

15. DERIVING THE BIC
Observation: The integral is the normalizing constant of a Gaussian,
$$\int \exp\left\{-\frac{1}{2}(\theta - \theta_{ML})^T \mathcal{J}(\theta_{ML})(\theta - \theta_{ML})\right\} d\theta = \left(\frac{(2\pi)^{K}}{|\mathcal{J}(\theta_{ML})|}\right)^{1/2}.$$
Remember the definition
$$-\mathcal{J}(\theta_{ML}) = \nabla^2 \ln p(\mathcal{D} \mid \theta_{ML}) \overset{(a)}{=} N\underbrace{\left(\frac{1}{N}\sum_{i=1}^{N} \nabla^2 \ln p(x_i \mid \theta_{ML})\right)}_{\text{converges as } N \text{ increases}}.$$
(a) is by the i.i.d. model assumption made at the beginning of the lecture.

16. DERIVING THE BIC
4. Plugging this in,
$$\ln p(\mathcal{D} \mid \mathcal{M}) \approx \ln p(\mathcal{D} \mid \theta_{ML}) + \ln \left(\frac{(2\pi)^{K}}{|\mathcal{J}(\theta_{ML})|}\right)^{1/2}, \qquad |\mathcal{J}(\theta_{ML})| = N^{K}\,\Big|{-\tfrac{1}{N}\textstyle\sum_{i=1}^{N}\nabla^{2}\ln p(x_i \mid \theta_{ML})}\Big|.$$
Therefore we arrive at the BIC,
$$\ln p(\mathcal{D} \mid \mathcal{M}) \approx \ln p(\mathcal{D} \mid \theta_{ML}) - \frac{1}{2} K \ln N + \underbrace{\text{(something not growing with } N)}_{O(1)\text{ term, so we ignore it}}.$$
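
As a numerical sanity check of this approximation (my own addition, not from the lecture), the sketch below compares the exact log marginal likelihood with the BIC approximation in the simplest conjugate case, x_i ~ N(theta, 1) with a N(0, tau^2) prior on theta, so K = 1; the choice tau^2 = 4 and the data-generating mean are arbitrary assumptions.

import numpy as np

rng = np.random.default_rng(0)
tau2 = 4.0  # prior variance (arbitrary choice for the check)

for N in [10, 100, 1000, 10000]:
    x = rng.normal(1.5, 1.0, N)
    S, xbar = x.sum(), x.mean()

    # Exact: ln p(D) = -N/2 ln(2*pi) - 1/2 sum x_i^2
    #                  - 1/2 ln(1 + N*tau2) + tau2*S^2 / (2*(1 + N*tau2))
    exact = (-0.5 * N * np.log(2 * np.pi) - 0.5 * np.sum(x**2)
             - 0.5 * np.log(1 + N * tau2) + tau2 * S**2 / (2 * (1 + N * tau2)))

    # BIC-style approximation: plug in theta_ML = xbar, subtract (K/2) ln N with K = 1.
    loglik_ml = -0.5 * N * np.log(2 * np.pi) - 0.5 * np.sum((x - xbar)**2)
    approx = loglik_ml - 0.5 * np.log(N)

    print(f"N = {N:6d}   exact = {exact:12.2f}   BIC approx = {approx:12.2f}   gap = {exact - approx:6.2f}")
# The gap settles to an O(1) constant while both quantities grow in magnitude with N,
# which is why the leftover term can be ignored when comparing models.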

17. SOME NEXT STEPS

18. ICML SESSIONS (SUBSET)
The International Conference on Machine Learning (ICML) is a major ML conference. Many of the session titles should look familiar:
◮ Bayesian Optimization and Gaussian Processes
◮ PCA and Subspace Models
◮ Supervised Learning
◮ Matrix Completion and Graphs
◮ Clustering and Nonparametrics
◮ Active Learning
◮ Clustering
◮ Boosting and Ensemble Methods
◮ Matrix Factorization I & II
◮ Kernel Methods I & II
◮ Topic Models
◮ Time Series and Sequences
◮ etc.

19. ICML SESSIONS (SUBSET)
Other sessions might not look so familiar:
◮ Reinforcement Learning I & II
◮ Bandits I & II
◮ Optimization I, II & III
◮ Bayesian Nonparametrics I & II
◮ Online Learning I & II
◮ Graphical Models I & II
◮ Neural Networks and Deep Learning I & II
◮ Metric Learning and Feature Selection
◮ etc.
Many of these topics are taught in advanced machine learning courses at Columbia in the CS, Statistics, IEOR and EE departments.
