APPLIED MACHINE LEARNING – 2011-2012

Overview
Exam Format

The exam lasts a total of 3 hours:
- Upon entering the room, you must leave your bag, cell phone, etc. in a corner of the room; you may keep a pen/pencil/eraser and a few blank sheets of paper.
- The exam will be graded anonymously; make sure to have your Camipro card with you to write your SCIPER number on your exam sheet, as we will check your card.
- The exam is closed book, but you may bring one A4 page with personal handwritten notes, written recto-verso.
What to know for the exam

Formalism / Taxonomy:
• You should be able to give formal definitions of a pdf, a marginal, and a likelihood.
• You should know the difference between supervised and unsupervised learning and be able to give examples of algorithms in each case.

Principles of evaluation:
• You should know the basic principles of evaluation of ML techniques: training vs. testing sets, cross-validation, ground truth.
• You should know the principle of each method of evaluation seen in class and which method of evaluation to apply where (F-measure in clustering vs. classification, BIC, etc.).
What to know for the exam

• For each algorithm, be able to explain:
  – what it can do: classification, regression, structure discovery / reduction of dimensionality;
  – what one should be careful about (limitations of the algorithm, choice of hyperparameters) and how this choice influences the results;
  – the key steps of the algorithm, its hyperparameters, the variables it takes as input, and the variables it outputs.
What to know for the exam

• For each algorithm, be able to explain (example: SVM):
  – what it can do: classification, regression, structure discovery / reduction of dimensionality.
    SVM performs binary classification; it can be extended to multi-class classification, and it can be extended to regression (SVR).
  – what one should be careful about (limitations of the algorithm, choice of hyperparameters).
    E.g. the choice of kernel; too small a kernel width in Gaussian kernels may lead to over-fitting.
  – the key steps of the algorithm, its hyperparameters, the variables it takes as input, and the variables it outputs.
Class Overview

This overview is meant to highlight similarities and differences across the different methods presented in class. To be well prepared for the exam, read carefully the slides, the exercises, and their solutions.
Class Overview

This class has presented groups of methods for structure discovery, classification, and non-linear regression:

- Structure Discovery: PCA; Clustering Techniques (K-means, Soft K-means, GMM)
- Classification: SVM; GMM + Bayes
- Regression: SVR; GMR
Overview: Finding Structure in Data

Techniques for finding structure in data proceed by projecting or grouping the data from the original space into another space of lower dimension. The projected space is chosen so as to highlight particular features common to subsets of datapoints.

Pre-processing step: the structure found may be exploited in a second stage by another algorithm for regression, classification, etc.
Overview: Finding Structure in Data

Principal Component Analysis (PCA)

$Y = AX$: each datapoint $x \in \mathbb{R}^N$ is projected to $y \in \mathbb{R}^q$, with $q \le N$.

- Determines what is most common across datapoints.
- Projects onto the axes that carry the most variance (the eigenvectors of the covariance matrix).
- The lower-dimensional projection can help discriminate across subgroups of datapoints!
- Discard the dimensions with the smallest eigenvalues.
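As an illustration, here is a minimal NumPy sketch of this projection (the function name `pca_project` and the row-per-datapoint layout are my own choices, not from the slides):

```python
import numpy as np

def pca_project(X, q):
    """Minimal PCA sketch: project N-dimensional datapoints (rows of X)
    onto the q eigenvectors of the covariance matrix with the largest
    eigenvalues."""
    X_centered = X - X.mean(axis=0)          # center the data
    cov = np.cov(X_centered, rowvar=False)   # N x N covariance matrix
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigh returns ascending order
    order = np.argsort(eigvals)[::-1]        # sort descending by eigenvalue
    A = eigvecs[:, order[:q]]                # keep the top q eigenvectors
    return X_centered @ A                    # datapoints projected into R^q

# Example: project 5-D data down to 2-D
X = np.random.randn(100, 5)
Y = pca_project(X, q=2)
```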
Overview: Finding Structure in Data

Clustering Methods

All three clustering methods seen in class (K-means, soft K-means, GMM) are solved through E-M (expectation-maximization). You should be able to spell out the similarities and differences across K-means, soft K-means, and GMM:
- They are similar in their representation of the problem and in their optimization method (a sketch of the shared alternating scheme is given below).
- They differ in the number of parameters to estimate, the number of hyper-parameters, etc.
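A minimal sketch of the shared alternating scheme, instantiated here for K-means (hard assignments); soft K-means and GMM replace the hard E-step with soft responsibilities and the M-step with weighted updates. Function name and initialization strategy are my own choices:

```python
import numpy as np

def kmeans(X, K, n_iter=100, seed=0):
    """Minimal K-means: alternates an E-like step (assign each point to
    its nearest center) and an M-like step (recompute each center as the
    mean of its assigned points)."""
    rng = np.random.default_rng(seed)
    centers = X[rng.choice(len(X), K, replace=False)]  # init from datapoints
    for _ in range(n_iter):
        # E-step: hard assignment to the closest center (Euclidean metric)
        dists = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # M-step: update each center as the mean of its cluster
        new_centers = np.array([
            X[labels == k].mean(axis=0) if np.any(labels == k) else centers[k]
            for k in range(K)
        ])
        if np.allclose(new_centers, centers):  # converged
            break
        centers = new_centers
    return centers, labels
```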
Overview: Finding Structure in Data

Clustering Methods and Metric of Similarity

All clustering methods depend on choosing well a metric of similarity to measure how similar subgroups of datapoints are. You should be able to list which metric of similarity can be used in each case and how this choice may impact the clustering (see the sketch after this list):

- K-means: an Lp-norm.
- Soft K-means: an exponentially decreasing function of the distance, modulated by the stiffness; approximately an isotropic RBF (unnormalized Gauss function).
- GMM: the likelihood of each Gauss function; can use isotropic, diagonal, and full covariance matrices.
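A side-by-side sketch of the three assignment rules (function names are my own; the isotropic-Gaussian form with stiffness `beta` for soft K-means is one common convention):

```python
import numpy as np

def hard_assignment(x, centers):
    """K-means: winner-take-all on the squared Euclidean distance."""
    return np.argmin(np.sum((centers - x) ** 2, axis=1))

def soft_assignment(x, centers, beta=1.0):
    """Soft K-means: responsibilities from an unnormalized Gaussian (RBF)
    of the distance, modulated by the stiffness beta."""
    w = np.exp(-beta * np.sum((centers - x) ** 2, axis=1))
    return w / w.sum()

def gmm_responsibilities(x, means, covs, priors):
    """GMM: responsibilities from the likelihood of each Gaussian, which
    may carry isotropic, diagonal, or full covariance matrices."""
    K, d = means.shape
    p = np.empty(K)
    for k in range(K):
        diff = x - means[k]
        inv = np.linalg.inv(covs[k])
        norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(covs[k]))
        p[k] = priors[k] * np.exp(-0.5 * diff @ inv @ diff) / norm
    return p / p.sum()
```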
Clustering versus Classification

Fundamental difference between clustering and classification:
• Clustering is unsupervised: the class labels are not given.
• Classification is supervised: the class labels are given.

Both use an F-measure, but not in the same way. The clustering F-measure assumes a semi-supervised setting, in which only a subset of the points are labelled.
Semi-Supervised Learning

Clustering F1-Measure (careful: similar to, but not the same as, the F-measure we will see for classification!)

Tradeoff between clustering correctly all datapoints of the same class in the same cluster and making sure that each cluster contains points of only one class.

Notation:
- $M$: number of labeled datapoints
- $C$: the set of classes; $c_i$: the set of labeled datapoints of class $i$
- $K$: number of clusters
- $n_{ik}$: number of members of class $c_i$ that fall in cluster $k$

The measure weights each class by its fraction of the labeled points and picks, for each class, the cluster with the maximal F1-measure:

$$F(C, K) = \sum_{c_i \in C} \frac{|c_i|}{M} \, \max_{k} F_1(c_i, k), \qquad F_1(c_i, k) = \frac{2\, R(c_i, k)\, P(c_i, k)}{R(c_i, k) + P(c_i, k)}$$

Recall (proportion of datapoints of the class correctly clustered): $R(c_i, k) = \dfrac{n_{ik}}{|c_i|}$

Precision (proportion of datapoints of the same class in the cluster): $P(c_i, k) = \dfrac{n_{ik}}{n_k}$, with $n_k$ the size of cluster $k$.
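A possible implementation of this clustering F1-measure, evaluated over the labeled subset only (function name and array interface are my own choices):

```python
import numpy as np

def clustering_f_measure(labels_true, labels_cluster):
    """Clustering F1 as defined above: for each class, take the cluster
    with the highest F1 and weight it by the class size over M labeled
    points. Both inputs cover only the labeled subset of the data."""
    classes = np.unique(labels_true)
    clusters = np.unique(labels_cluster)
    M = len(labels_true)
    total = 0.0
    for c in classes:
        in_class = labels_true == c
        best = 0.0
        for k in clusters:
            in_cluster = labels_cluster == k
            n_ik = np.sum(in_class & in_cluster)
            if n_ik == 0:
                continue
            recall = n_ik / in_class.sum()       # R(c_i, k) = n_ik / |c_i|
            precision = n_ik / in_cluster.sum()  # P(c_i, k) = n_ik / n_k
            best = max(best, 2 * recall * precision / (recall + precision))
        total += in_class.sum() / M * best
    return total
```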
Performance Measures

Classification F-Measure (careful: similar to, but not the same as, the F-measure we saw for clustering!)

Tradeoff between classifying correctly all datapoints of the same class and making sure that each class contains points of only one class.

- True Positives (TP): number of datapoints of class 1 that are correctly classified
- False Negatives (FN): number of datapoints of class 1 that are incorrectly classified
- False Positives (FP): number of datapoints of class 2 that are incorrectly classified

Recall (proportion of datapoints correctly classified in class 1):
$$\text{Recall} = \frac{TP}{TP + FN}$$

Precision (proportion of datapoints of class 1 correctly classified over all datapoints classified in class 1):
$$\text{Precision} = \frac{TP}{TP + FP}$$

$$F = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$
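For comparison, a minimal implementation of the classification F-measure from the TP/FP/FN counts above (function name and the `positive` label convention are my own choices):

```python
def classification_f_measure(y_true, y_pred, positive=1):
    """Classification F1 computed from TP, FP, FN, as defined above."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    recall = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)
```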
Overview: Classification

[Figure: the same two-class dataset classified with GMM + Bayes (one Gauss function per class, but with full covariance matrix) and with SVM (7 support vectors). The boundary is non-linear in both cases.]

Exercise: compute the number of parameters each model requires for the same fit.
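A possible worked count, assuming 2-D inputs as in the figure (these numbers are my own illustration, not from the slides): a full-covariance Gaussian in $\mathbb{R}^2$ has $2$ mean parameters and $3$ distinct covariance entries, so GMM + Bayes with one Gaussian per class needs $2 \times (2 + 3) = 10$ parameters, plus one free class prior, i.e. $11$ in total. The SVM solution stores $7$ support vectors of $2$ coordinates each, their $7$ weights, and the bias $b$, i.e. $7 \times 2 + 7 + 1 = 22$ values.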
Kernel Methods

We have seen two examples of kernel methods with SVM/SVR. Kernel methods implicitly search for structure in the data prior to performing another computation (classification or regression).

- The kernel allows one to extract non-linear types of correlations.
- These methods exploit the Kernel Trick: all linear methods for finding structure in data are based on computing an inner product across variables. This inner product can be replaced by the kernel function, if known. The problem then becomes linear in feature space.

The kernel is a metric of similarity across datapoints:

$$k: X \times X \to \mathbb{R}, \qquad k(x_i, x_j) = \left\langle \phi(x_i), \phi(x_j) \right\rangle$$
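As a concrete example, a sketch of the Gaussian (RBF) kernel and of the Gram matrix that replaces the inner products of a linear method (the `width` parameterization is one common convention; function names are my own):

```python
import numpy as np

def rbf_kernel(x_i, x_j, width=1.0):
    """Gaussian (RBF) kernel: an inner product in feature space, computed
    without ever constructing phi(x) explicitly."""
    return np.exp(-np.sum((x_i - x_j) ** 2) / (2 * width ** 2))

def kernel_matrix(X, kernel=rbf_kernel):
    """Gram matrix K[i, j] = k(x_i, x_j): the only quantity a kernel
    method needs in place of the inner products x_i . x_j."""
    n = len(X)
    return np.array([[kernel(X[i], X[j]) for j in range(n)] for i in range(n)])
```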
Overview: Regression Techniques

SVR and GMR lead to a regressive model that computes a weighted combination of local predictors. For a query point $x$, predict the associated output $y$:

SVR solution: $\displaystyle y = \sum_{i=1}^{M} (\alpha_i - \alpha_i^*)\, k(x, x_i) + b$

GMR solution: $\displaystyle y = \sum_{i=1}^{K} h_i(x)\, \tilde{\mu}_i(x)$

In SVR, the computation is reduced to summing only over the support vectors (a subset of the datapoints). In GMR, the sum is over the set of Gaussians; the centers of the Gaussians are usually not located on any particular datapoint. The models $\tilde{\mu}_i(x)$ are local!
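A sketch of both prediction rules under the forms above, assuming a 1-D output for GMR (all function and parameter names are my own; `gammas` stands for the differences $\alpha_i - \alpha_i^*$):

```python
import numpy as np

def svr_predict(x, support_vectors, gammas, b, kernel):
    """SVR: weighted sum of kernel evaluations over the support vectors
    only (gammas are the nonzero Lagrange-multiplier differences)."""
    return sum(g * kernel(x, sv) for g, sv in zip(gammas, support_vectors)) + b

def gmr_predict(x, priors, means_x, means_y, covs_xx, covs_yx):
    """GMR: mix the K local linear predictors mu~_k(x), each weighted by
    the responsibility h_k(x) of its Gaussian for the query point."""
    K, d = means_x.shape
    h = np.empty(K)
    local = np.empty(K)
    for k in range(K):
        diff = x - means_x[k]
        inv = np.linalg.inv(covs_xx[k])
        norm = np.sqrt((2 * np.pi) ** d * np.linalg.det(covs_xx[k]))
        h[k] = priors[k] * np.exp(-0.5 * diff @ inv @ diff) / norm
        local[k] = means_y[k] + covs_yx[k] @ inv @ diff  # local linear model
    h /= h.sum()                                         # normalize weights
    return h @ local
```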
Overview: Regression Techniques

[Figure: GMR solution with 8 Gauss functions and full covariance matrices.]

GMR solution: $\displaystyle y = \sum_{i=1}^{K} h_i(x)\, \tilde{\mu}_i(x)$