• Returning to our data, let's model Play instead of Temp
• Again, we divide our data into X and Y and learn how data X is related to data Y
• Again, assert: $Y = f(X) + \varepsilon$
• We want a model that is:
  • Supervised
  • Predicts categories/classes (a classification model)
• Q: What model could we use?

X = (Age, Temp, Rainy), Y = (Play):

Age  Temp  Rainy  Play
22   91    N      Y
29   89    N      Y
31   56    N      N
23   71    N      Y
37   72    N      Y
41   83    N      Y
29   97    Y      Y
21   64    N      N
30   68    N      Y
[Figure: model map over the Age/Temp/Play data, now showing Linear Regression and Logistic Regression]
Logistic Regression

[Figure: high-level view of $Y = f(X)$; graphically (in NN format), inputs $(x_1, x_2, x_3)$ feed through weights $(\beta_1, \beta_2, \beta_3)$ into a single output $\hat{y}$]

Mathematically:

$\hat{y} = \dfrac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3)}} = \sigma(\beta^\top x)$

• This is a non-linear activation function, called a sigmoid.
• Yet, our overall model is still considered linear w.r.t. the coefficients; it is a generalized linear model.
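As a quick illustration, here is the sigmoid applied by hand in Python; the coefficient values below are made up purely for demonstration:

```python
import numpy as np

def sigmoid(z):
    """Map any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical fitted coefficients: intercept, age, temp (illustrative only)
beta = np.array([-10.0, 0.05, 0.12])

# One observation: age=22, temp=91 (prepend 1 for the intercept term)
x = np.array([1.0, 22.0, 91.0])

p_play = sigmoid(beta @ x)   # P(Play = Y | x)
print(p_play)                # ~0.88 with these made-up coefficients
```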
Logistic Regression

Q1: When training our model, how do we measure its predictions $\hat{y}$?
A1: Cost function: $J(\beta) = -[\, y \log \hat{y} + (1 - y) \log(1 - \hat{y}) \,]$, with $\hat{y} = \sigma(\beta^\top x)$. This is "Cross-Entropy", aka "Log loss".

Q2: How do we find the optimal $\beta$ that yields the best predictions?
A2: scikit-learn has many optimization solvers (e.g., liblinear, newton-cg, saga).
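A minimal sketch of fitting this with scikit-learn on the slide's toy table (Rainy encoded as 0/1; the solver choice is illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

# X = (age, temp, rainy as 0/1), y = play (1 = Y, 0 = N)
X = np.array([[22, 91, 0], [29, 89, 0], [31, 56, 0], [23, 71, 0],
              [37, 72, 0], [41, 83, 0], [29, 97, 1], [21, 64, 0],
              [30, 68, 0]], dtype=float)
y = np.array([1, 1, 0, 1, 1, 1, 1, 0, 1])

# Any of scikit-learn's solvers minimizes the cross-entropy loss J(beta)
model = LogisticRegression(solver="liblinear").fit(X, y)

y_prob = model.predict_proba(X)[:, 1]   # P(Play = Y | x) for each row
print(log_loss(y, y_prob))              # the cross-entropy / log loss
```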
Logistic Regression

Fitted model example: the plane is chosen to minimize the error between our class probabilities (per our loss function, cross-entropy) and the true labels (mapped to 0 or 1).

Photo from http://strijov.com/sources/demoDataGen.php (Dr. Vadim Strijov)
Parametric Models

• So far, we've assumed our data X and Y can be represented by an underlying model $f$ (i.e., $Y = f(X) + \varepsilon$) that has a particular form (e.g., a linear relationship, hence our using a linear model)
• Next, we aimed to fit the model $f$ by estimating its parameters $\beta$ (we did so in a supervised manner)
• Parametric models make the above assumptions; namely, that there exists an underlying model $f$ that has a fixed number of parameters.
                      Regression vs     Supervised vs    Parametric vs
                      Classification    Unsupervised     Non-Parametric
Linear Regression     Regression        Supervised       Parametric
Logistic Regression   Classification    Supervised       Parametric
Non-Parametric Models

Alternatively, what if we make no assumptions about the underlying model $f$? Specifically, let's not assume $f$:
• has any particular distribution/shape (e.g., Gaussian, linear relationship, etc.)
• can be represented by a finite number of parameters.

This would constitute a non-parametric model.
Non-Parametric Models

• Non-parametric models are still allowed to have parameters; in fact, the number of parameters often grows as the amount of training data increases
• Since they make no strong assumptions about the form of the function/model, they are free to learn any functional form from the training data, even an infinitely complex one.
• Returning to our data, let's again predict whether a person will Play
• If we do not want to assume anything about how X and Y relate, we could use a different supervised model
• Suppose we do not care to build a decision boundary but merely want to make predictions based on similar data that we saw during training

[Same X/Y table as before: Age, Temp, Rainy → Play]
[Figure: model map, now adding k-NN alongside Linear and Logistic Regression]
k-NN

Refresher:
• k-NN doesn't train a model
• One merely specifies a value $k$
• At test time, a new piece of data $x$:
  • must be compared to all training data $X$ to determine its k nearest neighbors, per some distance metric $d(x, x')$
  • is classified as the majority class (if categorical) or the average (if quantitative) of its k neighbors
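A hedged sketch of this on the toy table using scikit-learn; the value of k, the distance metric, and the query point are all arbitrary choices for illustration:

```python
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X = np.array([[22, 91], [29, 89], [31, 56], [23, 71], [37, 72],
              [41, 83], [29, 97], [21, 64], [30, 68]], dtype=float)
y = np.array(["Y", "Y", "N", "Y", "Y", "Y", "Y", "N", "Y"])

# Scaling matters: otherwise Temp's larger range dominates the distance d(x, x')
knn = make_pipeline(StandardScaler(),
                    KNeighborsClassifier(n_neighbors=3, metric="euclidean"))
knn.fit(X, y)   # no real "training": the estimator simply stores the data

print(knn.predict([[25, 80]]))   # majority class among the 3 nearest neighbors
```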
k-NN

Conclusion:
• k-NN makes no assumptions about the data $X$ or the form of $f(X)$
• k-NN is a non-parametric model
k-NN

PROS:
• Intuitive and simple approach
• Can model any type of data / places no assumptions on the data
• Fairly robust to missing data
• Good for highly sparse data (e.g., user data, where the columns are thousands of potential items of interest)

CONS:
• Can be very computationally expensive if the data is large or high-dimensional
• Should carefully think about features, including scaling them
• Mixing quantitative and categorical data can be tricky
• Interpretation isn't meaningful
• Often, regression models are better, especially with little data
                      Regression vs     Supervised vs    Parametric vs
                      Classification    Unsupervised     Non-Parametric
Linear Regression     Regression        Supervised       Parametric
Logistic Regression   Classification    Supervised       Parametric
k-NN                  either            Supervised       Non-Parametric
• Returning to our data yet again, let's predict whether a person will Play
• If we do not want to assume anything about how X and Y relate, believing that no single equation can model the possibly non-linear relationship
• Suppose we just want our model to have robust decision boundaries with interpretable results

[Same X/Y table as before: Age, Temp, Rainy → Play]
[Figure: model map, now adding Decision Tree]
Decision Tree

Refresher:
• A Decision Tree iteratively determines how to split our data by the best feature value so as to minimize the entropy (uncertainty) of the resulting sets.
• Must specify the:
  • Splitting criterion (e.g., Gini index, Information Gain)
  • Stopping criterion (e.g., tree depth, Information Gain threshold)
Decision Tree

Refresher:
• Each comparison and branching represents splitting a region in the feature space on a single feature.
• Typically, at each iteration, we split once along one dimension (one predictor).
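A small illustrative fit on the toy table with scikit-learn; the criterion and depth below are arbitrary choices, not prescribed by the slides:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

X = np.array([[22, 91], [29, 89], [31, 56], [23, 71], [37, 72],
              [41, 83], [29, 97], [21, 64], [30, 68]], dtype=float)
y = np.array(["Y", "Y", "N", "Y", "Y", "Y", "Y", "N", "Y"])

# Splitting criterion (entropy) and a stopping criterion (max depth)
tree = DecisionTreeClassifier(criterion="entropy", max_depth=3).fit(X, y)

# Each printed split is on a single feature, as described above
print(export_text(tree, feature_names=["age", "temp"]))
```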
                      Regression vs     Supervised vs    Parametric vs
                      Classification    Unsupervised     Non-Parametric
Linear Regression     Regression        Supervised       Parametric
Logistic Regression   Classification    Supervised       Parametric
k-NN                  either            Supervised       Non-Parametric
Decision Tree         ?                 ?                ?
Decision Tree

• A Decision Tree makes no distributional assumptions about the data.
• The number of parameters / the shape of the tree depends entirely on the data (i.e., imagine data that is perfectly separable into disjoint sections by features vs. data that is highly complex with overlapping values)
• Decision Trees make use of the full data (X and Y) and can handle Y values that are categorical or quantitative
                      Regression vs     Supervised vs    Parametric vs
                      Classification    Unsupervised     Non-Parametric
Linear Regression     Regression        Supervised       Parametric
Logistic Regression   Classification    Supervised       Parametric
k-NN                  either            Supervised       Non-Parametric
Decision Tree         either            Supervised       Non-Parametric
Your Data X

• Returning to our full dataset X, imagine we do not wish to leverage any particular column Y, but merely wish to transform the data into a smaller, useful representation $\hat{X}$:

$\hat{X} = f(X)$

[Full dataset X: the same 9-row table with all four columns Age, Temp, Rainy, Play]
[Figure: model map, now adding PCA]
Principal Component Analysis (PCA)

Refresher:
• PCA isn't a model per se but a procedure/technique to transform data, which may have correlated features, into a new, smaller set of uncorrelated features
• These new features, by design, are linear combinations of the original features, chosen so as to capture the most variance
• It is often useful to perform PCA on data before using models that explicitly use data values and distances between them (e.g., clustering)
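A short sketch of the transformation $\hat{X} = f(X)$ with scikit-learn; encoding the categorical columns as 0/1 and standardizing first are assumptions made just for this demo:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# Full toy dataset X (Rainy and Play encoded as 0/1); no Y column is used
X = np.array([[22, 91, 0, 1], [29, 89, 0, 1], [31, 56, 0, 0],
              [23, 71, 0, 1], [37, 72, 0, 1], [41, 83, 0, 1],
              [29, 97, 1, 1], [21, 64, 0, 0], [30, 68, 0, 1]], dtype=float)

# Standardize first, since PCA is sensitive to feature scales
X_std = StandardScaler().fit_transform(X)

pca = PCA(n_components=2)            # keep a smaller representation
X_hat = pca.fit_transform(X_std)     # X_hat = f(X): uncorrelated components

print(pca.explained_variance_ratio_)  # variance captured by each component
```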
                      Regression vs     Supervised vs    Parametric vs
                      Classification    Unsupervised     Non-Parametric
Linear Regression     Regression        Supervised       Parametric
Logistic Regression   Classification    Supervised       Parametric
k-NN                  either            Supervised       Non-Parametric
Decision Tree         either            Supervised       Non-Parametric
PCA                   neither           Unsupervised     Non-Parametric
Your Data X

• Returning to our full dataset X yet again, imagine we do not wish to leverage any particular column Y, but merely wish to discern patterns/groups of similar observations

[Full dataset X again: Age, Temp, Rainy, Play]
[Figure: model map, now adding Clustering]
Clustering

Refresher:
• There are many approaches to clustering (e.g., k-Means, hierarchical, DBSCAN)
• Regardless of the approach, we need to specify a distance metric (e.g., Euclidean, Manhattan)
• Performance: we can measure the intra-cluster and inter-cluster fit (i.e., silhouette score), along with an estimate that compares our clustering to the situation had our data been randomly generated (gap statistic)
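For concreteness, a k-Means sketch with a silhouette score, using scikit-learn; k = 3 is assumed only to echo the 3-cluster example that follows:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

X = np.array([[22, 91], [29, 89], [31, 56], [23, 71], [37, 72],
              [41, 83], [29, 97], [21, 64], [30, 68]], dtype=float)
X_std = StandardScaler().fit_transform(X)   # k-Means uses Euclidean distance

kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_std)
labels = kmeans.labels_

# Silhouette score: how tight each cluster is vs. how far it sits from others
print(silhouette_score(X_std, labels))
```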
Clustering

k-Means example:
• Although we are not explicitly using any column Y, one could imagine that the 3 resulting cluster labels are our Y's (the labels being classes 1, 2, and 3)
• Of course, we do not know these class labels ahead of time, as clustering is an unsupervised model
• Yet, one could imagine a narrative whereby our data points were generated by these 3 classes.

[Figure: visual representation of the 3 clusters]
Clustering

k-Means example:
• That is, we are flipping the modelling process on its head; instead of our traditional supervised modelling approach of trying to estimate $P(Y \mid X)$:
  • Imagine centroids for each of the 3 clusters $Y_k$. We assert that the data $X$ were generated from $Y$.
  • We can then estimate the joint probability $P(Y, X)$
• Assuming our data was generated from Gaussians centered at the 3 centroids, we can estimate the probability of the current situation: that the data X exists and has these class labels Y. This is a generative model.
• Generative models explicitly model the actual distribution of each class (e.g., the data and its cluster assignments).
• By contrast, supervised models are given some data X and want to calculate the probability of Y; they learn to discriminate between different possible values of Y (i.e., they learn a decision boundary).

[Figure: visual representation of the 3 clusters]
Generative vs Discriminative Models

To recap:
• By definition, a generative model is concerned with estimating the joint probability $P(Y, X)$
• By definition, a discriminative model is concerned with estimating the conditional probability $P(Y \mid X)$
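The two quantities are linked by the product rule (a standard identity, written out here for clarity rather than taken from the slides):

$$P(Y, X) = P(X \mid Y)\, P(Y) \qquad \text{(what a generative model estimates)}$$

$$P(Y \mid X) = \frac{P(X \mid Y)\, P(Y)}{P(X)} = \frac{P(Y, X)}{P(X)} \qquad \text{(what a discriminative model estimates)}$$

$$\hat{y} = \arg\max_{y}\; P(X \mid Y{=}y)\, P(Y{=}y) \qquad \text{(classifying with a generative model)}$$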
                      Regression vs     Supervised vs    Parametric vs     Generative vs
                      Classification    Unsupervised     Non-Parametric    Discriminative
Linear Regression     Regression        Supervised       Parametric        Discriminative
Logistic Regression   Classification    Supervised       Parametric        Discriminative
k-NN                  either            Supervised       Non-Parametric    Discriminative
Decision Tree         either            Supervised       Non-Parametric    Discriminative
PCA                   neither           Unsupervised     Non-Parametric    neither
Clustering            neither           Unsupervised     Non-Parametric    Generative

Notes:
• Linear Regression: given training X, learns to discriminate between possible Y values (quantitative)
• Logistic Regression: given training X, learns to discriminate between possible Y classes (categorical)
• k-NN: given training X, learns to discriminate between possible Y values (quantitative or categorical)
• Decision Tree: given training X, learns decision boundaries so as to discriminate between possible Y values (quantitative or categorical)
• PCA: a process, not a model, so it doesn't make sense to consider it a discriminative or generative model
• Clustering: in particular, k-Means is generative, as it can be seen as a special case of Gaussian Mixture Models
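To make the k-Means-as-GMM remark concrete, a hedged sketch with scikit-learn's GaussianMixture; spherical covariances are assumed here because that is the variant closest to k-Means:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

X = np.array([[22, 91], [29, 89], [31, 56], [23, 71], [37, 72],
              [41, 83], [29, 97], [21, 64], [30, 68]], dtype=float)

# One round (spherical) Gaussian per latent class: this explicitly models
# P(X | Y) and P(Y), i.e., the joint P(Y, X)
gmm = GaussianMixture(n_components=3, covariance_type="spherical",
                      random_state=0).fit(X)

print(gmm.predict(X))         # hard cluster labels, as k-Means would give
print(gmm.score_samples(X))   # log-likelihood of each point under the model
```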
• Returning to our data yet again, perhaps we've plotted our data X and see that it's non-linear
• Knowing how unnatural and finicky polynomial regression can be, we prefer to let our model learn its own non-linear function for each feature $x_j$

[Same 9-row table, now with Y = Temp and X = (Age, Rainy, Play)]
[Figure: model map, now adding GAMs]
Generalized Additive Models (GAMs)

Refresher:
[Figure: a scatter plot with a fitted curve; not our data, but imagine it's plotting age vs. temp]
Generalized Additive Models (GAMs)

Refresher:
• We can make the fitted line smoother by using a cubic spline or "B-spline"
• Imagine having 3 of these models: $f_1(\text{age})$, $f_2(\text{play})$, $f_3(\text{rainy})$
• We can then model Temp as: $\text{Temp} = \beta_0 + f_1(\text{age}) + f_2(\text{play}) + f_3(\text{rainy})$
Generalized Additive Models (GAMs)

[Figure: graphical and NN-format views, mapping inputs (age, play, rainy) through per-feature smooth functions into $\hat{y}$]

Mathematically:

$\hat{y} = \beta_0 + f_1(\text{age}) + f_2(\text{play}) + f_3(\text{rainy})$

• It is called an additive model because we calculate a separate $f_j$ for each $x_j$, and then add together all of their contributions.
• $f_j$ doesn't have to be a spline; it can be any regression model.
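A sketch of this additive model in code. scikit-learn has no dedicated GAM class, so this approximates one by expanding age in a B-spline basis and fitting a linear model on top; the knot count and 0/1 encodings are illustrative assumptions (a dedicated library such as pyGAM would be another option):

```python
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import SplineTransformer
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LinearRegression

# Predict Temp from (age, play, rainy); play/rainy encoded as 0/1
X = np.array([[22, 1, 0], [29, 1, 0], [31, 0, 0], [23, 1, 0], [37, 1, 0],
              [41, 1, 0], [29, 1, 1], [21, 0, 0], [30, 1, 0]], dtype=float)
y = np.array([91, 89, 56, 71, 72, 83, 97, 64, 68], dtype=float)

# f1(age) is a B-spline basis expansion; the 0/1 columns pass through as-is.
# A linear fit on the expanded features is then additive:
#   y_hat = beta_0 + f1(age) + f2(play) + f3(rainy)
additive = ColumnTransformer(
    [("spline_age", SplineTransformer(n_knots=4, degree=3), [0])],
    remainder="passthrough")

gam = make_pipeline(additive, LinearRegression()).fit(X, y)
print(gam.predict([[25, 1, 0]]))   # predicted Temp for a new observation
```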
Generalized Additive Models (GAMs)

PROS:
• Fits a non-linear function $f_j$ to each feature $x_j$
• Much easier than guessing polynomial terms and multinomial interaction terms
• The model is additive, allowing us to examine the effect of each $x_j$ on $Y$ by holding the other features $x_{k \neq j}$ constant
• The smoothness is easy to adjust

CONS:
• Restricted to being additive; important interactions may not be captured
• Providing interactions by hand (e.g., $f(\text{age}, \text{rainy})$) can only capture so much, a la multinomial interaction terms
                      Regression vs     Supervised vs    Parametric vs     Generative vs
                      Classification    Unsupervised     Non-Parametric    Discriminative
Linear Regression     Regression        Supervised       Parametric        Discriminative
Logistic Regression   Classification    Supervised       Parametric        Discriminative
k-NN                  either            Supervised       Non-Parametric    Discriminative
Decision Tree         either            Supervised       Non-Parametric    Discriminative
PCA                   neither           Unsupervised     Non-Parametric    neither
Clustering            neither           Unsupervised     Non-Parametric    Generative
GAMs                  either            Supervised       Parametric        Discriminative
• Returning to our data yet again, perhaps we've plotted our data X and see that it's non-linear
• We further suspect that there are complex interactions that cannot be represented by polynomial regression or GAMs
• We just want great results and don't care about interpretability

[Same X/Y table as before: Age, Temp, Rainy → Play]
[Figure: model map, now adding a Feed-Forward Neural Net]
Feed-Forward Neural Network

[Figure: NN-format diagram with inputs $(x_1, x_2, x_3)$, a hidden layer $(h_1, h_2, h_3)$, and output $\hat{y}$]

Mathematically:

$h_j = \dfrac{1}{1 + e^{-(w_{j0} + \sum_i w_{ji} x_i)}} = \sigma(w_j^\top x)$

$\hat{y} = \dfrac{1}{1 + e^{-(\beta_0 + \sum_j \beta_j h_j)}} = \sigma(\beta^\top h)$
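A minimal sketch of this architecture with scikit-learn's MLPClassifier; the 3-unit hidden layer and logistic activations mirror the equations above, while the scaling step and iteration cap are practical assumptions:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.neural_network import MLPClassifier

X = np.array([[22, 91, 0], [29, 89, 0], [31, 56, 0], [23, 71, 0],
              [37, 72, 0], [41, 83, 0], [29, 97, 1], [21, 64, 0],
              [30, 68, 0]], dtype=float)
y = np.array([1, 1, 0, 1, 1, 1, 1, 0, 1])   # Play (1 = Y)

# One hidden layer of 3 sigmoid units feeding a sigmoid output:
#   h_j = sigma(w_j . x),  y_hat = sigma(beta . h)
net = make_pipeline(
    StandardScaler(),
    MLPClassifier(hidden_layer_sizes=(3,), activation="logistic",
                  max_iter=5000, random_state=0))
net.fit(X, y)

print(net.predict_proba([[25, 80, 0]])[:, 1])   # P(Play = Y | x)
```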