
The Roadmap: a recap of where we've been, where we're heading, and how it's all related

Harvard IACS CS109B. Chris Tanner, Pavlos Protopapas, Mark Glickman.
Learning objectives: recap models from CS109A and CS109B; understand the …


  1. Returning to our data, let's model Play instead of Temp.
     • Again, we divide our data into X and Y and learn how data X is related to data Y
     • Again, assert: Y = f(X) + ε

     Age  Temp  Rainy  Play
     22   91    N      Y
     29   89    N      Y
     31   56    N      N
     23   71    N      Y
     37   72    N      Y
     41   83    N      Y
     29   97    Y      Y
     21   64    N      N
     30   68    N      Y

  2. Returning to our data, let's model Play instead of Temp (same Age/Temp/Rainy/Play table as slide 1).
     • Again, we divide our data into X and Y and learn how data X is related to data Y
     • Again, assert: Y = f(X) + ε
     • Want a model that is:
       • Supervised
       • Predicts categories/classes (a classification model)
     • Q: What model could we use?

  3. [Roadmap: Age/Temp/Play data → Linear Regression]

  4. [Roadmap: Age/Temp/Play data → Linear Regression, Logistic Regression]

  5. Logistic Regression, high-level: [diagram: X → f → ŷ]

  6. Logistic Regression, shown graphically (NN format) and mathematically:
     ŷ = P(y = 1) = 1 / (1 + e^(−(β₀ + β₁x₁ + β₂x₂ + β₃x₃))) = σ(βᵀx)
     [Diagram: inputs x₁, x₂, x₃ (e.g., 22, 91, Y) feed a single sigmoid unit that outputs ŷ ∈ {Y, N}]

  7. Logistic Regression
     This is a non-linear activation function, called a sigmoid. Yet, our overall model is still considered linear w.r.t. the β coefficients. It's a generalized linear model.
     ŷ = 1 / (1 + e^(−(β₀ + β₁x₁ + β₂x₂ + β₃x₃))) = σ(βᵀx)

  8. Logistic Regression
     Q1: When training our model, how do we measure its predictions ŷ?
     A1: Cost function J(β) = −[y log(ŷ) + (1 − y) log(1 − ŷ)], with ŷ = σ(βᵀx)
         "Cross-Entropy", aka "Log loss"
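
A minimal NumPy sketch of this loss, assuming ŷ = σ(βᵀx); the labels and logits below are made-up toy values, not the slide's data:

```python
import numpy as np

def sigmoid(z):
    # sigma(z) = 1 / (1 + e^(-z))
    return 1.0 / (1.0 + np.exp(-z))

def cross_entropy(y, y_hat, eps=1e-12):
    # J(beta) = -[y log(y_hat) + (1 - y) log(1 - y_hat)], averaged over samples
    y_hat = np.clip(y_hat, eps, 1 - eps)  # avoid log(0)
    return -np.mean(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

# Toy example: 3 true labels and the model's predicted probabilities
y = np.array([1, 0, 1])
y_hat = sigmoid(np.array([2.0, -1.5, 0.3]))  # sigma(beta^T x) for 3 points
print(cross_entropy(y, y_hat))
```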

  9. Logistic Regression
     Q1: When training our model, how do we measure its predictions ŷ?
     A1: Cost function J(β) = −[y log(ŷ) + (1 − y) log(1 − ŷ)], with ŷ = σ(βᵀx)
         "Cross-Entropy", aka "Log loss"
     Q2: How do we find the optimal β so that we yield the best predictions?
     A2: scikit-learn has many optimization solvers (e.g., liblinear, newton-cg, saga, etc.)
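
A sketch of fitting this model with scikit-learn on the slide's table, assuming we hand-encode Rainy and Play as 0/1; the query point at the end is made up for illustration:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# The Age/Temp/Rainy features from the slides, with Rainy encoded as 0/1
X = np.array([[22, 91, 0], [29, 89, 0], [31, 56, 0],
              [23, 71, 0], [37, 72, 0], [41, 83, 0],
              [29, 97, 1], [21, 64, 0], [30, 68, 0]])
y = np.array([1, 1, 0, 1, 1, 1, 1, 0, 1])  # Play, mapped N -> 0, Y -> 1

# Any of the solvers named above works; liblinear is fine for tiny data
model = LogisticRegression(solver="liblinear").fit(X, y)
print(model.predict_proba([[25, 80, 0]]))  # [P(Play=0), P(Play=1)]
```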

  10. Logistic Regression: fitted model example. The separating plane is chosen to minimize the error between our class probabilities (per our loss function, cross-entropy) and the true labels (mapped to 0 or 1). Photo from http://strijov.com/sources/demoDataGen.php (Dr. Vadim Strijov)

  11. Parametric Models
      • So far, we've assumed our data X and Y can be represented by an underlying model f (i.e., Y = f(X) + ε) that has a particular form (e.g., a linear relationship, hence our using a linear model)
      • Next, we aimed to fit the model f by estimating its parameters β (we did so in a supervised manner)

  12. Parametric Models
      • So far, we've assumed our data X and Y can be represented by an underlying model f (i.e., Y = f(X) + ε) that has a particular form (e.g., a linear relationship, hence our using a linear model)
      • Next, we aimed to fit the model f by estimating its parameters β (we did so in a supervised manner)
      • Parametric models make the above assumptions. Namely, that there exists an underlying model f that has a fixed number of parameters.

  13.
      Model                Regression vs    Supervised vs   Parametric vs
                           Classification   Unsupervised    Non-Parametric
      Linear Regression    Regression       Supervised      Parametric
      Logistic Regression  Classification   Supervised      Parametric

  14. Non-Parametric Models
      Alternatively, what if we make no assumptions about the underlying model f? Specifically, let's not assume f:
      • has any particular distribution/shape (e.g., Gaussian, linear relationship, etc.)
      • can be represented by a finite number of parameters.

  15. Non-Parametric Models
      Alternatively, what if we make no assumptions about the underlying model f? Specifically, let's not assume f:
      • has any particular distribution/shape (e.g., Gaussian, linear relationship, etc.)
      • can be represented by a finite number of parameters.
      This would constitute a non-parametric model.

  16. Non-Parametric Models
      • Non-parametric models are allowed to have parameters; in fact, the number of parameters often grows as our amount of training data increases
      • Since they make no strong assumptions about the form of the function/model, they are free to learn any functional form from the training data, which can be infinitely complex

  17. Returning to our data, let's again predict if a person will Play (same Age/Temp/Rainy/Play table as slide 1).
      • If we do not want to assume anything about how X and Y relate, we could use a different supervised model
      • Suppose we do not care to build a decision boundary but merely want to make predictions based on similar data that we saw during training

  18. [Roadmap: Age/Temp/Play data → Linear Regression, Logistic Regression]

  19. [Roadmap: Age/Temp/Play data → Linear Regression, Logistic Regression, k-NN]

  20. k-NN
      Refresher:
      • k-NN doesn't train a model; ŷ = f(x) is computed directly from the training data
      • One merely specifies a k value
      • At test time, a new piece of data x:
        • must be compared to all other training data x′, to determine its k nearest neighbors, per some distance metric d(x, x′)
        • is classified as being the majority class (if categorical) or average (if quantitative) of its k neighbors
      [Diagram: inputs x₁, x₂, x₃ (e.g., 22, 91, Y) and output ŷ]
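
A sketch with scikit-learn's k-NN, reusing the encoded table from the logistic example above; k and the distance metric are the two choices the refresher names, and the query point is illustrative:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

# Same encoded Age/Temp/Rainy features and Play labels as before
X = np.array([[22, 91, 0], [29, 89, 0], [31, 56, 0],
              [23, 71, 0], [37, 72, 0], [41, 83, 0],
              [29, 97, 1], [21, 64, 0], [30, 68, 0]])
y = np.array([1, 1, 0, 1, 1, 1, 1, 0, 1])

# "Fitting" k-NN just stores the training data; choose k=3 and Euclidean
# distance (in practice, scale the features first; see the CONS slide below)
knn = KNeighborsClassifier(n_neighbors=3, metric="euclidean").fit(X, y)
print(knn.predict([[25, 80, 0]]))  # majority vote of the 3 nearest neighbors
```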

  21. k-NN
      Conclusion:
      • k-NN makes no assumptions about the data X or the form of f(X)
      • k-NN is a non-parametric model

  22. k-NN
      PROS:
      • Intuitive and simple approach
      • Can model any type of data / places no assumptions on the data
      • Fairly robust to missing data
      • Good for highly sparse data (e.g., user data, where the columns are thousands of potential items of interest)
      CONS:
      • Can be very computationally expensive if the data is large or high-dimensional
      • Should carefully think about features, including scaling them
      • Mixing quantitative and categorical data can be tricky
      • Interpretation isn't meaningful
      • Often, regression models are better, especially with little data

  23.
      Model                Regression vs    Supervised vs   Parametric vs
                           Classification   Unsupervised    Non-Parametric
      Linear Regression    Regression       Supervised      Parametric
      Logistic Regression  Classification   Supervised      Parametric
      k-NN                 either           Supervised      Non-Parametric

  24. Returning to our data yet again, let's predict if a person will Play (same table as slide 1).
      • If we do not want to assume anything about how X and Y relate, believing that no single equation can model the possibly non-linear relationship
      • Suppose we just want our model to have robust decision boundaries with interpretable results

  25. [Roadmap: Age/Temp/Play data → Linear Regression, Logistic Regression, k-NN]

  26. [Roadmap: Age/Temp/Play data → Linear Regression, Logistic Regression, k-NN, Decision Tree]

  27. Decision Tree
      Refresher:
      • A Decision Tree iteratively determines how to split our data by the best feature value so as to minimize the entropy (uncertainty) of the resulting sets
      • Must specify the:
        • Splitting criterion (e.g., Gini index, Information Gain)
        • Stopping criterion (e.g., tree depth, Information Gain threshold)

  28. Decision Tree Refresher: Each comparison and branching represents splitting a region in the feature space on a single feature. Typically, at each iteration, we split once along one dimension (one predictor).
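
A sketch with scikit-learn, where criterion and max_depth stand in for the splitting and stopping criteria named in the refresher; the data is the same encoded table used in the earlier examples:

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

X = np.array([[22, 91, 0], [29, 89, 0], [31, 56, 0],
              [23, 71, 0], [37, 72, 0], [41, 83, 0],
              [29, 97, 1], [21, 64, 0], [30, 68, 0]])
y = np.array([1, 1, 0, 1, 1, 1, 1, 0, 1])

# criterion = splitting criterion; max_depth = stopping criterion
tree = DecisionTreeClassifier(criterion="gini", max_depth=2).fit(X, y)

# Print the learned splits, one feature per branch, as the slide describes
print(export_text(tree, feature_names=["Age", "Temp", "Rainy"]))
```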

  29.
      Model                Regression vs    Supervised vs   Parametric vs
                           Classification   Unsupervised    Non-Parametric
      Linear Regression    Regression       Supervised      Parametric
      Logistic Regression  Classification   Supervised      Parametric
      k-NN                 either           Supervised      Non-Parametric
      Decision Tree        ?                ?               ?

  30. Decision Tree
      • A Decision Tree makes no distributional assumptions about the data
      • The number of parameters / shape of the tree depends entirely on the data (i.e., imagine data that is perfectly separable into disjoint sections by features, vs data that is highly complex with overlapping values)
      • Decision Trees make use of the full data (X and Y) and can handle Y values that are categorical or quantitative

  31.
      Model                Regression vs    Supervised vs   Parametric vs
                           Classification   Unsupervised    Non-Parametric
      Linear Regression    Regression       Supervised      Parametric
      Logistic Regression  Classification   Supervised      Parametric
      k-NN                 either           Supervised      Non-Parametric
      Decision Tree        either           Supervised      Non-Parametric

  32. Your Data X: returning to our full dataset X (same Age/Temp/Rainy/Play table as slide 1), imagine we do not wish to leverage any particular column Y, but merely wish to transform the data into a smaller, useful representation:
      X′ = g(X)

  33. [Roadmap: Age/Temp/Play data → Linear Regression, Logistic Regression, k-NN, Decision Tree]

  34. [Roadmap: Age/Temp/Play data → Linear Regression, Logistic Regression, k-NN, Decision Tree, PCA]

  35. Principal Component Analysis (PCA)
      Refresher:
      • PCA isn't a model per se but is a procedure/technique to transform data, which may have correlated features, into a new, smaller set of uncorrelated features
      • These new features, by design, are a linear combination of the original features so as to capture the most variance
      • Often useful to perform PCA on data before using models that explicitly use data values and distances between them (e.g., clustering)
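
A sketch with scikit-learn on the same encoded table; standardizing first is an assumption worth stating, since PCA is sensitive to feature scale:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = np.array([[22, 91, 0], [29, 89, 0], [31, 56, 0],
              [23, 71, 0], [37, 72, 0], [41, 83, 0],
              [29, 97, 1], [21, 64, 0], [30, 68, 0]])

Xs = StandardScaler().fit_transform(X)  # PCA is sensitive to feature scale
pca = PCA(n_components=2)               # smaller set of uncorrelated features
X_new = pca.fit_transform(Xs)           # each column is a linear combination
print(pca.explained_variance_ratio_)    # variance captured by each component
```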

  36.
      Model                Regression vs    Supervised vs   Parametric vs
                           Classification   Unsupervised    Non-Parametric
      Linear Regression    Regression       Supervised      Parametric
      Logistic Regression  Classification   Supervised      Parametric
      k-NN                 either           Supervised      Non-Parametric
      Decision Tree        either           Supervised      Non-Parametric
      PCA                  neither          Unsupervised    Non-Parametric

  37. Your Data X: returning to our full dataset X yet again (same table as slide 1), imagine we do not wish to leverage any particular column Y, but merely wish to discern patterns/groups of similar observations.

  38. [Roadmap: Age/Temp/Play data → Linear Regression, Logistic Regression, k-NN, Decision Tree, PCA]

  39. [Roadmap: Age/Temp/Play data → Linear Regression, Logistic Regression, k-NN, Decision Tree, PCA, Clustering]

  40. Clustering
      Refresher:
      • There are many approaches to clustering (e.g., k-Means, hierarchical, DBSCAN)
      • Regardless of the approach, we need to specify a distance metric (e.g., Euclidean, Manhattan)
      • Performance: we can measure the intra-cluster and inter-cluster fit (i.e., silhouette score), along with an estimate that compares our clustering to what we would expect had our data been randomly generated (the gap statistic)
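
A k-Means sketch with scikit-learn, including the silhouette score the refresher mentions; scaling first is an assumption, since the method is distance-based (with only 9 points the fit is purely illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
from sklearn.preprocessing import StandardScaler

X = np.array([[22, 91, 0], [29, 89, 0], [31, 56, 0],
              [23, 71, 0], [37, 72, 0], [41, 83, 0],
              [29, 97, 1], [21, 64, 0], [30, 68, 0]])
Xs = StandardScaler().fit_transform(X)  # distance-based, so scale first

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(Xs)
print(km.labels_)                        # cluster assignment per observation
print(silhouette_score(Xs, km.labels_))  # intra- vs inter-cluster fit
```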

  41. Clustering
      k-Means example:
      • Although we are not explicitly using any column Y, one could imagine that the 3 resulting cluster labels are our Y's (the labels being classes 1, 2, and 3)
      • Of course, we do not know these class labels ahead of time, as clustering is an unsupervised model
      [Visual representation: scatter plot with clusters labeled 1, 2, 3]

  42. Clustering
      k-Means example:
      • Although we are not explicitly using any column Y, one could imagine that the 3 resulting cluster labels are our Y's (the labels being classes 1, 2, and 3)
      • Of course, we do not know these class labels ahead of time, as clustering is an unsupervised model
      • Yet, one could imagine a narrative whereby our data points were generated by these 3 classes
      [Visual representation: scatter plot with clusters labeled 1, 2, 3]

  43. Clustering
      k-Means example:
      • That is, we are flipping the modelling process on its head; instead of our traditional supervised modelling approach of trying to estimate P(Y|X):
        • Imagine centroids for each of the 3 clusters Yₖ. We assert that the data X were generated from Y.
        • We can estimate the joint probability P(Y, X)
      [Visual representation: 3 clusters with centroids]

  44. Clustering, k-Means example (continued from slide 43):
      Callout: Assuming our data was generated from Gaussians centered at the 3 centroids, we can estimate the probability of the current situation: that the data X exists and has the given class labels Y. This is a generative model.

  45. Clustering, k-Means example (continued):
      Callout: Generative models explicitly model the actual distribution of each class (e.g., the data and its cluster assignments).

  47. Clustering, k-Means example (continued):
      Callout: Supervised models are given some data X and want to calculate the probability of Y. They learn to discriminate between different values of possible Y's (i.e., they learn a decision boundary).

  48. Generative vs Discriminative Models
      To recap:
      • By definition, a generative model is concerned with estimating the joint probability P(Y, X)
      • By definition, a discriminative model is concerned with estimating the conditional probability P(Y|X)

  49.
      Model                Regression vs    Supervised vs   Parametric vs    Generative vs
                           Classification   Unsupervised    Non-Parametric   Discriminative
      Linear Regression    Regression       Supervised      Parametric       Discriminative
      Logistic Regression  Classification   Supervised      Parametric       Discriminative
      k-NN                 either           Supervised      Non-Parametric   Discriminative
      Decision Tree        either           Supervised      Non-Parametric   Discriminative
      PCA                  neither          Unsupervised    Non-Parametric   neither
      Clustering           neither          Unsupervised    Non-Parametric   Generative

  50. (Same table as slide 49.)
      Callout on Clustering: particularly, k-Means is generative, as it can be seen as a special case of Gaussian Mixture Models.
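
Since the callout frames k-Means as a special case of Gaussian Mixture Models, here is a hedged GMM sketch with scikit-learn; the generative view is that score_samples gives the log-probability of each point under the fitted density (the tiny table is only there to show the API's shape):

```python
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

X = np.array([[22, 91, 0], [29, 89, 0], [31, 56, 0],
              [23, 71, 0], [37, 72, 0], [41, 83, 0],
              [29, 97, 1], [21, 64, 0], [30, 68, 0]])
Xs = StandardScaler().fit_transform(X)

# Spherical Gaussians: shrink their variances toward zero and the soft
# assignments harden into exactly the k-Means assignments
gmm = GaussianMixture(n_components=3, covariance_type="spherical",
                      random_state=0).fit(Xs)
print(gmm.predict_proba(Xs))  # soft responsibilities, P(cluster | x)
print(gmm.score_samples(Xs))  # log-likelihood of each x under the model
```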

  51. (Same table as slide 49.)
      Callout on Linear Regression: given training X, learns to discriminate between possible Y values (quantitative).

  52. (Same table as slide 49.)
      Callout on Logistic Regression: given training X, learns to discriminate between possible Y classes (categorical).

  53. (Same table as slide 49.)
      Callout on k-NN: given training X, learns to discriminate between possible Y values (quantitative or categorical).

  54. (Same table as slide 49.)
      Callout on Decision Tree: given training X, learns decision boundaries so as to discriminate between possible Y values (quantitative or categorical).

  55. (Same table as slide 49.)
      Callout on PCA: PCA is a process, not a model, so it doesn't make sense to consider it a Discriminative or Generative model.

  56. (Same table as slide 49, now complete.)

  57. Returning to our data yet again (same table as slide 1), perhaps we've plotted our data X and see it's non-linear.
      • Knowing how unnatural and finicky polynomial regression can be, we prefer to let our model learn how to make its own non-linear function for each feature xⱼ

  58. [Roadmap: Age/Temp/Play data → Linear Regression, Logistic Regression, k-NN, Decision Tree, PCA, Clustering]

  59. [Roadmap: Age/Temp/Play data → Linear Regression, Logistic Regression, k-NN, Decision Tree, PCA, Clustering, GAMs]

  60. Generalized Additive Models (GAMs)
      Refresher: [Figure: scatter plot; not our data, but imagine it's plotting age vs temp]

  61. Generalized Additive Models (GAMs)
      Refresher: We can make the line smoother by using a cubic spline or "B-spline".
      [Figure: smoothed curve; not our data, but imagine it's plotting age vs temp]
      • Imagine having 3 of these models: f₁(age), f₂(play), f₃(rainy)
      • We can model Temp as: Temp = β₀ + f₁(age) + f₂(play) + f₃(rainy)

  62. Generalized Additive Models (GAMs), graphically (NN format) and mathematically:
      ŷ = β₀ + f₁(age) + f₂(play) + f₃(rainy)
      [Diagram: inputs x₁, x₂, x₃ (e.g., 22, N, Y), each passed through its own smooth function fⱼ, summed to produce ŷ (e.g., Temp = 91)]

  63. Generalized Additive Models (GAMs): it is called an additive model because we calculate a separate fⱼ for each xⱼ, and then add together all of their contributions:
      ŷ = β₀ + f₁(age) + f₂(play) + f₃(rainy)

  64. Generalized Additive Models (GAMs): fⱼ doesn't have to be a spline; it can be any regression model:
      ŷ = β₀ + f₁(age) + f₂(play) + f₃(rainy)

  65. Generalized Additive Models (GAMs)
      PROS:
      • Fits a non-linear function fⱼ to each feature xⱼ
      • Much easier than guessing polynomial terms and multinomial interaction terms
      • Model is additive, allowing us to examine the effects of each xⱼ on Y by holding the other features xₖ (k ≠ j) constant
      • The smoothness is easy to adjust
      CONS:
      • Restricted to being additive; important interactions may not be captured
      • Providing interactions via f(age, rainy) can only capture so much, a la multinomial interaction terms
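
A sketch of the Temp model above, assuming the third-party pyGAM library: a spline term for age and factor terms for the binary play/rainy columns. With a table this small the fit is only illustrative:

```python
import numpy as np
from pygam import LinearGAM, s, f  # third-party: pip install pygam

# Columns: age, play (0/1), rainy (0/1); response: temp
X = np.array([[22, 1, 0], [29, 1, 0], [31, 0, 0],
              [23, 1, 0], [37, 1, 0], [41, 1, 0],
              [29, 1, 1], [21, 0, 0], [30, 1, 0]])
y = np.array([91, 89, 56, 71, 72, 83, 97, 64, 68])

# Temp = beta_0 + f1(age) + f2(play) + f3(rainy):
# s() is a smooth spline term; f() is a factor term for categorical features
gam = LinearGAM(s(0, n_splines=5) + f(1) + f(2)).fit(X, y)
gam.summary()  # per-term significance and effective degrees of freedom
```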

  66. (Same table as slide 49, plus a new row:)
      GAMs                 either           Supervised      Parametric       Discriminative

  67. Returning to our data yet again (same table as slide 1), perhaps we've plotted our data X and see it's non-linear.
      • We further suspect that there are complex interactions that cannot be represented by polynomial regression and GAMs
      • We just want great results and don't care about interpretability

  68. [Roadmap: Age/Temp/Play data → Linear Regression, Logistic Regression, k-NN, Decision Tree, PCA, Clustering, GAMs]

  69. [Roadmap: Age/Temp/Play data → Linear Regression, Logistic Regression, k-NN, Decision Tree, PCA, Clustering, GAMs, Feed-Forward Neural Net]

  70. Feed-Forward Neural Network, high-level and mathematically:
      hⱼ = 1 / (1 + e^(−(w₀ + w₁x₁ + w₂x₂ + w₃x₃))) = σ(w⁽¹⁾x)
      ŷ = 1 / (1 + e^(−(w₀ + w₁h₁ + w₂h₂))) = σ(w⁽²⁾h)
      [Diagram: inputs x₁, x₂, x₃ (e.g., 22, N, Y) feed a hidden layer of sigmoid units hⱼ, which feeds a single sigmoid output ŷ]
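
A NumPy sketch of the forward pass above, with made-up random weights and one hand-encoded observation; the shapes (3 inputs, 2 hidden sigmoid units, 1 sigmoid output) are an assumption matching the two equations:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(2, 3)), np.zeros(2)  # layer 1: w(1), 3 inputs -> 2 hidden
W2, b2 = rng.normal(size=(1, 2)), np.zeros(1)  # layer 2: w(2), 2 hidden -> 1 output

x = np.array([22.0, 0.0, 1.0])  # one observation, binary columns encoded 0/1
h = sigmoid(W1 @ x + b1)        # h_j = sigma(w(1) . x)
y_hat = sigmoid(W2 @ h + b2)    # y_hat = sigma(w(2) . h)
print(y_hat)                    # training would then minimize cross-entropy
```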
