

  1. CSE217 INTRODUCTION TO DATA SCIENCE LECTURE 8: SIMILARITY-BASED PREDICTION Spring 2019 Marion Neumann

  2. RECAP: CLUSTERING • Good clustering: • high similarity within each group • low similarity across the groups → minimize the distance of each data point to its cluster center → we learn the grouping from the data based on similarities • no labels (no supervision)

  3. SIMILARITIES FOR SUPERVISED ML • oftentimes clusters are used for prediction tasks • cluster news articles → recommend articles from the same group

  4. SIMILARITIES FOR SUPERVISED ML • What if we had class labels for the prediction task? • train a classifier on labelled news articles → recommend articles with a positive predicted label

  5. SIMILARITIES FOR CLASSIFICATION • New idea: combine both approaches • use similarities to predict the class label directly, without computing clusters first • possible since we have observed class labels in our training data (supervised learning) [Figure: k-NN classification]

  6. SIMILARITIES FOR REGRESSION • This also works for regression: predict the average price among the k nearest neighbors (e.g., the 3 NNs) [Figure: k-NN regression]

  7. K-NEAREST NEIGHBOR MODEL • Prediction • classification: f(x) = most common class label among the k nearest neighbors, i.e. f(x) = mode({y_i : i ∈ N_k(x)}) • regression: f(x) = (1/k) · Σ_{i ∈ N_k(x)} y_i • where N_k(x) is the set of k nearest neighbors of x
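A minimal NumPy sketch of both prediction rules, assuming Euclidean distance and NumPy arrays as inputs; the function name knn_predict is my own, not from the slides:

```python
import numpy as np
from collections import Counter

def knn_predict(x, X_train, y_train, k, task="classification"):
    # N_k(x): indices of the k training points closest to x
    dists = np.linalg.norm(X_train - x, axis=1)
    nn = np.argsort(dists)[:k]
    if task == "classification":
        # f(x) = mode({y_i : i in N_k(x)}): majority vote among neighbors
        return Counter(y_train[nn]).most_common(1)[0][0]
    # f(x) = (1/k) * sum of the neighbors' targets
    return y_train[nn].mean()
```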

  8. K-NEAREST NEIGHBOR MODEL • Algorithm to find the k NNs of a test input x • INPUT: k, test input x, training inputs x_1, …, x_n • compute the distances d_i = dist(x, x_i) • take the first k data points as the initial k-NN set W (indices only) • max_d = max_{i ∈ W} d_i, id = argmax_{i ∈ W} d_i • FOR i = k+1, …, n: IF d_i < max_d THEN replace id in W with i and recompute max_d and id • RETURN W
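A sketch of this running-replacement search in Python; the name knn_indices and the Euclidean distance are my assumptions:

```python
import numpy as np

def knn_indices(x, X_train, k):
    """Indices of the k training points closest to x."""
    dists = np.linalg.norm(X_train - x, axis=1)

    W = list(range(k))                        # first k points seed the NN set
    id_max = max(W, key=lambda j: dists[j])   # current farthest neighbor

    for i in range(k, len(X_train)):
        if dists[i] < dists[id_max]:          # closer than the farthest NN?
            W[W.index(id_max)] = i            # swap it into the set
            id_max = max(W, key=lambda j: dists[j])
    return W
```

Note the design choice the slide's algorithm reflects: rather than sorting all n distances, it keeps only the k best candidates seen so far and replaces the current farthest one whenever a closer point appears.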

  9. K-NN DISCUSSION • Pros: • quick training (lazy learner: nothing is fit) • simple and explainable • the same method works for regression and classification • handles multi-class classification naturally • Cons: • slow at test time • have to select k • need to store the entire training data D for test predictions → huge model size

  10. HOW TO SET K? • model selection: keep D_TE for evaluation only and split the rest into D_TR and D_VAL • use the validation set: • FOR k = 1, …, k_max: train k-NN on D_TR, predict ŷ for each x in X_val, and record perf(k) on the validation set • select k* = argmax_k perf(k)
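A sketch of this validation-set selection with scikit-learn; the toy data, split sizes, k_max = 20, and accuracy as perf(k) are all assumptions, since the slide only names D_TR, D_VAL, D_TE, and perf(k):

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = np.random.rand(200, 2), np.random.randint(0, 2, 200)  # toy data

# hold out D_TE for final evaluation, then split off D_VAL from the rest
X_rest, X_te, y_rest, y_te = train_test_split(X, y, test_size=0.2)
X_tr, X_val, y_tr, y_val = train_test_split(X_rest, y_rest, test_size=0.25)

perf = {}
for k in range(1, 21):                        # k = 1, ..., k_max
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    perf[k] = knn.score(X_val, y_val)         # accuracy on D_VAL

k_star = max(perf, key=perf.get)              # argmax_k perf(k)
```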

  11. CROSS-VALIDATION (CV) • in-class discussion: using a fixed validation split D_TR/D_VAL has issues (the chosen k depends on one particular split) • solution: perform cross-validation instead • FOR k = 1, …, k_max: • FOR f = 1, …, num_folds: train k-NN on D_TR leaving out fold f, evaluate on the held-out fold → perf(k, f) • perf(k) = avg_f perf(k, f) • select k* = argmax_k perf(k) • CV on D_TR/D_TE can also be used for model comparison
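The same selection with cross-validation in place of a single split; a sketch where the 5 folds, toy data, and accuracy scoring are my assumptions:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

X, y = np.random.rand(200, 2), np.random.randint(0, 2, 200)  # toy data

perf = {}
for k in range(1, 21):
    # scores holds perf(k, f) for each held-out fold f
    scores = cross_val_score(KNeighborsClassifier(n_neighbors=k), X, y, cv=5)
    perf[k] = scores.mean()                   # perf(k) = average over folds

k_star = max(perf, key=perf.get)              # argmax_k perf(k)
```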

  12. SIMILARITY-BASED METHODS ACTIVITY 2 • k-NN classification or regression • Clustering/k-means • If a variable is measured on a larger scale than the other variables, then whatever distance measure we use will be overly influenced by that variable (see the sketch below).
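A quick numeric check of that claim; the feature values are made up for illustration:

```python
import numpy as np

a = np.array([0.0, 100.0])    # feature 2 lives on a much larger scale
b = np.array([1.0, 110.0])    # differs noticeably in both features
c = np.array([0.1, 200.0])    # barely differs in feature 1

print(np.linalg.norm(a - b))  # ~10.05
print(np.linalg.norm(a - c))  # ~100.0: feature 2 dominates the distance
```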

  13. DATA TRANSFORMATIONS • Min-Max scaling: x' = (x − min) / (max − min) • Centering: x' = x − mean • Standardization: x' = (x − mean) / std [Figure: original vs. min-max scaled vs. standardized data]
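A minimal NumPy sketch of the three transformations; applying them column-wise to a feature matrix X is my assumption:

```python
import numpy as np

X = np.array([[1.0, 200.0],
              [2.0, 400.0],
              [3.0, 600.0]])

x_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
x_centered = X - X.mean(axis=0)
x_standardized = (X - X.mean(axis=0)) / X.std(axis=0)
```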

  14. SUMMARY & READING • K-NN is an extremely simple and versatile model for supervised machine learning. • K-NN is a lazy learner: we do not learn/train a model, we simply use the data directly for predictions. • Cross-validation is a better way to evaluate ML models or to perform model selection. • [DSFS] • Ch12: k-NN • Ch10: Working with data → Rescaling (p. 132-133) • [PDSH] • Ch5: Hyperparameters and Model Validation • Thinking about Model Validation [cross-validation] (p. 359-362)
