CS6220: Data Mining Techniques
Matrix Data: Prediction
Instructor: Yizhou Sun (yzsun@ccs.neu.edu)
January 19, 2016
Announcements
• Team formation due next Wednesday
• Homework 1 out by tomorrow
Today’s Schedule
• Course Project Introduction
• Linear Regression Model
• Decision Tree
Methods to Learn (by task and data type)
• Classification: Decision Tree; Naïve Bayes; Logistic Regression; SVM; kNN (matrix data); HMM (sequence data); Label Propagation (graph & network); Neural Network (images)
• Clustering: K-means; hierarchical clustering; DBSCAN; Mixture Models; kernel k-means* (matrix data); PLSA (text data); SCAN; Spectral Clustering (graph & network)
• Frequent Pattern Mining: Apriori; FP-growth (set data); GSP; PrefixSpan (sequence data)
• Prediction: Linear Regression; Collaborative Filtering (matrix data); Autoregression (time series)
• Similarity Search: DTW (time series); P-PageRank (graph & network)
• Ranking: PageRank (graph & network)
How to Learn These Algorithms?
• Three levels:
  • When is it applicable?
    • Input, output, strengths, weaknesses, time complexity
  • How does it work?
    • Pseudo-code, workflows, major steps
    • Can work out a toy problem by pen and paper
  • Why does it work?
    • Intuition, philosophy, objective, derivation, proof
Matrix Data: Prediction
• Matrix Data
• Linear Regression Model
• Model Evaluation and Selection
• Summary
Example
• An $n \times p$ matrix:

$$\begin{pmatrix} x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & & \vdots & & \vdots \\ x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & & \vdots & & \vdots \\ x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{pmatrix}$$

• n data objects / points (rows)
• p attributes / dimensions (columns)
Attribute Type
• Numerical
  • E.g., height, income
• Categorical / discrete
  • E.g., sex, race
Categorical Attribute Types
• Nominal: categories, states, or “names of things”
  • E.g., Hair_color = {auburn, black, blond, brown, grey, red, white}
  • Marital status, occupation, ID numbers, zip codes
• Binary: nominal attribute with only 2 states (0 and 1)
  • Symmetric binary: both outcomes equally important
    • E.g., gender
  • Asymmetric binary: outcomes not equally important
    • E.g., medical test (positive vs. negative)
    • Convention: assign 1 to the more important outcome (e.g., HIV positive)
• Ordinal: values have a meaningful order (ranking), but the magnitude between successive values is not known
  • E.g., Size = {small, medium, large}, grades, army rankings
Matrix Data: Prediction
• Matrix Data
• Linear Regression Model
• Model Evaluation and Selection
• Summary
Linear Regression
• Ordinary Least Squares (OLS) Regression
  • Closed-form solution
  • Gradient descent
• Linear Regression with Probabilistic Interpretation
The Linear Regression Problem
• Map any attributes to a continuous value: $\mathbf{x} \Rightarrow y$
  • {age; major; gender; race} ⇒ GPA
  • {income; credit score; profession} ⇒ loan
  • {college; major; GPA} ⇒ future income
  • ...
Illustration (figure omitted)
Formalization
• Data: n independent data objects
  • $y_i$, $i = 1, \ldots, n$
  • $\mathbf{x}_i = (x_{i0}, x_{i1}, x_{i2}, \ldots, x_{ip})^T$, $i = 1, \ldots, n$
  • A constant factor is added to model the bias term, i.e., $x_{i0} = 1$
• Model:
  • $y$: dependent variable
  • $\mathbf{x}$: explanatory variables
  • $\boldsymbol{\beta} = (\beta_0, \beta_1, \ldots, \beta_p)^T$: weight vector
  • $y = \mathbf{x}^T \boldsymbol{\beta} = \beta_0 + x_1\beta_1 + x_2\beta_2 + \cdots + x_p\beta_p$
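A minimal NumPy sketch of this setup with made-up toy values; the only step the slide prescribes is prepending the constant column $x_{i0} = 1$:

```python
import numpy as np

# Toy data: n = 4 objects, p = 2 attributes (values are illustrative).
X_raw = np.array([[2.0, 3.0],
                  [1.0, 0.5],
                  [4.0, 1.5],
                  [3.0, 2.0]])

# Prepend a column of ones so that beta_0 acts as the bias term (x_i0 = 1).
X = np.hstack([np.ones((X_raw.shape[0], 1)), X_raw])  # shape: n x (p + 1)
```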
A 2-Step Process
• Model construction
  • Use training data to find the best parameter $\boldsymbol{\beta}$, denoted as $\hat{\boldsymbol{\beta}}$
• Model usage
  • Model evaluation
    • Use validation data to select the best model
    • Feature selection
  • Apply the model to unseen data (test data): $y = \mathbf{x}^T \hat{\boldsymbol{\beta}}$
Least Squares Estimation
• Cost function (total squared error):
  • $J(\boldsymbol{\beta}) = \sum_i (\mathbf{x}_i^T \boldsymbol{\beta} - y_i)^2$
• Matrix form:
  • $J(\boldsymbol{\beta}) = (X\boldsymbol{\beta} - \mathbf{y})^T (X\boldsymbol{\beta} - \mathbf{y})$, or equivalently $\|X\boldsymbol{\beta} - \mathbf{y}\|^2$

$$X = \begin{pmatrix} 1 & x_{11} & \cdots & x_{1f} & \cdots & x_{1p} \\ \vdots & \vdots & & \vdots & & \vdots \\ 1 & x_{i1} & \cdots & x_{if} & \cdots & x_{ip} \\ \vdots & \vdots & & \vdots & & \vdots \\ 1 & x_{n1} & \cdots & x_{nf} & \cdots & x_{np} \end{pmatrix}, \quad \mathbf{y} = \begin{pmatrix} y_1 \\ \vdots \\ y_i \\ \vdots \\ y_n \end{pmatrix}$$

  • $X$: an $n \times (p+1)$ matrix
  • $\mathbf{y}$: an $n \times 1$ vector
Ordinary Least Squares (OLS)
• Goal: find $\boldsymbol{\beta}$ that minimizes $J(\boldsymbol{\beta})$
  • $J(\boldsymbol{\beta}) = (X\boldsymbol{\beta} - \mathbf{y})^T (X\boldsymbol{\beta} - \mathbf{y}) = \boldsymbol{\beta}^T X^T X \boldsymbol{\beta} - \mathbf{y}^T X \boldsymbol{\beta} - \boldsymbol{\beta}^T X^T \mathbf{y} + \mathbf{y}^T \mathbf{y}$
• Ordinary least squares: set the first derivative of $J(\boldsymbol{\beta})$ to 0
  • $\frac{\partial J}{\partial \boldsymbol{\beta}} = 2\boldsymbol{\beta}^T X^T X - 2\mathbf{y}^T X = 0$
  • $\Rightarrow \hat{\boldsymbol{\beta}} = (X^T X)^{-1} X^T \mathbf{y}$
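A sketch of the closed-form solution in NumPy (the data is illustrative; in practice `np.linalg.lstsq` is preferred over forming an explicit inverse):

```python
import numpy as np

# Illustrative design matrix (first column is the bias) and targets.
X = np.array([[1.0, 2.0],
              [1.0, 1.0],
              [1.0, 4.0],
              [1.0, 3.0]])
y = np.array([5.1, 3.0, 8.9, 7.2])

# Closed-form OLS: beta_hat = (X^T X)^{-1} X^T y.
beta_hat = np.linalg.inv(X.T @ X) @ X.T @ y

# Numerically safer equivalent: a least-squares solver.
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```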
Gradient Descent
• Minimize the cost function by moving in the steepest descent direction
Batch Gradient Descent
• Move in the direction of steepest descent:

Repeat until convergence {
  $\boldsymbol{\beta}^{(t+1)} := \boldsymbol{\beta}^{(t)} - \eta \left.\frac{\partial J}{\partial \boldsymbol{\beta}}\right|_{\boldsymbol{\beta} = \boldsymbol{\beta}^{(t)}}$, e.g., $\eta = 0.1$
}

where $J(\boldsymbol{\beta}) = \sum_i J_i(\boldsymbol{\beta}) = \sum_i (\mathbf{x}_i^T \boldsymbol{\beta} - y_i)^2$ and $\frac{\partial J}{\partial \boldsymbol{\beta}} = \sum_i \frac{\partial J_i}{\partial \boldsymbol{\beta}} = \sum_i 2\mathbf{x}_i (\mathbf{x}_i^T \boldsymbol{\beta} - y_i)$
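A minimal sketch of this loop, assuming a fixed step size `eta` and a fixed iteration budget rather than a true convergence test:

```python
import numpy as np

def batch_gradient_descent(X, y, eta=0.01, n_iters=1000):
    """Minimize J(beta) = sum_i (x_i^T beta - y_i)^2 with full-batch steps."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iters):
        grad = 2 * X.T @ (X @ beta - y)  # dJ/dbeta = sum_i 2 x_i (x_i^T beta - y_i)
        beta = beta - eta * grad
    return beta
```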
Stochastic Gradient Descent
• When a new observation i comes in, update the weights immediately (extremely useful for large-scale datasets):

Repeat {
  for i = 1 : n {
    $\boldsymbol{\beta}^{(t+1)} := \boldsymbol{\beta}^{(t)} + 2\eta (y_i - \mathbf{x}_i^T \boldsymbol{\beta}^{(t)}) \mathbf{x}_i$
  }
}

• If the prediction for object i is smaller than the real value, $\boldsymbol{\beta}$ should move toward the direction of $\mathbf{x}_i$
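The same update with per-observation steps; the shuffling and epoch count below are implementation choices, not part of the slide's pseudo-code:

```python
import numpy as np

def stochastic_gradient_descent(X, y, eta=0.01, n_epochs=10, seed=0):
    """Update beta after every single observation instead of a full pass."""
    rng = np.random.default_rng(seed)
    beta = np.zeros(X.shape[1])
    for _ in range(n_epochs):
        for i in rng.permutation(len(y)):  # shuffling typically aids convergence
            beta = beta + 2 * eta * (y[i] - X[i] @ beta) * X[i]
    return beta
```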
Other Practical Issues
• What if $X^T X$ is not invertible?
  • Add a small portion of the identity matrix, $\lambda I$, to it (ridge regression*)
• What if some attributes are categorical?
  • Use dummy variables
  • E.g., $x = 1$ if sex = F; $x = 0$ if sex = M
  • Nominal variable with multiple values? Create more dummy variables for one variable
• What if a non-linear correlation exists?
  • Transform features, say, $x$ to $x^2$
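A sketch of the ridge fix, with the regularization strength `lam` left as an assumed user choice:

```python
import numpy as np

def ridge_regression(X, y, lam=0.1):
    """Closed-form ridge: beta = (X^T X + lam * I)^{-1} X^T y.
    Adding lam * I guarantees invertibility and shrinks the weights."""
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)
```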
Probabilistic Interpretation
• Review of the normal distribution:
  • $X \sim N(\mu, \sigma^2) \Rightarrow f(X = x) = \frac{1}{\sqrt{2\pi\sigma^2}} e^{-\frac{(x - \mu)^2}{2\sigma^2}}$
Probabilistic Interpretation
• Model: $y_i = \mathbf{x}_i^T \boldsymbol{\beta} + \varepsilon_i$
  • $\varepsilon_i \sim N(0, \sigma^2)$
  • $y_i \mid \mathbf{x}_i, \boldsymbol{\beta} \sim N(\mathbf{x}_i^T \boldsymbol{\beta}, \sigma^2)$
  • $E[y_i \mid \mathbf{x}_i] = \mathbf{x}_i^T \boldsymbol{\beta}$
• Likelihood:
  • $L(\boldsymbol{\beta}) = \prod_i p(y_i \mid \mathbf{x}_i, \boldsymbol{\beta}) = \prod_i \frac{1}{\sqrt{2\pi\sigma^2}} \exp\left\{-\frac{(y_i - \mathbf{x}_i^T \boldsymbol{\beta})^2}{2\sigma^2}\right\}$
• Maximum likelihood estimation: find $\boldsymbol{\beta}$ that maximizes $L(\boldsymbol{\beta})$
  • $\arg\max L = \arg\min J$: equivalent to OLS!
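Spelling out the equivalence: taking the log of $L(\boldsymbol{\beta})$ turns the product into a sum, and only the squared-error term depends on $\boldsymbol{\beta}$:

$$\log L(\boldsymbol{\beta}) = n \log \frac{1}{\sqrt{2\pi\sigma^2}} - \frac{1}{2\sigma^2} \sum_i (y_i - \mathbf{x}_i^T \boldsymbol{\beta})^2 = \text{const} - \frac{1}{2\sigma^2} J(\boldsymbol{\beta})$$

so maximizing $L(\boldsymbol{\beta})$ is exactly minimizing $J(\boldsymbol{\beta})$.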
Matrix Data: Prediction
• Matrix Data
• Linear Regression Model
• Model Evaluation and Selection
• Summary
Model Selection Problem
• Basic problem: how to choose between competing linear regression models?
• Model too simple: “underfits” the data; poor predictions; high bias; low variance
• Model too complex: “overfits” the data; poor predictions; low bias; high variance
• Model just right: balances bias and variance to get good predictions
Bias and Variance
• True predictor $f(x) = \mathbf{x}^T \boldsymbol{\beta}$; estimated predictor $\hat{f}(x) = \mathbf{x}^T \hat{\boldsymbol{\beta}}$
• Bias: $E[\hat{f}(x)] - f(x)$
  • How far away is the expectation of the estimator from the true value? The smaller the better.
• Variance: $Var(\hat{f}(x)) = E[(\hat{f}(x) - E[\hat{f}(x)])^2]$
  • How variable is the estimator? The smaller the better.
• Reconsider the mean squared error: $J(\hat{\boldsymbol{\beta}})/n = \sum_i (\mathbf{x}_i^T \hat{\boldsymbol{\beta}} - y_i)^2 / n$
  • Can be considered as $E[(\hat{f}(x) - f(x) - \varepsilon)^2] = \text{bias}^2 + \text{variance} + \text{noise}$
  • Note: $E[\varepsilon] = 0$, $Var(\varepsilon) = \sigma^2$
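Why the decomposition holds: expand the square and use $E[\varepsilon] = 0$ together with the independence of the noise $\varepsilon$ from the estimator $\hat{f}$, so the cross terms vanish:

$$E[(\hat{f} - f - \varepsilon)^2] = \underbrace{(E[\hat{f}] - f)^2}_{\text{bias}^2} + \underbrace{E[(\hat{f} - E[\hat{f}])^2]}_{\text{variance}} + \underbrace{\sigma^2}_{\text{noise}}$$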
Bias-Variance Trade-off (figure omitted)
Cross-Validation
• Partition the data into K folds
  • Use K−1 folds for training and 1 fold for testing
  • Calculate the average accuracy based on the K training-testing pairs
• Accuracy on a validation/test dataset!
  • Mean squared error can again be used: $\sum_i (\mathbf{x}_i^T \hat{\boldsymbol{\beta}} - y_i)^2 / n$
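A sketch of K-fold evaluation, assuming a caller-supplied `fit` function (e.g., the OLS or ridge sketches above) and MSE as the accuracy measure:

```python
import numpy as np

def k_fold_mse(X, y, fit, K=5, seed=0):
    """Average test MSE over K folds; fit(X, y) must return a weight vector."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), K)
    errors = []
    for k in range(K):
        test = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        beta = fit(X[train], y[train])
        errors.append(np.mean((X[test] @ beta - y[test]) ** 2))
    return np.mean(errors)
```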
AIC & BIC*
• AIC and BIC can be used to assess the quality of statistical models
• AIC (Akaike Information Criterion)
  • $AIC = 2k - 2\ln(\hat{L})$
  • where k is the number of parameters in the model and $\hat{L}$ is the likelihood under the estimated parameters
• BIC (Bayesian Information Criterion)
  • $BIC = k\ln(n) - 2\ln(\hat{L})$
  • where n is the number of objects
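A sketch of both criteria for the Gaussian linear model of the previous slides; the log-likelihood line follows from plugging $\hat{\sigma}^2 = RSS/n$ back into the likelihood, and k is supplied by the caller:

```python
import numpy as np

def aic_bic(X, y, beta, k):
    """AIC and BIC for a Gaussian linear model with k free parameters.
    Uses the maximized log-likelihood, with sigma^2 estimated as RSS / n."""
    n = len(y)
    rss = np.sum((y - X @ beta) ** 2)
    log_lik = -0.5 * n * (np.log(2 * np.pi * rss / n) + 1)
    return 2 * k - 2 * log_lik, k * np.log(n) - 2 * log_lik
```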
Stepwise Feature Selection
• Avoid brute-force selection over all $2^p$ feature subsets
• Forward selection
  • Start with the best single feature
  • Always add the feature that improves the performance most
  • Stop if no feature further improves the performance
• Backward elimination
  • Start with the full model
  • Always remove the feature whose removal improves the performance most
  • Stop if removing any feature makes the performance worse
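A sketch of forward selection, assuming a caller-supplied `score` callback (e.g., cross-validated MSE via the `k_fold_mse` sketch above; lower is better) over a set of column indices:

```python
def forward_selection(n_features, score):
    """Greedy forward selection; score(cols) returns a cross-validated error
    (lower is better) for the model built on the given column indices."""
    remaining = set(range(n_features))
    selected, best = [], float("inf")
    while remaining:
        j, s = min(((j, score(selected + [j])) for j in remaining),
                   key=lambda t: t[1])
        if s >= best:  # no remaining feature improves the model
            break
        selected.append(j)
        remaining.discard(j)
        best = s
    return selected
```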
Matrix Data: Prediction
• Matrix Data
• Linear Regression Model
• Model Evaluation and Selection
• Summary